knowledge-database (beta)

Current group: comp.compilers

virtual machine efficiency

virtual machine efficiency  
ed_davis2
 Re: virtual machine efficiency  
Chris Dollin
 Re: virtual machine efficiency  
cr88192
 Re: virtual machine efficiency  
Chris Dollin
 Re: virtual machine efficiency  
Lars Duening
 Re: virtual machine efficiency  
John R. Strohm
 Re: virtual machine efficiency  
Rafael 'Dido' Sevilla
 Re: virtual machine efficiency  
Calum Grant
 Re: virtual machine efficiency  
cr88192
 Re: virtual machine efficiency  
Anton Ertl
 Re: virtual machine efficiency  
VBDis
 Re: virtual machine efficiency  
Chris F Clark
 Re: virtual machine efficiency  
cr88192
 Re: virtual machine efficiency  
cr88192
From:ed_davis2
Subject:virtual machine efficiency
Date:29 Dec 2004 01:49:17 -0500
I have developed a simple stack based virtual machine. I would like
it to be as efficient as possible, in terms of both speed and object
code size. Size, as in size of the object code the VM processes
(byte-code), and not necessarily the size of the VM.

But I've run into one of those situations where I can't figure out how
to have both.

All of the opcodes are one byte in size. Given the instruction:

push address

'push' takes one byte, and the address takes 4 bytes. If I pack
the code into a byte array, this takes 5 bytes. However, now
that means that address isn't on a word boundary. If I load
'address' using:

unsigned char *code;

operand = *(int *)code;

I incur a speed hit on many processors, and it may even fail on
others, since code probably isn't suitably aligned.

But if I use:

memcpy(&operand, code, sizeof(operand));

or

operand =
(code[pc]) |
(code[pc + 1] << 8) |
(code[pc + 2] << 16) |
(code[pc + 3] << 24);

I don't incur the miss-aligned speed hit, but it is nowhere as
fast as "operand = *(int *)code;" could be, if code were suitably
aligned.

I could make all opcodes a word in size, and make the code array an
array of integers. But that would dramatically increase the size of
the byte-code required by the VM.

Is there a way I can have both fast loading of operands, and
small byte-code size?
[Well, I suppose you could do halfword alignment, or you could split
the code into two separate sections, a list of bytes for the opcodes,
and a list of words for the operands. -John]
From:Chris Dollin
Subject:Re: virtual machine efficiency
Date:12 Jan 2005 22:59:08 -0500
ed_davis2 wrote:

> I have developed a simple stack based virtual machine. I would like
> it to be as efficient as possible, in terms of both speed and object
> code size. Size, as in size of the object code the VM processes
> (byte-code), and not necessarily the size of the VM.

(snips)

The last VM I built didn't allow you to embed large values into the
opcode stream. Instead each procedure had an associated constant
table, and you referred to large values by their index in the table.

Normally for loading big constants there was an eight-bit index field
(all instructions were 16 bits wide, with a 5+3 opcode, where the 3
bits might be an additional immediate value, or a sub-opcode), but
there was provision for handling the contextually-rare cases where
there were more than 256 big constants.

16-bit instructions turned out to be an interesting and occasionally
frustrating choice, as was the decision to try and combine stack-based
and register-aka-local-variable instructions. That the machine had
*two* stacks [1] - one for values, one for locals/return addresses - was
an automatic consequence of umpteen years writing Pop11 (and similar)
code.

[1] Well, two per thread.
--
Chris "electric hedgehog" Dollin
From:cr88192
Subject:Re: virtual machine efficiency
Date:14 Jan 2005 00:40:06 -0500
"Chris Dollin" wrote in message
> ed_davis2 wrote:
>> I have developed a simple stack based virtual machine. I would like
>> it to be as efficient as possible, in terms of both speed and object
>> code size. Size, as in size of the object code the VM processes
>> (byte-code), and not necessarily the size of the VM.
>
> (snips)
>
> The last VM I built didn't allow you to embed large values into the
> opcode stream. Instead each procedure had an associated constant
> table, and you referred to large values by their index in the table.

similar to mine.

> Normally for loading big constants there was an eight-bit index field
> (all instructions were 16 bits wide, with a 5+3 opcode, where the 3
> bits might be an additional immediate value, or a sub-opcode), but
> there was provision for handling the contextually-rare cases where
> there were more than 256 big constants.

my opcodes are based on words, though a little different:
6.10: length, opcode.
followed by any indices or addresses.

though in most cases optional, sometimes the length field has relevance...

> 16-bit instructions turned out to be an interesting and occasionally
> frustrating choice, as was the decision to try and combine stack-based
> and register-aka-local-variable instructions. That the machine had
> *two* stacks [1] - one for values, one for locals/return addresses - was
> an automatic consequence of umpteen years writing Pop11 (and similar)
> code.

mine also has two stacks:
the value stack;
and a kind of opaque meta-stack used for control.

locals were stored in their own space which is indirectly referenced by the
control stack (variables and similar are accessed with their own opcodes).


I like my approach personally (if albeit it uses a little extra space
sometimes).
From:Chris Dollin
Subject:Re: virtual machine efficiency
Date:15 Jan 2005 20:52:10 -0500
cr88192 wrote:

> "Chris Dollin" wrote in message

>> 16-bit instructions turned out to be an interesting and occasionally
>> frustrating choice, as was the decision to try and combine stack-based
>> and register-aka-local-variable instructions. That the machine had
>> *two* stacks [1] - one for values, one for locals/return addresses - was
>> an automatic consequence of umpteen years writing Pop11 (and similar)
>> code.

> mine also has two stacks: the value stack; and a kind of opaque
> meta-stack used for control.

> locals were stored in their own space which is indirectly referenced
> by the control stack (variables and similar are accessed with their
> own opcodes).

It's important to note that the two stacks in my VM are decoupled and
not just a convenience - the value stack doesn't get cut back just
because you return from a procedure, for example, and the value stack
can grow arbitrarily large during a procedure's execution.

It's not quite as free-and-easy as Pop11's open stack, partly because
one of the points of designing Spice [no, not that one] was to get
most of the benefits of the open-stack model without the gotchas -
although most of the constraints are imposed by code generation rather
than by the instruction set. (Although one of the procedure-entry
instructions explicitly transfers values off the value stack into the
control stack.)

--
Chris "electric hedgehog" Dollin
From:Lars Duening
Subject:Re: virtual machine efficiency
Date:30 Dec 2004 01:06:13 -0500
ed_davis2 wrote:

> [push address] takes one byte, and the address takes 4 bytes.
> If I pack the code into a byte array, this takes 5 bytes. However,
> now that means that address isn't on a word boundary. If I load
> 'address' using:
>
> unsigned char *code;
>
> operand = *(int *)code;
>
> I incur a speed hit on many processors, and it may even fail on
> others, since code probably isn't suitably aligned.

As a first step I would measure how much of a speed penalty you
actually incur by having to extract the adress data from out of the
bytecode (and yes, you will have to extract it for proper alignment).

Especially if your compiler recognizes memcpy() and compiles it into
optimized code, you might find that for example crossing the line
boundaries in your processor's L1 cache has greater impact than the
lack of alignment of single operands.

--
Lars Duening; lars@bearnip dot com
From:John R. Strohm
Subject:Re: virtual machine efficiency
Date:1 Jan 2005 18:02:44 -0500
"ed_davis2" said:
> I have developed a simple stack based virtual machine. I would like
> it to be as efficient as possible, in terms of both speed and object
> code size. Size, as in size of the object code the VM processes
> (byte-code), and not necessarily the size of the VM.

This thread has been running for a while, and a key point seems not to
have arisen.

The efficiency of the interpreter IN PRACTICE will depend HEAVILY on
the actual instruction stream fed to it.

If almost every instruction is a LOAD or STORE that specifies an
absolute address, then efficiency of the byte-code engine will be
dominated by efficiency of the absolute load and store primitives.
However, if the application that is running on the system is, say, a
pixel-cruncher, or a radar signal processor doing massive FFTs, the
absolute load and store primitives are not going to contribute very
much at all to the efficiency of the engine, AND IT IS A WASTE OF
PROGRAMMER BRAIN CYCLES TO OPTIMIZE THE COLD SPOTS.

In general, it is impossible to optimize everything absolutely.

What I would suggest you consider is adding some "index registers" or
"address registers" to your interpreter, and add "load address and
load" and "load address and store" primitives. These instructions
would load the target absolute address into one of the address
registers, and then load or store the top-of-stack through the address
register. Maybe you'd want to provide for an offset, absolute, or
even indexed.

Then whatever generates code to run on the virtual machine can
optimize e.g. constant and temporary placement, to get leverage off
the address registers and the indexed indirect load/store operations.
[Unless you plan to use a JIT compiler to translate your VM code into
machine code, I'd think the higher level the operators, the better,
to minimize the number of ops to decode. -John]
From:Rafael 'Dido' Sevilla
Subject:Re: virtual machine efficiency
Date:30 Dec 2004 00:55:59 -0500
One thing you may want to consider is altering your architecture from
a stack-based one to a memory transfer architecture. The Dis virtual
machine used on the Inferno operating system is one example. This
paper by Rob Pike and Phil Winterbottom gives some of the basic ideas:

http://www.vitanuova.com/inferno/papers/hotchips.html

It's something worth exploring I think.
From:Calum Grant
Subject:Re: virtual machine efficiency
Date:30 Dec 2004 22:48:02 -0500
ed_davis2 wrote:
> I have developed a simple stack based virtual machine. I would like
> it to be as efficient as possible, in terms of both speed and object
> code size. Size, as in size of the object code the VM processes
> (byte-code), and not necessarily the size of the VM.

> Is there a way I can have both fast loading of operands, and small
> byte-code size?

1) You could inserts nops before the op-code to ensure that the value
was always word-aligned. However this would be too slow since your
interpreter would need to handle the nops.

2) You could still align your data on a word-boundary. The interpreter
would notice the alignment of the op-code, and implicitly skip 0-3
bytes. The compiler would need to insert appropriate padding to ensure
the correct alignment.

case OP_LD:
if(IP&3) IP = (IP&~3) + 4;
operand = *(int*)IP;

3) You could have variable-width op-codes. Have 4 load instructions
with widths 1,2,3 and 4 that align the data following the op-code. But
I don't see much advantage over 2).

case OP_LD0:
operand = *(int*)IP;
IP += 4;
...
break;
case OP_LD1:
IP ++;
operand = *(int*)IP;
IP += 4;
...
break;
case OP_LD2:
IP += 2;
operand = *(int*)IP;
IP += 4;
...
break;
case OP_LD3:
IP += 3;
operand = *(int*)IP;
IP += 4;
...
break;

4) You could store values out-of-stream. But this is slightly less
efficient.

5) (copied from Anton Ertl's post) reorder op-codes and move
word-aligned data away from its op-code.

I think in the cases of 2) and 3), although there is occasional
padding, the interpreter has much less to do than in the more complex
schemes.

Calum
From:cr88192
Subject:Re: virtual machine efficiency
Date:31 Dec 2004 13:06:27 -0500
"Calum Grant" wrote in message
> ed_davis2 wrote:
>> I have developed a simple stack based virtual machine. I would like
>> it to be as efficient as possible, in terms of both speed and object
>> code size. Size, as in size of the object code the VM processes
>> (byte-code), and not necessarily the size of the VM.
>
>> Is there a way I can have both fast loading of operands, and small
>> byte-code size?
>
> 1) You could inserts nops before the op-code to ensure that the value
> was always word-aligned. However this would be too slow since your
> interpreter would need to handle the nops.

yes.
maybe useful for hw, I doubt so for interpreters...

> 2) You could still align your data on a word-boundary. The interpreter
> would notice the alignment of the op-code, and implicitly skip 0-3
> bytes. The compiler would need to insert appropriate padding to ensure
> the correct alignment.
>
> case OP_LD:
> if(IP&3) IP = (IP&~3) + 4;
> operand = *(int*)IP;

yes.

I typically realign slightly differently, eg:
IP=(IP+3)&(~3);
the check seems like a slight timewaster, it is imo better just to realign
and not worry about it than check and realign, as presumably the realign
would be needed in 75% of the calls anyways.

> 3) You could have variable-width op-codes. Have 4 load instructions
> with widths 1,2,3 and 4 that align the data following the op-code. But
> I don't see much advantage over 2).

yes, I don't see an advantage either.

> case OP_LD0:
> operand = *(int*)IP;
> IP += 4;
> ...
> break;
> case OP_LD1:
> IP ++;
> operand = *(int*)IP;
> IP += 4;
> ...
> break;
> case OP_LD2:
> IP += 2;
> operand = *(int*)IP;
> IP += 4;
> ...
> break;
> case OP_LD3:
> IP += 3;
> operand = *(int*)IP;
> IP += 4;
> ...
> break;
>
> 4) You could store values out-of-stream. But this is slightly less
> efficient.

not sure why really.
I would guess this depends on the implementation of the interpreter, and
mine typically being unoptimized enough to where it shouldn't matter (of all
things the interpreter looks up binding dynamically).

to me though this still seems like a "splitting hairs case":
p=table[ip[1]];
vs, eg:
i=*(int *)(ip+1);

or:
ip1=(byte *)(((long)ip+4)&(~3));
i=*(int *)ip;

I personally see other strong reasons for keeping values out of
stream, with the only main detractor I can think of being that one has
to keep tract of the out of band values table...

> 5) (copied from Anton Ertl's post) reorder op-codes and move
> word-aligned data away from its op-code.

imo this idea is cool, but I can imagine it being difficult to work with.
it would likely make a lot more sense, eg, if one could "packet" opcodes,
but imo this doesn't make much sense in an interpreter (it being serial and
all).

> I think in the cases of 2) and 3), although there is occasional
> padding, the interpreter has much less to do than in the more complex
> schemes.

personally I chose something like 4, imo it makes the most sense, eg,
in the dynamicly typed and dynamically compiled case, and the case
where opcodes don't really have associated types.


I think an important point is whether one views it as more sensible to have
a generic "load" operator, or "load.i2", "load.i4", "load.f8", ...


but whatever, I may just be stupid.
From:Anton Ertl
Subject:Re: virtual machine efficiency
Date:30 Dec 2004 00:57:29 -0500
"ed_davis2" writes:
>I have developed a simple stack based virtual machine. I would like
>it to be as efficient as possible, in terms of both speed and object
>code size. Size, as in size of the object code the VM processes
>(byte-code), and not necessarily the size of the VM.
>
>But I've run into one of those situations where I can't figure out how
>to have both.

You can't. The fastest techniques (read some of my papers) use more
memory than some other techniques. However, there are some techniques
that are good compromises between speed and space consumption; read:

@InProceedings{proebsting95,
author = "Todd A. Proebsting",
title = "Optimizing an {ANSI~C} Interpreter with Superoperators",
crossref = "popl95",
pages = "322--332",
annote = "Interpreter performance is optimized by combining
operators during code generation, when they are
still organized as trees. So a different, optimized
interpreter
is used for each program. Speedups of 1.8--3.1 are
achieved, but this is probably strongly dependent on
the original set of operators. The paper uses lccs
intermediate code operators \cite{fraser&hanson91a}."
}

@Proceedings{popl95,
booktitle = "Principles of Programming Languages (POPL '95)",
title = "Principles of Programming Languages (POPL '95)",
year = "1995",
key = "POPL '95"
}

@string{spe="Software---Practice and Experience"}

@Article{hoogerbrugge+99,
author = "Jan Hoogerbrugge and Lex Augusteijn and Jeroen Trum
and Rik van de Wiel",
title = "A code compression system based on pipelined
interpreters",
journal = spe,
volume = "29",
number = "11",
pages = "1005--1023",
month = sep,
year = "1999",
OPTannote= ""
}

@InProceedings{lattendresse&feeley03,
author = {Mario Latendresse and Marc Feeley},
title = {Generation of Fast Interpreters for {Huffman}
Compressed Bytecode},
booktitle = {Interpreters, Virtual Machines and Emulators
(IVME~'03)},
pages = {32--40},
year = {2003},
url1 = {http://www.complang.tuwien.ac.at/anton/ivme03/proceedings/latendresse.ps.gz},
url2 = {http://portal.acm.org/ft_gateway.cfm?id=858574&type=pdf},
abstract = {Embedded systems often have severe memory
constraints requiring careful encoding of
programs. For example, smart cards have on the order
of 1K of RAM, 16K of non-volatile memory, and 24K of
ROM. A virtual machine can be an effective approach
to obtain compact programs but instructions are
commonly encoded using one byte for the opcode and
multiple bytes for the operands, which can be
wasteful and thus limit the size of programs
runnable on embedded systems. Our approach uses
canonical Huffman codes to generate compact opcodes
with custom-sized operand fields and with a virtual
machine that directly executes this compact code. We
present techniques to automatically generate the new
instruction formats and the decoder. In effect, this
automatically creates both an instruction set for a
customized virtual machine and an implementation of
that machine. We demonstrate that, without prior
decompression, fast decoding of these virtual
compressed instructions is feasible. Through
experiments on Scheme and Java, we demonstrate the
speed of these decoders. Java benchmarks show an
average execution slowdown of 9%. Compression
factors highly depend on the original bytecode and
the training sample, but typically vary from 30% to
60%. }
}

>All of the opcodes are one byte in size. Given the instruction:
>
>push address
>
>'push' takes one byte, and the address takes 4 bytes. If I pack
>the code into a byte array, this takes 5 bytes. However, now
>that means that address isn't on a word boundary. If I load
>'address' using:
>
>unsigned char *code;
>
>operand = *(int *)code;
>
>I incur a speed hit on many processors, and it may even fail on
>others, since code probably isn't suitably aligned.
>
>But if I use:
>
>memcpy(&operand, code, sizeof(operand));
>
>or
>
>operand =
>(code[pc]) |
>(code[pc + 1] << 8) |
>(code[pc + 2] << 16) |
>(code[pc + 3] << 24);
>
>I don't incur the miss-aligned speed hit, but it is nowhere as
>fast as "operand = *(int *)code;" could be, if code were suitably
>aligned.
>
>I could make all opcodes a word in size, and make the code array an
>array of integers. But that would dramatically increase the size of
>the byte-code required by the VM.

Not necessarily. If you use superinstructions, you may have
relatively few VM instructions compared to operands; with dynamic
superinstructions you can get it down to one VM instruction per VM
branch target (but you need space for the dynamic superinstruction
code; I don't know if you consider that part of the VM, which is not
as sensitive for you).

>Is there a way I can have both fast loading of operands, and
>small byte-code size?

This particular problem has been addressed by Ogata et al. [ogata+02],
but I am not sure that I would use it if I could define my own
bytecode (they wanted to do JVM interpretation without rewriting the
JVM code).

Other techniques that come to mind:

- Use byte-sized operands where possible.

- Collect VM instructions and byte-sized operands within words, but
let word-sized operands start at the next word; this is somewhat like
the two-stream (sections) idea suggested by our esteemed moderator,
but reduces the overhead of dealing with two instruction pointers on
VM control flow. Example of such VM code (for a machine with 4-byte
words):

inst 1 | byte-operand 1 of inst 1 | inst 2 | inst 3
word-operand 1 of inst 1
word-operand 2 of inst 1
word-operand 1 of inst 2
word-operand 1 of inst 3
byte-operand 1 of inst 3 | inst 4 ...
....

Various methods for implementing such a scheme efficiently come to
mind, but I won't go into that now.

@InProceedings{ogata+02,
author = {Kazunori Ogata and Hideaki Komatsu and Toshio
Nakatani},
title = {Bytecode Fetch Optimization for a {Java}
Interpreter},
crossref = {asplos02},
pages = {58--67},
annote = {The paper presents a Java Bytecode interpreter for
the PowerPC architecture and some optimizations and
evaluates them: stack caching (a few variations),
position-based handler customization and
position-based speculative decoding (software
pipelining of the interpreter). Position-based
handler customization deals with different
alignments of bytecodes by having four states in the
interpreter for the different alignments, each state
with its own specialized copy of the
interpreter. For stack caching they evaluated a
fixed one-TOS-register organization with
write-through caching (5.6\% speedup over base), and
dynamic stack caching with two registers (3 states,
7/9% speedup over base), and used the write-through
organization for further experiments; write-through
is not compared empirically to
write-back. Position-based handler customization
buys another 19\%, and software pipelining an
additional 3.4\%. The paper also presents results on
memory traffic (both I and D).}
}

@Proceedings{asplos02,
title = "Architectural Support for Programming Languages and
Operating Systems (ASPLOS-X)",
booktitle = "Architectural Support for Programming Languages and
Operating Systems (ASPLOS-X)",
year = "2002",
key = "ASPLOS-X"
}

Followups set to comp.compilers.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/home.html
From:VBDis
Subject:Re: virtual machine efficiency
Date:30 Dec 2004 01:02:08 -0500
"ed_davis2" schreibt:
>Is there a way I can have both fast loading of operands, and small
>byte-code size?

Some ideas:

1) Align arguments to word boundaries.

You can align the arguments to the next word boundary, e.g. by
inserting NOP instruction bytes, or by SKIP1 until SKIPn instructions
that "jump" forward immediately to the next instruction byte, residing
just before the word boundary.

It also will be possible to have multiple opcodes for the same
instructions, which take arguments, with each of these instructions
including an specific increment of the PC to the next word boundary
before fetching the argument from that address. This will move the
calculation of the increment from runtime (interpretation) into
"compile" time, when the code is stored in memory.

But I assume that processing these instructions, or calculating the
address of the next word boundary, or even constructing the value from
multiple byte fetches, will take much more time than reading a
misaligned argument from memory. The excess data from the added fetch
typically will still be available (in the cache) when the next
instruction is read, so that IMO the penalty of a misaligned read is
neglectable, in detail in an almost sequential access to the data.

2) Separate arrays for byte code and word arguments.

I also thought of separate code and argument areas, as John suggests,
but then all jumps and calls must specify the addresses or indices in
both areas, what again increases the memory usage and the time to
handle the duplicate position counters. In addition more delays may
occur for fetching data from different (nonsequential) addresses.

Even if I know that on some machines misaligned reads are expensive,
does anybody know of a concrete machine where reading a misaligned
"word" will definitely take longer than reading the according number
of individual bytes???

DoDi
From:Chris F Clark
Subject:Re: virtual machine efficiency
Date:30 Dec 2004 01:05:45 -0500
ed_davis2 asked:
> I have developed a simple stack based virtual machine. I would like
> it to be as efficient as possible, in terms of both speed and object
> code size. Size, as in size of the object code the VM processes
> (byte-code), and not necessarily the size of the VM.

You'll always have to make some tradeoffs between size and space.
However, you probably can do quite well, if you note that a lot of
operands are repeated. So, are a lot of opcodes and code sequences,
but we'll concentrate on the operands first.

For example, there are probably on the order of 255 common operands in
a code sequence, so use something akin to sliding window data
compression to capture that fact. Here is one possibility:

struct operation {
char opcode;
char operand;
};

Use the operand values 0-254 to refer to the most recent 255 operands
(as appearing in the code. Use operand value 255 to indicate a "new"
operand being takens from a list of word aligned locations. When you
fetch a new operand, retire the oldest operand from the most-recently
used list. You also increment the pointer into the operand list, so
that the next "new" operand fetches a different element from the list.

address operand_mru[255]; // 0-254
address new_operand_list[]; // 255 copies 1 from here into the mrs

There are lots of heuristics you can build into the 255 most recent
operand memory. You can think of it as a cross between a register
file and a cache.

For example, you can preload the cache with the operand list. This
saves the 1st 255 uses of the "new" operand. You can also load/store
the cache on a call/return by call return basis. Note, you probably
don't want to actually copy the cache on the call/return, but instead
keep a cac he pointer. This is a place where profiling will
pay-off. Measure a couple of different strategies and keep measuring
as one tunes.

Next, since most routines do not have 255 local variables (unless
you've done extensive inlining), you can consider making the cache
smaller, and using some of the operand values to represent function
parameters (carving out 16 or even 32 function parameters out of the
cache won't significant impact the cache, and will handle almost any
calls).

You can even compress more, say your profiling suggests that you can
get away with 24 cache entries, 7 paramete locations, and 1 "new"
operand, that fits in 5 bits. Then, if can find 7 common opcodes (3
more bits(, you can pack the entire opcode and operand into 1 byte.

You can also compress the operation spcae the same way by reserving,
some number of opcodes as "compresses" opcodes, that do a table lookup
into an instruction cache. If the instrcution cache can hold
sequences of instructions, you are close to Ziv-Lempel eocnding.

Back to operand encoding, it is worth noting that the entries in the
"address" mrs et. al. don't have to be just simple pointers. You can
re-invent the concepts of indirect and indexed addressing (and even
auto-increments if it suits you) in your virtual machine. RISC
hardware architectures tended to drop these concepts. However, the
tradeoffs in a hardware implementation are different than a software
one. Again, measure and tune.

Of course, the farther you push this. the farther you get from a
simple VM.

Going the ooposite direction. One can borrow a concept that hardware
designers have used, aligning instructions with no-ops. If you want
your essential instructions to be half-word (or byte) aligned, but you
want your insructions with 32 bit operands to have the operands
algined on 32 bit boundaries. First, determine what opcode
length/position would put the operand on a 32 bit boundary. For
example, if you decide to have byte boundary instruction with a 1 byte
opcode field, putting an instruction on the 3rd byte should then align
the following opcode field. If you wnat to generate such an
instruction and you aren't on the appropriate boundary, you can either
insert no-op instructions to get to ehte deisred alginment, or pull
instructions from further in the instruction list to try to fill the
space. Note, that doing so, is potentially difficult, and whole
phases are written to "schedule" instructions to try to minimize
various penalties. Now many of those penlties don't apply, since your
VM is likely to execute each instruction in serial, rather than doing
multi-issue instructions that hardware can do.

However, as more hardare gets to be multi-issue and out-of-order, the
things which are cheap and parallel in hardware become more and more
applicable to software. Not long ago, I would not have recommended
using an 8 bit opcode and a 24 bit address in a VM, because the mask
instruction to get a 32 bit pointer out of the 24 bit address, might
have been expensive. These days I would recommend measuring the
effect before making the decision. I suspect these days on any P 4 or
better class machine, the mask is free. Of course, on the same class
of machine I bet the mis-alignment penalty is nil also. The real
measure is the amount of work one needs to do, especially in terms of
memory traffic and most importantly non-cache memory traffic.

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
From:cr88192
Subject:Re: virtual machine efficiency
Date:31 Dec 2004 13:07:09 -0500
"Chris F Clark" wrote in message
> ed_davis2 asked:
>> I have developed a simple stack based virtual machine. I would like
>> it to be as efficient as possible, in terms of both speed and object
>> code size. Size, as in size of the object code the VM processes
>> (byte-code), and not necessarily the size of the VM.
>
> You'll always have to make some tradeoffs between size and space.
> However, you probably can do quite well, if you note that a lot of
> operands are repeated. So, are a lot of opcodes and code sequences,
> but we'll concentrate on the operands first.
>
> For example, there are probably on the order of 255 common operands in
> a code sequence, so use something akin to sliding window data
> compression to capture that fact. Here is one possibility:
>
> struct operation {
> char opcode;
> char operand;
> };
>
> Use the operand values 0-254 to refer to the most recent 255 operands
> (as appearing in the code. Use operand value 255 to indicate a "new"
> operand being takens from a list of word aligned locations. When you
> fetch a new operand, retire the oldest operand from the most-recently
> used list. You also increment the pointer into the operand list, so
> that the next "new" operand fetches a different element from the list.
>
> address operand_mru[255]; // 0-254
> address new_operand_list[]; // 255 copies 1 from here into the mrs
>
> There are lots of heuristics you can build into the 255 most recent
> operand memory. You can think of it as a cross between a register
> file and a cache.

yes, this sounds pretty cool really, especially if one's oprands may be
complex or allocated types.

> For example, you can preload the cache with the operand list. This
> saves the 1st 255 uses of the "new" operand. You can also load/store
> the cache on a call/return by call return basis. Note, you probably
> don't want to actually copy the cache on the call/return, but instead
> keep a cac he pointer. This is a place where profiling will
> pay-off. Measure a couple of different strategies and keep measuring
> as one tunes.
>
> Next, since most routines do not have 255 local variables (unless
> you've done extensive inlining), you can consider making the cache
> smaller, and using some of the operand values to represent function
> parameters (carving out 16 or even 32 function parameters out of the
> cache won't significant impact the cache, and will handle almost any
> calls).
>
> You can even compress more, say your profiling suggests that you can
> get away with 24 cache entries, 7 paramete locations, and 1 "new"
> operand, that fits in 5 bits. Then, if can find 7 common opcodes (3
> more bits(, you can pack the entire opcode and operand into 1 byte.
>
> You can also compress the operation spcae the same way by reserving,
> some number of opcodes as "compresses" opcodes, that do a table lookup
> into an instruction cache. If the instrcution cache can hold
> sequences of instructions, you are close to Ziv-Lempel eocnding.

also cool imo.

could make sense for saving space, just I feel a little unsure of the
overhead or usefulness in many cases.

> Back to operand encoding, it is worth noting that the entries in the
> "address" mrs et. al. don't have to be just simple pointers. You can
> re-invent the concepts of indirect and indexed addressing (and even
> auto-increments if it suits you) in your virtual machine. RISC
> hardware architectures tended to drop these concepts. However, the
> tradeoffs in a hardware implementation are different than a software
> one. Again, measure and tune.

well, I tend to use a large number of compound instructions myself...

> Of course, the farther you push this. the farther you get from a
> simple VM.

my vm is still fairly simple, and specialized for a very specific
task. at least it is a little more general than my previous one, but
my previous one didn't even really use bytecode...

> Going the ooposite direction. One can borrow a concept that hardware
> designers have used, aligning instructions with no-ops. If you want
> your essential instructions to be half-word (or byte) aligned, but you
> want your insructions with 32 bit operands to have the operands
> algined on 32 bit boundaries. First, determine what opcode
> length/position would put the operand on a 32 bit boundary. For
> example, if you decide to have byte boundary instruction with a 1 byte
> opcode field, putting an instruction on the 3rd byte should then align
> the following opcode field. If you wnat to generate such an
> instruction and you aren't on the appropriate boundary, you can either
> insert no-op instructions to get to ehte deisred alginment, or pull
> instructions from further in the instruction list to try to fill the
> space. Note, that doing so, is potentially difficult, and whole
> phases are written to "schedule" instructions to try to minimize
> various penalties. Now many of those penlties don't apply, since your
> VM is likely to execute each instruction in serial, rather than doing
> multi-issue instructions that hardware can do.

yes...

> However, as more hardare gets to be multi-issue and out-of-order, the
> things which are cheap and parallel in hardware become more and more
> applicable to software. Not long ago, I would not have recommended
> using an 8 bit opcode and a 24 bit address in a VM, because the mask
> instruction to get a 32 bit pointer out of the 24 bit address, might
> have been expensive. These days I would recommend measuring the
> effect before making the decision. I suspect these days on any P 4 or
> better class machine, the mask is free. Of course, on the same class
> of machine I bet the mis-alignment penalty is nil also. The real
> measure is the amount of work one needs to do, especially in terms of
> memory traffic and most importantly non-cache memory traffic.

err, I sort of doubt masking and misalignment are "free" per-se, but they
are not that expensive. afaik, even on older processors masking has been
pretty cheap as well.

yeah, keeping the working set small is an issue, especially with a
dynamically typed language that tends to tear through a lot of heap, but
then again, at this point I don't think cache misses are the problem...

at least in some very specific cases my interpreter can run in constant
memory, which is something at least...
From:cr88192
Subject:Re: virtual machine efficiency
Date:30 Dec 2004 01:03:23 -0500
"ed_davis2" wrote
>I have developed a simple stack based virtual machine. I would like
> it to be as efficient as possible, in terms of both speed and object
> code size. Size, as in size of the object code the VM processes
> (byte-code), and not necessarily the size of the VM.
>
> But I've run into one of those situations where I can't figure out how
> to have both.

one is limited really.

description of problems of mixing addresses/literals with opcodes


I once started running into this issue, along with the occurance that
mixing addresses in with bytecode is just imo bad. also, there was
the issue that 256 just seems a bit small for an opcode space, and
typical chaining approaches are imo ugly.

So, how I did it was to define the bytecode as an array of unsigned
shorts. each opcode has a layout, namely the lower 12 bits give the
opcode number and the upper 4 give the length. this is followed by
any oprands or such (up to about 30 bytes).

Now, I did not embed values directly into most opcodes, instead, I
defined another structure as a kind of out of band index table.

Keeping the table out of band has a few uses: I can put whatever in it
and not worry about type; I am a lot more free of the
compiletime/runtime issue wrt immediates; it avoids unnesessary
duplication of literals; it makes operations that can recieve
different literal types a lot more straightforward; it makes gc or
storing/transmitting code more straightforward; ...

So, pretty much all the literals in bytecode are shorts (though some
are signed, but this is a seperate issue). I divide up my bytecode
code at the function level, and more or less assumes that no single
function will exceed about 64kB (yes, jumps are short aligned, but are
limited to +-32k shorts).

similarly, my opcodes are often fairly high-level and abstract as
well, but this depends a lot on the opcode.

or such...
   

Copyright © 2006 knowledge-database   -   All rights reserved