Of the many attempts to build Smalltalk Computers, the early ones depended heavily on microcoded architectures. For many years the Dorado was considered the ideal personal computer for Smalltalk, though the fact that it cost over $100 thousand to build and was so hot and loud that it had to be kept in a separate room meant that not many people got to actually use one.
The switch from bipolar to MOS technology for processors made them slower than main memory. In this environment dense instruction encoding no longer was important and simple hardwired architectures in the form of RISC won the day. Clock speeds for the new CMOS processors increased rapidly and soon zoomed past what even the expensive ECL machines had achieved and they just kept going up until recently. Memory density also increased exponentially but its speed is not that much faster than in the old days. So it seems that the time has come to once again consider microcode.
The fourth part of the "blue book" ("Smalltalk-80: The Language and Its Implementation") is a complete description of the Smalltalk virtual machine in the form of a Smalltalk program. Though actually executable, it was written in non object oriented style meant to be easy to translate into Pascal or C (which several people have done over the years) and to be understandable.
Several chapters of the "green book" ("Smalltalk-80: Bits of History, Words of Advice") describe the difference between the specification from the blue book and practical implementations of Smalltalk virtual machines. It is particularly interesting to read Chapter 7 (pages 113 to 126), which describes the Dorado implementation, while looking at the machine description and microcode listing. Some key parts were done in microcode while the rest was implemented in Alto machine language (a variation of the Data General Nova machine language) which was also implemented in microcode. Alternative schemes were discussed.
The "Back to the Future" paper describes the history and implementation of Squeak. There is a mostly complete description of the whole system in Smalltalk itself. What is missing can easily be seen by searching for the senders of #cCode: and friends. Things like the actual code for the sin primitive were not written and either punt to the underlying Smalltalk implementation or depend on a linked C library to provide the actual code.
The system described in the paper has a normal Smalltalk virtual machine and image and inside that you run the Squeak VM code written in Smalltalk which uses one large Array object inside of which is the Squeak image. So there are two VMs and two images. This is used for development and debugging - for normal use the Squeak VM code is translated into C, compiled into a native code standalone application.
The underlying Smalltalk VM and image in the development mode had to be a non Squeak system in the very first bootstrapping pass (obviously) but now they can be Squeak itself. The underlying image can be extremely reduced if it is going to only be used for this application and nothing more. The application can be divided into three parts: the bytecode interpreter, the memory system and the primitives. An interesting variation would be to short circuit the bytecode interpreter so that the underlying VM's interpreter is used instead while keeping the other two parts as they are. This would significantly speed up the simulated Squeak. If this were done on a machine whose native instruction set happens to be Squeak bytecodes, then the simulated Squeak would be fast enough that it could be used as the actual Squeak seen by the user. The underlying Squeak would be hidden away as part of the machine's firmware - the very reduced system image could even be stored mostly in ROM. In theory this system image would need its own VM to be able to run, but in practice a clever scheme could allow it to share the VM inside itself with the user image. This Escher painting shows how confusing supporting an image through itself can be:
SiliconSqueak can fetch either bytecodes from the instruction cache or 32 bit wide "microcode" instructions. The bytecodes are transformed into either a single microcode instruction or a call to a sequence of them, depending on how complex its operation is.
There is only one format for microcode:
The tag, uC and op fields relate to the ALU.
The uX and rX pair of fields select where the A and B inputs of the ALU come from and the destination of its output.
The unit encoding is as follows:
The last encoding for uB is not really a unit but just immediate values instead.
The fetch unit is controlled by the the cond bit, which indicates conditional execution, the ifTrue bit to select the sense of the tests and the #pc bit to indicate that the new PC is to be formed from the 32 bit word after the current microinstruction.
|0||0||0||next||execute the microinstruction which follows this one|
|0||0||1||call||save the current PC and go to the indicated address|
|0||1||0||fetch8||get the following bytecode and either translate it into a microinstruction using hardware or jump to one of 256 addresses, depending on the configuration registers|
|0||1||1||fetch4||combines the indicated address with the top 4 bits of the following byte to jump to one of 16 addresses|
|1||0||0||skipFalse||jump over the following microinstruction when the condition is false|
|1||0||1||jumpFalse||go to the indicated address when the condition is false|
|1||1||0||skipTrue||jump over the following microinstruction when the condition is true|
|1||1||1||jumpTrue||go to the indicated address when the condition is true|
Note that for Squeak the fetch4 wastes memory since only the top 2 or 3 bits of the extended bytecodes are used, so a full 256 byte area will include two or four copies of each instruction sequence. But this allows the hardware to be used with other languages (the option to replace the hardware decoding for fetch8 also helps). Another thing to note is that the PC is one of the context registers and can be the destination in any instruction, so there are many more control flow options than implied in the above table.
Each unit has its own state and can do simple operations, like addition, locally without involving the ALU. Most operations take just one clock cycle, but a cache miss can cause the whole microinstruction to be delayed by quite a few clock cycles.
This gives direct access to the arguments and temporary variables of the currently executing method. t0 is always the receiver while t1 is the first argument if there was one. The current top of the expression stack is always available as t31 (store) or as t30 (pop), and in the latter case the stack pointer is decremented after use.
The first 32 words of the receiver in the currently executing method (t0) can be accessed as i0 to i31.
There are four stream subunits, each represented by 8 registers: s00 to s07, s10 to s17, s20 to s27 and s30 to s37.
|sX2||base||points to the word with the first valid index in an object that holds the contents of the stream.|
|sX4||reset||will cause the pointer to the word with the first valid index in that object to appear in the base and index, the step to be set to 1 and the limit to the last word in the object just stored.|
|sX5||atEnd||when the limit is reached the index wraps around and the status register sX5 goes from false to true.|
|sX6||next||words can be read or written to the stream.|
|sX7||nextByte||bytes can be read or written to the stream.|
Besides being used for streams, this unit can efficiently implement array access and can also be used to access instance variables or method literals beyond the first 32.
These registers control the context of the operation of the other units.
The frame registers must be saved to the return stack on every message send and restored on every return.
|x00||self||normally has the same value as t0, but is different in the case of block methods. It is the base for i0 to i31.|
|x01||method||is the currently executing method. It is the base for f0 to f31 as well as for the byte stream.|
|x02||instruction pointer||either points to a bytecode in the current method or, if it is negative, points to a word in the virtual L2 intruction cache.|
|x03||context||either is nil or it points to a context object associated with this frame.|
|x04||frame pointer||indicates the offset from the start of the data stack of t0.|
The thread registers must be saved/restored when switching between one thread and another.
|x08||stack pointer||indicates the offset from the start of the data stack of the top element on the evaluation stack.|
|x09||return pointer||indicates the offset from the start of the return stack of the topmost saved value of the previous frame.|
|x10||dblockHigh||a 7 bit value indicating the current high data block in the stack cache.|
|x11||dblockLow||a 7 bit value indicating the current low data block in the stack cache|
|x12||rblock||a 7 bit value indicating the current return block in the stack cache|
|x13||savedIP||the value of the instruction pointer before the execution of a call.|
A single hardware node can be used for several "images". As has been described above, at least a system and a user image is typically used, though with two or more processors it would be possible to have each node deal with a single image. In the case where a node does need to switch between images, however, these registers need to be saved/restored.
|x14||bcTable||the offset of a 4KB region in the virtual L2 instruction cache with a block of 16 bytes corresponding to each possible bytecode. One bit indicates that the built-in hardware decoder should be used.|
|x15||pageTable||the physical address of a table used to map image addresses to physical addresses.|
These registers allow low level access to the hardware or have temporary variables that don't need to be saved.
|x16||stack cache pointer||selects a word in the stack cache.|
|x17||stack cache data||reads or writes the word selected by the previous register and increments that register.|
|x18||data cache pointer|
|x19||data cache data|
|x20||instruction cache pointer|
|x21||instruction cache data|
|x22||ring data||is used to directly inject data into the ring network or to fetch data from it.|
|x23||ring last/status||is used to inject the last word of a pack into the ring network or else to read the status of the network.|
|x24||rTop||the top of the return stack. Reading from it causes a "pop" while writing to it a "push".|
|x25||current byte||is the value pointed to by the instruction pointer in the current method.|
|x26||next byte||increments the instruction pointer.|
|x27||next word||increments the instruction pointer by 4.|
|x05-x07||scratch||is used for such things as remembering the selector of a message or the number of arguments.|
This can fetch either bytecodes or microinstructions, and when used as registers f0 to f31 allow access to the literals in the currently executing method.
When the tag bit is zero, the "raw" 32 bits are used as in other processors. When it is one, the values of the inputs and the result must be proper SmallIntegers or all side effects of the instruction will be cancelled (writing to the destination register, changing the stack pointer or the index in the stream units) and a call to a known address (what is normally called a "trap") will happen instead.
Besides the 32 bit result that the ALU sends to the D bus, it also generates a 1 bit result which is used for control flow.
The uC field selects which of the eight possible combinational subunits (some might include pipeline registers and so not be strictly combinational) is used to compute the result. The bits of the op field mean different things to each of these subunits:
In the FPGA version some of the subunits can be defined for specific applications. A floating point unit is a good example of things that can be added to the list of standard subunits in the future.