The processor needed a language that was higher level than assembler but was simple enough to understand in it's entirety (including the compiler) and yet with minimal overhead over assembly language.   I'm writing that language right now, it is called Third as it's a plausible prequel to Forth.   For a great introduction to this language see

The main advantage of Third and Forth is that code can be written as a list of subroutine addresses that are called functions (Forth calls them words).  Specifically, we use a direct threaded interpreter.   This is compact and efficient, all that has to be done to exit one function and start executing the next is to call the code:

P = *E++  # execute the NEXT code

Functions may be written in assembler or Third.   If they are written in Third then an initial few lines of assembler are needed to start the Third interpreter on the upcoming list of function addresses.

*C = E--      # docolon: Push execution pointer onto call stack
 E = P + 1    # docolon: Execution pointer to start of labels
 P = *E++     # docolon: Execute function1

The last function address is that of return (Forth calls this exit).

The main differences between Forth and Third are:

 Forth Third
 Is a compiler, editor, operating system and filing system Is just the compiler
 Is purely stack based Uses the stack for input and output of a function but can also take static arguments from the code
 Has standards Is tailored and unique to this processor
 Is designed to be complete Is complete enough for this processor
 quoting of constants is implicit (inserting literal) All constants have to be quoted with ' (or 2', 3' or 4')
 Has multi-word control flow All control flow operators are single word
  Has anonymous functions (mostly used for control flow)

  • Functions/subroutines are called functions (Forth calls these words)
  • The stacks are the data stack and the call stack (Forth calls these the stack and the return stack)
  • Functions that operate on multiwords are prefixed by the size, e.g. 4dup duplicates 4 words on the stack (e.g. a 64 bit number)
  • Functions consume their arguments by default (unless it's clear they don't like pick) unless they start with the letter q
  • Functions that start with the letter q query the stack and do not consume, e.g. q4signword leaves the 4 word on the stack and adds the sign word (executes pick(4))

Third may be called from assembler very efficiently:
  •  E = P + I          # docolon: Execution pointer to start of labels
  •  P = *E, E += I  # docolon: Execute first function
  • :function1
  • :function2
  • :quit
  • E = A + B         # whatever code you want to follow
Where quit is simply:
  • quit:
  • P = E
Of course,  you still have to save E on the call stack at some point (*C = E, C -= I) and retrieve it later if this code was called from Third and you want to return there.


The other '80's language for micros is FORTH.   I always wanted to learn but somehow always had something better to do with my time (maybe the same should be true today).

Brad Rodriguez has written an excellent explanation of Design Decisions in the Forth Kernel which is well worth reading.

Implementing a direct threaded interpreter is quite efficient, but firstly some terminology.   A native function is one written in assembler, just a sequence of instructions to execute (ending in NEXT not RETURN, which is explained below).   A derived function starts with a small amount of native code (called ENTER) and is then followed by a number of addresses each of which is the address of another function (either native or derived).   Derived functions end in the address of EXIT.

Implementing this needs three special registers, which are called FP, SP and DS for now (later these will be mapped to real registers, e.g. C, D and E)
  • FP is the function pointer, it points to the address of the next function to be run.  It operates as an ascending list.
  • FS is the function stack, it keeps FP when another derived function is called.  Here it is implemented as an descending stack.
  • DS is the data stack, the location of the data which is popped and pushed on each function call.  Again it is implemented as an descending stack.
This leaves two registers, (probably A and B) available for general use.   It's tight but doable.   Long functions should store FP to get a register.

Forth has three bits of code which together are called the inner interpreter, these are NEXT, ENTER and EXIT.

NEXT is the code that ends every native function by jumping to the next one in the list.   You can think of it as a RET from this subroutine and a JSR to the next one.  It is always appended to every native function (it's too small to be worth jumping to).   This is very efficient, so provided programs are shallow (not too many nested calls until you get to native code) then this FORTH implemmntation should be very efficient (c.f. indirect threading).   As I don't intend to write huge programs, they should be shallow and efficient.
  • P = *FP++    # read the next function in the list and jump to it, setting up FP to point to the next function
ENTER (also called DOCOL or DOCOLON) is the native code that starts every derived function.   It sets up the interpreter to read function addresses from the memory after this function address (the code below is only three words so there's no point in calling a subroutine with the address of the code to run).
  • *FS-- = FP    # push FP onto the stack as we are starting a new function and don't want to lose it.
  • FP = P + 1    # set up FP for this function to point to immediately after this code
  • P = *FP++     # jump to the first of the function addresses in this derived function (NEXT)
EXIT is the last function in a derived function, it pops the context from a stack and calls NEXT on the layer below.
  • FS = FS + 1
  • FP = *FS  # pop FP from FS
  • P = *FP++    # execute the NEXT code
For example  ADD is a native function
  • A = *DS++      # pop the top of the data stack to A
  • B = *DS          # pop the top of the data stack to B
  • A = A + B       # do ALU function
  • *DS = A         # store the result on the top of the stack
  • P = *PF++     # NEXT
The instruction set has the limitation that instructions can only make one memory access.   So given that the operands are in main memory (and not in registers) then it is always going to take two reads, one write and one RETURN, so four instructions are the best that's possible and FORTH has little overhead here.

MUL is somewhat tedious, especially in the loop unrolled version:
  • A = *DP++     # first argument to MUL
  • B = *DP++     # second argument to MUL
  • *SP-- = C       # push FP onto the stack as we need the space
  • *SP   = D       # push DP onto the stack as we need the space
  • C = Z             # result - initialise to zero
  • D = U             # bit mask - initialise to 1
  • Z = A | D        # test lower bit
  • gt0 C = C + B # if lower bit set add in A (shifted)
  • B = B + B        # shift up bit mask
  • D = D + D       # shift up bit mask
  • Z = A | D         # test next lowest bit
  • gt0 C = C + B # if bit set add in A (shifted)
  • ...
  • gt0 C = C + B  # if top bit set add in A (shifted)
  • *--DP++ = A    # store the result on the top of the stack
  • D = *SP++      # reclaim DP
  • C = *SP          # reclaim FP
  • P = *FP++      # NEXT

It's going to be tight on registers for doing more complicated things.  Need to look at some of the FORTH implementations to see how much of a limitation this is.   The minimal transistor count constraint meant that there weren't going to be spare registers and so spilling to a data stack was inevitable.   It all looks quite efficient.

eForth seems to be a good minimal FORTH to get running first.   Then extend it using JONESFORTH.