Compiling a subset of JavaScript to ARM assembly in Haskell

~/blog/js-to-asm-in-hs

Published:


I recently got a copy of the book Compiling to Assembly from Scratch by Vladamir Keleshev, which details how to write a compiler for a subset of JavaScript to 32-bit ARM assembly code. The choice to use ARM assembly is mainly for its simplicity in comparison to x86.

Keleshev elects to use TypeScript to write the compiler, which, as he explains in the preface, is largely a compromise for its familiar C-like syntax while still providing many functional programming features desireable for writing a compiler. However, I chose to write my version of the compiler in Haskell as it's my favorite language, and the focus on functional programming from Keleshev makes it a natural choice for the translation.

Overall, I had a lot of fun writing this compiler as I got to learn more about the nitty-gritty low-level of how code is executed while getting more practice with Haskell. In this post, I won't cover every detail of the design of the compiler, but I'll try to hit on what I found to be the most important or interesting aspects of the code.

The Abstract Syntax Tree

The first step in writing our compiler is to define a data structure that abstractly represents the syntax of our source language, JavaScript. Once we convert the program source text to this internal representation (through parsing), our job will become to recursively traverse this tree.

We represent the abstract syntax tree (AST) in Haskell as a large sum type that looks like this:

data Expr
  = Number Integer
  | Identifier Text
  | If Expr Expr Expr -- condition, consequent, alternate
  | Add Expr Expr -- left, right
  ... -- other expression types here
  deriving (Eq, Show)

For each type of expression we want to represent, like number literals, identifiers or if expressions, we add a type constructor for that expression to Expr. With this data structure ready, we can now move on to parsing.

Parsing

The parser is the first step in our compiler pipeline. In the book, Keleshev chooses to write and use a small parser combinator library for his parser. Those familiar with Haskell will know this actually happens to be a real specialty of the language and the community, so translating this portion of the compiler was particularly natural.

I chose to use the megaparsec parser combinator library rather than writing my own as Keleshev does (though that would have been a fun exercise in itself). Parser combinators are small primitive functions that compose together in order to write a high-level parser that closely aligns with the prescribed grammar for the language.

The main difference between the Haskell and TypeScript code in this section is Haskell's support for monadic programming, which Keleshev humbly tries to imitate in TypeScript. An example should illuminate the difference here.

One part of the parser is for if statements. In TypeScript, the parser looks like this:

let ifStatement: Parser<AST> =
  IF.and(LEFT_PAREN).and(expression).bind((conditional) =>
    RIGHT_PAREN.and(statement).bind((consequence) =>
      ELSE.and(statement).bind((alternative) =>
        constant(new If(conditional, consequence, alternative)))));

This parses a sequence of tokens: the if token, then a left paren, then some expression which we bind as the conditional, then a right paren, then the consequence and alternative statements. We translate this code directly to Haskell using do-notation, which implicitly adds the and and bind higher-order functions used by Keleshev:

ifStatement :: Parser Expr
ifStatement = label "if statement" $ do
  _ <- ifToken
  condition <- parenthesized expr
  consequent <- statement
  _ <- elseToken
  alternate <- statement
  pure (If condition consequent alternate)

Here, since we don't need the parsed text from ifToken or elseToken, I just bind them to the empty identifier, _. This parser also uses a helper function called parenthesized, which takes a parser and outputs the same parser that matches only between two parentheses, so

parenthesized :: Parser a -> Parser a
parenthesized = between leftParen rightParen

Personally, I find the do-notation to be very helpful in increasing the readability of code like ifStatement which is written using monads, since it allows us to flatten the code and make it clear the order in which we are binding intermediate results.

There are other benefits to using an established parser combinator library like megaparsec rather than rolling our own. For example, one small part of the parser is parsing the list of arguments in a function call. For example, in a function call like

foo(1, 2 + 2, a);

we want a parser consumes this comma-separated list of expressions and returns Parser [Expr], which is a list of expressions wrapped in the carrier Parser type. Specifically we should parse the arguments as the list

[Number 1, Add (Number 2) (Number 2), Identifier "a"]

Keleshev's TypeScript to parse this kind of expression looks like this:

let args: Parser<Array<AST>> =
  expression.bind((arg) =>
    zeroOrMore(COMMA.and(expression)).bind((args) =>
      constant([arg, ...args])))
  .or(constant([]))

Essentially, this parser either finds one expression followed by zero or more comma separated expressions, or nothing, in which we return an empty list.

In Haskell, this job is easier, since megaparsec provides a useful sepBy combinator that does this for us. So we could instead write

args :: Parser [Expr]
args = expr `sepBy` comma

where expr parses any expression and comma parses a comma token with optional whitespace. But this parser is simple enough that in my code I keep this inline in the parser for function calls.

Once you get used to the style and idioms of parser combinators, the job of converting from a grammar to a parser for that grammar becomes fairly mechanical, yet megaparsec still gives great error messages and performance.

Code Generation Framework

With the parser set, we can now take a basic JavaScript program and translate it into the corresponding AST represented by our Expr type. Now we need to process that Expr and produce a text file of assembly code which we can then assemble and execute.

This suggests that our emit function to produce the assembly source should be

emit :: Expr -> Text

But, while we do want to output some text, we need to track some effects at the same time, namely compile time errors in the code and the state of variable bindings and other values.

Therefore, we take a traditional approach to structuring a Haskell program which is to build a stack of monads in which to perform these effects, and output that from our emit function instead. First, though, we need some data types that will be tracked in these effects.

We have a simple CompilerError data type that enumerates the possible compiler errors and an associated error message:

data CompilerError
  = Unsupported Text
  | UndefinedVariable Text
  deriving (Eq, Show)

Right now we have only two compiler errors, Unsupported for unimplemented features and UndefinedVariable for, well, undefined variables.

Then we have a Context data type that tracks three pieces of information we need throughout the code generation:

data Context = Context { 
  locals :: Map Text Int, 
  labelCounter :: Int, 
  nextLocalOffset :: Int
}

First is locals, a map of our variable bindings to their offset location in the stack. Second is labelCounter, which is a counter we need to generate unique labels that don't clash with each other, and lastly nextLocalOffset tracks where in the stack the last variable binding is located.

Finally, we want to progressively build up Text with the assembly source, so the most efficient way is to do this is to use the Builder type from the text package. So we have a simple type alias:

type AsmSrc = Builder

With all of that in place, we can finally build our monad transformer stack:

newtype Emit a = Emit {unEmit :: StateT Context (WriterT AsmSrc (Except CompilerError)) a}
  deriving (Functor, Applicative, Monad, MonadState Context, MonadError CompilerError, MonadWriter AsmSrc)

This is a computation that tracks three effects: the state of the code generation with StateT Context, any compiler errors with Except CompilerError, and the generated text builder, in WriterT AsmSrc. By using the monad transformers from mtl, we can access all of these effects in one data type, which we call Emit.

Then we need some way of pulling out either the compiler error or the generated code, so we define the following function do so:

execEmit :: Context -> Emit a -> Either CompilerError AsmSrc
execEmit state emit = runExcept (execWriterT (execStateT (unEmit emit) state))

which unwraps each monad from the inside-out.

With that framework set up, we can finally get around to defining our emit function, which now has the type

emit :: Expr -> Emit ()

meaning that it takes in an expression, and returns an Emit () computation. There is no value to wrap in that computation, since this procedure is all "side-effects," namely tracking state, throwing errors, and building up the generated assembly.

We add a short helper procedure to add a line to our assemply source, which we will use often:

addLine :: Text -> Emit ()
addLine ln = tell (Builder.fromText (ln <> "\n"))

This function uses tell to append the given text with a newline to the Builder that lives in our WriterT monad. With this we can write

addLine ".global main"

to add that line of text to our assembly.

ARM-32 Assembly Code

Now that we have this monadic framework set up, how do we actually generate aeembly code? The general idea is to visit each node in our AST and emit some assembly code based on the type of that node. Specifically, we generally put intermediate values in the program into the "argument registers", which in ARM assembly are the first four registers, r0 through r3. Let's look at some examples:

Booleans

To compile booleans, we place the number 1 into the first register, r0, if the value is true, and otherwise place the number 0 into r0. In our emit procedure, that looks like this:

emit :: Expr -> Emit ()
emit expr = case expr of
  Boolean val ->
    if val then
      addLine "  mov r0, #1"
    else
      addLine "  mov r0, #0"

This means that our representation for true and false is essentially identical to just using 1 and 0, we aren't distinguishing those types. Also note that for readability of the generated assembly, we are manually moving over instruction lines by two spaces at the start of each text literal. Now let's look at numbers.

Numbers

We only deal with integers in this toy compiler, so to compile them, we just put the value of integer into r0.

-- in emit :: Expr -> Emit ()
Number val ->
  addLine ("  ldr r0, =" <> toText val)

We use the pseduo-instruction ldr (short for load register) here instead of mov in case that the number does not fit into the relatively small value of the immeditate. With that literal assumed to be in r0, we can access it when processing another AST node such as Add.

Infix Operators

Compiling an infix expression like 3 + 4 is a bit more complicated. Our first thought would be to put the left value in r0 and the right value in r1, then add them, but that would fail if the left expression was more complicated and could not fit entirely in r0, such as some nested expression like (5 + 3) + 2. In that case it would clobber the right expression stored in r1.

Instead, we have to make use of the stack. Our strategy will be to emit the left expression, push it onto the stack, then emit the right expression, and pull the left back into r1. Then even if we have a nested expression they won't overwrite each other in memory.

There is an extra complication to using the stack, however. When the call stack interacts with external interfaces, such as calling a libc function (which we will use to provide basic functionality such as printing characters), the stack pointer (the address of the start of the stack) must be aligned at the 8-byte boundary. For simplicity, Keleshev elects to always maintain this invariant when utilizing the stack. This means if we want to push a single word onto the stack, we must also push some dummy value, such as the ip register, as well.

In the end, our assembly code for Add looks like this:

Add left right -> do
  emit left
  addLine "  push {r0, ip}"   -- push left (r0) onto stack with alignment 
  emit right
  addLine "  pop {r1, ip}"    -- restore left result from stack into r1
  addLine "  add r0, r1, r0"  -- add r1 and r0, store result in r0.

In fact, the code to compile other infix operations like subtraction, multiplication, and equality comparison would be identical except for the last line. Therefore, we can refactor this to write a more general emitInfix function, which takes some Emit action to place after the standard code for working with the left and right expressions:

emitInfix :: Expr -> Expr -> Emit () -> Emit ()
emitInfix left right action = do
  emit left
  addLine "  push {r0, ip}"  -- push left (r0) onto stack with alignment 
  emit right
  addLine "  pop {r1, ip}"   -- restore left result from stack into r1
  action                     -- perform action on r0 and r1

Then we can rewrite the emit case for Add simply as

Add left right -> emitInfix left right $
  addLine "  add r0, r1, r0"

The equality infix operator == works the same way, except instead of adding the two values, we compare them with the cmp instruction, returning 1 if they are equal and 0 otherwise. It looks like this:

Equal left right -> emitInfix left right $ do
  addLine "  cmp r1, r0"    -- compare left to right
  addLine "  moveq r0, #1"  -- if equal, store 1 in r0
  addLine "  movne r0, #0"  -- otherwise store 0 in r0

The instruction moveq only executes if the result of cmp r1, r0 was that they were equal, and the opposite for movne.

If Statements

We can make use of the same kind of conditional logic we saw in the implemenation of the equality operator to compile if statements.

Here, we also need the concept of labels, which are simply named blocks of assembly that we can branch to from some other part of the code. Essentially, to compile an if statement, we create a label with the falsey-branch and a label for the code that runs after the if statement. Then we emit the condition expression and check if it is truthy. If it is, we continue onto the consequent block, and otherwise branch to the alternate block. Afterwards, we always branch to the rest of the code.

We can name the labels anything, but we need to be careful because we need them to be unique so that if there are two if statements in a program, the labels do not conflict with each other. Therefore we carry around a counter in our Context that we use to create a unique label, with this makeLabel function:

makeLabel :: Emit Text
makeLabel = do
  incrementLabelCounter
  counter <- gets labelCounter
  pure (".L" <> toText counter) -- prefix "L" is standard for code generation 

incrementLabelCounter :: Emit ()
incrementLabelCounter = modify (\s -> s {labelCounter = labelCounter s + 1}) 

The helper function incrementLabelCounter simply adds one to the label counter and adjusts the state in the Emit monad. Now here is the full code for the If case of the emit function, where the b instruction is short for "branch":

If condition consequent alternate -> do
  ifFalseLabel <- makeLabel -- unique label for falsey branch
  endIfLabel <- makeLabel   -- unique label for after if is evaluated
  emit condition            -- emit condition code
  addLine "  cmp r0, #0"    -- check if condition is false
  addLine ("  beq " <> ifFalseLabel) -- if so, branch to ifFalse label
  emit consequent           -- emit consequent, executed only if we do not branch to ifFalse 
  addLine ("  b " <> endIfLabel) -- branch to endIf label either way
  addLine (ifFalseLabel <> ":")  -- define ifFalse label
  emit alternate                 -- ifFalse label contains alternate code
  addLine (endIfLabel <> ":")    -- define endIf label with whatever comes next

As described above, we create two unique labels, then arrange the code as necessary within these new labels.

Variable Bindings

Next up in our whirlwind compiler tour is variable bindings. In our Context we store a Map called locals where the keys are the names of the variables and the values are their offset from the current frame pointer fp. Our plan is to store local variables in the current stack frame, and when we need to retrieve one, we lookup its name in locals and reference that memory location.

The only complication is that again, we store all the variables at two-word, 8-byte boundaries in the stack. To do this, we keep track of the nextLocalOffset, which is a running count of where to put the next local variable. In the end our emit case for Var looks like this:

Var name value -> do
  emit value
  addLine "  push {r0, ip}"
  Context{locals, nextLocalOffset} <- get
  setLocals (Map.insert name (nextLocalOffset - 4) locals)
  setNextLocalOffset (nextLocalOffset - 8)

Here we just emit the value into r0, push it onto the stack with alignment, then update our locals map and set the next local offset down 8 bytes. This is because the stack actually grows down in memory from the stack pointer.

Identifiers

Now we need to implement the reverse, pulling the value for a variable identifier. To do this, we implement a higher order procedure called withVarLookup. This function takes the name of a variable and a function that takes its offset as an input for some Emit action. After we look up the offset of the parameter, we pass it to this given function, and if the name does not exist in locals we throw an error:

withVarLookup :: Text -> (Int -> Emit ()) -> Emit ()
withVarLookup name withOffset = do
  env <- gets locals
  case Map.lookup name env of
    Just offset -> 
      withOffset offset
    Nothing ->
      throwError (UndefinedVariable name)

With this we can implement the emit case for Identifier, which is an AST node which holds the name of a variable:

Identifier name ->
  withVarLookup name $ \offset ->
    addLine ("  ldr r0, [fp, #" <> toText offset <> "]")

So to compile an identifier, we look up the name of the variable, then load r0 with the value of that variable, which is offset from fp. That bracket syntax allows us to reference a memory address, so for example [fp, #-4] would reference the memory location at the frame pointer minus 4 bytes. This is exactly what we want, except pluggin in the appropriate offset.

Function Calls

Even though we haven't yet implemented function definitions, we are going to first implement function calls. A function call looks like foo(1, 2) in JavaScript and is parsed into a Call node by our parser which has two components: the name of the callee and the list of arguments.

In ARM assembly, the convention is that the first four registers are for function arguments. If we have more than four arguments, they must be put on the stack, and for simplicity this difficulty is avoided in the baseline compiler in Keleshev's book.

In our compiler, we must still use the stack, however, since if we try and move the arguments directly into the registers they could clobber each other. So we will allocate memory on the stack, then emit each argument and store it in the appropriate spot. Then we pop the values back into the argument registers and call the bl instruction to "branch and link" to the name of the callee. In the end it looks like this:

Call callee args -> do
  let count = length args
  if count == 0 then
    addLine ("  bl " <> callee) -- branch and link to function name
  else if count == 1 then do
    emit (head args) -- if one arg, just emit it
    addLine ("  bl " <> callee)
  else if count >= 2 && count <= 4 then do
    addLine "  sub sp, sp, #16" -- allocate 16 bytes for up to four args
    forM_ (zip args [0..4]) $ \(arg, i) -> do
      emit arg
      addLine ("  str r0, [sp, #" <> (toText (4 * i)) <> "]") -- store each arg in stack, offset in multiples of 4 
    addLine "  pop {r0, r1, r2, r3}" -- pop args from stack into registers
    addLine ("  bl " <> callee)
  else
    throwError (Unsupported "More than 4 arguments not supported")

We have an optimization for functions of zero or one arguments, in those cases there is no need to use the stack. Otherwise we do as described above, and throw and Unsupported error if there are more than four arguments.

Function Definitions

Next we will actually be able to use these function calls by implementing function definitions. Defining a function has three main parts: the prologue, the function body, and the epilogue. For any function, we will have to perform the same prologue and epilogue to set up and tear down the function correctly.

Before executing the function body, we need to save the current values of two important registers: the frame pointer fp and the link register lr. We've already seen fp, which holds the address in the stack allocated for the current function call. The link register could also be called the return address, and tells us where to go back to after we call the function.

So in our function prologue, we push fp and lr onto the stack to save their current values, then set the frame pointer to the current stack pointer. Then we save the current values in the argument registers r0 through r4, since these are about to be clobbered with new values in the function body. The prologue looks like this:

addLine "  push {fp, lr}"  -- save the current frame pointer and link register 
addLine "  mov fp, sp"     -- set new frame pointer to current stack pointer
addLine "  push {r0, r1, r2, r3}" -- save argument registers

In the epilogue we essentially undo our work in the prologue. We deallocate the stack space we used in the current frame by moving fp into sp. Then, in case our function did not have a return statement, we implicitly return 0 by moving 0 into r0. Finally we restore the previous fp and pop the link register into the program counter register, pc, which sets us back to where we were in the program before the function. It looks like this:

addLine "  mov sp, fp"   -- deallocate stack space used for current frame
addLine "  mov r0, #0"   -- implicitly set return value to 0
addLine "  pop {fp, pc}" -- restore fp, pop the saved link register into pc to return 

In between these we need to emit the body. To execute the body, we need to set up a new local environment loaded with the parameters to the function. We do that in a function called bindParams, which takes a list of parameters and returns a new Context with the updated map of locals and nextLocalOffset. We store each paramter in slots of four bytes from the frame pointer:

bindParams :: [Text] -> Emit Context
bindParams params = do
  let offsets = [4 * i - 16 | i <- [0..]]
  setLocals (Map.fromList (zip params offsets))
  setNextLocalOffset (-20) -- 4 bytes after the allocated 4 params
  get

We actually return the Context wrapped in the Emit monad, since we want to keep the previous labelCounter preserved. That should continue to tick up between different function definitions, since otherwise we'll get repeat labels.

We need one more helper function, which we call withContext. This function takes two arguments, a context and some Emit action. All it does is save the value of the old context, put the new context, run the action, then restore the old context. This makes it so we can "pass" a fresh context to an action without clobbering the old one. Again, we avoid replacing the context wholesale to preserve that peksy labelCounter between function calls. In the end it looks like this:

withContext :: Context -> Emit () -> Emit ()
withContext ctx action = do
  Context{locals, nextLocalOffset} <- get
  put ctx
  action
  setLocals locals
  setNextLocalOffset nextLocalOffset

With all of the in place, all we need to do is put them together in our compilation of the function body, which is just this:

ctx <- bindParams params
withContext ctx (emit body)

Putting together the prologue, body, and epilogue, we finally arrive at the emit case for the AST node Function, which stores a function definition with its name, parameters and body:

Function name params body -> do
  if length params > 4 then 
    throwError (Unsupported "More than 4 params not supported")
  else do
    addLine ""
    addLine (".global " <> name)
    addLine (name <> ":")
    -- prologue
    addLine "  push {fp, lr}"  -- save the current frame pointer and link register
    addLine "  mov fp, sp"     -- set new frame pointer to current stack pointer
    addLine "  push {r0, r1, r2, r3}" -- save argument registers
    -- body
    ctx <- bindParams params
    withContext ctx (emit body)
    -- epilogue
    addLine "  mov sp, fp"   -- deallocate stack space used for current frame
    addLine "  mov r0, #0"   -- implicitly set return value to 0
    addLine "  pop {fp, pc}" -- restore fp, pop the saved link register into pc to return 

Again, we don't support more the 4 parameters, so we throw an error in that case. Otherwise we just create a new label for the function, and under that label put the code we've discussed above.

Return

We implement a Return AST node exactly as we implemented the function definition prologue, so all we need is

Return term -> do
  emit term
  addLine "  mov sp, fp"   -- reset the stack pointer to current frame pointer 
  addLine "  pop {fp, pc}" -- reset fp, pop lr into pc to return

Blocks

The last thing we'll look at is how to compile blocks. This is the simplest of all, since we just need to emit each statement contained in the block:

Block stmts ->
  forM_ stmts emit

Testing

And with that, we've implemented a Turing-complete language. We can store values in memory, and we can loop easily enough with recursion. Putting everything together we can look at a side-by-side comparison of a simple factorial program and the assembly that the compiler generates:

function main() {
  assert(factorial(5) == 120);
}

function factorial(n) {
  if (n == 1) {
    return 1;
  } else {
    return n * factorial(n - 1);
  }
}

function assert(x) {
  if (x) {
    putchar(46); // The character '.'
  } else {
    putchar(70); // The character 'F'
  }
}

When we assemble the generated assembly for this program, gcc will automatically link against libc, which gives us access to its standard functions like putchar. Here's the full assembly this produces:

.global main
main:
  push {fp, lr}
  mov fp, sp
  push {r0, r1, r2, r3}
  ldr r0, =5
  bl factorial
  push {r0, ip}
  ldr r0, =120
  pop {r1, ip}
  cmp r1, r0
  moveq r0, #1
  movne r0, #0
  bl assert
  mov sp, fp
  mov r0, #0
  pop {fp, pc}

.global factorial
factorial:
  push {fp, lr}
  mov fp, sp
  push {r0, r1, r2, r3}
  ldr r0, [fp, #-16]
  push {r0, ip}
  ldr r0, =1
  pop {r1, ip}
  cmp r1, r0
  moveq r0, #1
  movne r0, #0
  cmp r0, #0
  beq .L1
  ldr r0, =1
  mov sp, fp
  pop {fp, pc}
  b .L2
.L1:
  ldr r0, [fp, #-16]
  push {r0, ip}
  ldr r0, [fp, #-16]
  push {r0, ip}
  ldr r0, =1
  pop {r1, ip}
  sub r0, r1, r0
  bl factorial
  pop {r1, ip}
  mul r0, r1, r0
  mov sp, fp
  pop {fp, pc}
.L2:
  mov sp, fp
  mov r0, #0
  pop {fp, pc}

.global assert
assert:
  push {fp, lr}
  mov fp, sp
  push {r0, r1, r2, r3}
  ldr r0, [fp, #-16]
  cmp r0, #0
  beq .L3
  ldr r0, =46
  bl putchar
  b .L4
.L3:
  ldr r0, =70
  bl putchar
.L4:
  mov sp, fp
  mov r0, #0
  pop {fp, pc}

Afterword

That's really all I have. If you want to try out the compiler yourself or dig around with the full source, you can find it on GitHub here. I have a few more things implemented in the compiler than I showed here, like while loops, assignment statements, and array literals.

If you happen to have a Raspberry Pi running a 32-bit OS, then you can run the generated assembly natively. Otherwise, you'll need to cross compile and emulate from your machine. On Debian/Ubuntu, download the packages gcc-arm-linux-gnueabihf and qemu-user then use the shell scripts included in the repository.

I realized while writing this post that it would essentially be a worse version of Keleshev's book, since I wanted to give a tour of the compiler that made sense but also couldn't go into nearly as much detail as he did. So if you could follow along here, you'll probably find his book more enlightening.

It's not a perfect book, but it gives a great introduction to understanding and compiling to assembly that is hard to find in a lot of other places. In the second part of the book, which I don't cover here, he looks at some more advanced concepts like garbage collection, type checking, and working with heap objects.

Anyway, if you found this interesting, follow me on Twitter at @micah_cantor or feel free to reach out via email.