Compiling a subset of JavaScript to ARM assembly in Haskell
I recently got a copy of the book Compiling to Assembly from Scratch by Vladimir Keleshev, which details how to write a compiler from a subset of JavaScript to 32-bit ARM assembly code. The choice of ARM assembly is mainly for its simplicity in comparison to x86.
Keleshev elects to use TypeScript to write the compiler, which, as he explains in the preface, is largely a compromise for its familiar C-like syntax while still providing many functional programming features desirable for writing a compiler. However, I chose to write my version of the compiler in Haskell, as it's my favorite language, and Keleshev's focus on functional programming makes it a natural choice for the translation.
Overall, I had a lot of fun writing this compiler, as I got to learn more about the nitty-gritty, low-level details of how code is executed while getting more practice with Haskell. In this post, I won't cover every detail of the design of the compiler, but I'll try to hit on what I found to be the most important or interesting aspects of the code.
The Abstract Syntax Tree
The first step in writing our compiler is to define a data structure that abstractly represents the syntax of our source language, JavaScript. Once we convert the program source text to this internal representation (through parsing), our job will become to recursively traverse this tree.
We represent the abstract syntax tree (AST) in Haskell as a large sum type that looks like this:
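```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as Text

-- A sketch: the real type in the repository has more constructors
-- (while loops, assignment, and so on), but these are the ones we'll see here.
data Expr
  = Number Int
  | Boolean Bool
  | Identifier Text
  | Add Expr Expr
  | Equal Expr Expr
  | Call Text [Expr]
  | If Expr Expr Expr
  | Function Text [Text] Expr
  | Var Text Expr
  | Return Expr
  | Block [Expr]
```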
For each type of expression we want to represent, like number literals, identifiers or if expressions, we add a type constructor for that expression to Expr. With this data structure ready, we can now move on to parsing.
Parsing
The parser is the first step in our compiler pipeline. In the book, Keleshev chooses to write and use a small parser combinator library for his parser. Those familiar with Haskell will know this actually happens to be a real specialty of the language and the community, so translating this portion of the compiler was particularly natural.
I chose to use the megaparsec parser combinator library rather than writing my own as Keleshev does (though that would have been a fun exercise in itself). Parser combinators are small primitive functions that compose together in order to write a high-level parser that closely aligns with the prescribed grammar for the language.
The main difference between the Haskell and TypeScript code in this section is Haskell's support for monadic programming, which Keleshev humbly tries to imitate in TypeScript. An example should illuminate the difference here.
One part of the parser handles if statements. In TypeScript, Keleshev builds this parser as a chain of and and bind calls over smaller parsers, consuming a sequence of tokens: the if token, then a left paren, then some expression bound as the conditional, then a right paren, then the consequence and alternative statements. We can translate this code directly to Haskell using do-notation, which inserts those bind operations implicitly. Here's a sketch of how mine looks (ifToken, elseToken, expr, and statement are parsers defined elsewhere in the module):
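```haskell
import Data.Void (Void)
import Text.Megaparsec

-- The standard megaparsec setup: parse Text with no custom error component.
type Parser = Parsec Void Text

ifStatement :: Parser Expr
ifStatement = do
  _ <- ifToken
  condition <- parenthesized expr
  consequent <- statement
  _ <- elseToken
  alternate <- statement
  pure (If condition consequent alternate)
```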
Here, since we don't need the parsed text from ifToken or elseToken, I just bind them to the wildcard, _. This parser also uses a helper function called parenthesized, which takes a parser and returns the same parser that only matches between two parentheses, something like:
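```haskell
-- A sketch; leftParen and rightParen are single-token parsers for "(" and ")".
parenthesized :: Parser a -> Parser a
parenthesized p = leftParen *> p <* rightParen
```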
Personally, I find the do-notation very helpful for the readability of monadic code like ifStatement, since it allows us to flatten the code and makes the order in which we bind intermediate results clear.
There are other benefits to using an established parser combinator library like megaparsec rather than rolling our own. One small example is parsing the list of arguments in a function call. In a call like f(x, 1 + 2), we want a parser that consumes this comma-separated list of expressions and returns a Parser [Expr], a list of expressions wrapped in the carrier Parser type. Specifically, we should parse the arguments as the list [Identifier "x", Add (Number 1) (Number 2)].
Keleshev's TypeScript parser for this either finds one expression followed by zero or more comma-separated expressions, or nothing, in which case it returns an empty list.
In Haskell, this job is easier, since megaparsec provides a useful sepBy combinator that does this for us. So we could instead write
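```haskell
args :: Parser [Expr]
args = expr `sepBy` comma
```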
where expr parses any expression and comma parses a comma token with optional whitespace. But this parser is simple enough that in my code I keep this inline in the parser for function calls.
Once you get used to the style and idioms of parser combinators, the job of converting from a grammar to a parser for that grammar becomes fairly mechanical, yet megaparsec still gives great error messages and performance.
Code Generation Framework
With the parser set, we can now take a basic JavaScript program and translate it into the corresponding AST represented by our Expr type. Now we need to process that Expr and produce a text file of assembly code which we can then assemble and execute.
This suggests that our emit function to produce the assembly source should be
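```haskell
emit :: Expr -> Text
```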
But while we do want to output some text, we also need to track some effects along the way, namely compile-time errors and the state of variable bindings and other values.
Therefore, we take a traditional approach to structuring a Haskell program: we build a stack of monads in which to perform these effects, and have our emit function output a computation in that stack instead. First, though, we need some data types that will be tracked by these effects.
We have a simple CompilerError data type that enumerates the possible compiler errors and an associated error message:
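```haskell
data CompilerError
  = Unsupported Text       -- a feature we haven't implemented
  | UndefinedVariable Text -- a reference to a variable we haven't seen
  deriving (Show)
```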
Right now we have only two compiler errors, Unsupported for unimplemented features and UndefinedVariable for, well, undefined variables.
Then we have a Context data type that tracks three pieces of information we need throughout the code generation:
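```haskell
import Data.Map (Map)
import qualified Data.Map as Map

data Context = Context
  { locals          :: Map Text Int -- variable name -> stack offset from fp
  , labelCounter    :: Int          -- a counter for generating unique labels
  , nextLocalOffset :: Int          -- where the next local goes on the stack
  }
```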
First is locals, a map of our variable bindings to their offset locations in the stack. Second is labelCounter, a counter we need to generate unique labels that don't clash with each other. Lastly, nextLocalOffset tracks where in the stack the next variable binding will be located.
Finally, we want to progressively build up Text with the assembly source, and the most efficient way to do this is to use the Builder type from the text package. So we have a simple type alias:
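```haskell
import Data.Text.Lazy.Builder (Builder)
import qualified Data.Text.Lazy.Builder as Builder

type AsmSrc = Builder
```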
With all of that in place, we can finally build our monad transformer stack:
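```haskell
import Control.Monad (forM_)
import Control.Monad.Except (Except, runExcept, throwError)
import Control.Monad.State (StateT, evalStateT, get, gets, modify, put)
import Control.Monad.Writer (WriterT, execWriterT, tell)

-- One plausible ordering of the stack; the exact nesting in the repo may differ.
type Emit = StateT Context (WriterT AsmSrc (Except CompilerError))
```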
This is a computation that tracks three effects: the state of the code generation with StateT Context, any compiler errors with Except CompilerError, and the generated text builder, in WriterT AsmSrc. By using the monad transformers from mtl, we can access all of these effects in one data type, which we call Emit.
Then we need some way of pulling out either the compiler error or the generated code, so we define a function, call it runEmit, to do so:
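```haskell
runEmit :: Context -> Emit () -> Either CompilerError AsmSrc
runEmit ctx emitter = runExcept (execWriterT (evalStateT emitter ctx))
```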
which unwraps each monad from the inside-out.
With that framework set up, we can finally get around to defining our emit function, which now has the type
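```haskell
emit :: Expr -> Emit ()
```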
meaning that it takes in an expression, and returns an Emit () computation. There is no value to wrap in that computation, since this procedure is all "side-effects," namely tracking state, throwing errors, and building up the generated assembly.
We add a short helper procedure to add a line to our assembly source, which we will use often; I'll call it addLine:
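```haskell
addLine :: Text -> Emit ()
addLine line = tell (Builder.fromText line <> "\n")
```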
This function uses tell to append the given text with a newline to the Builder that lives in our WriterT monad. With this we can write
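```haskell
addLine "  mov r0, #0"
```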
to add that line of text to our assembly.
ARM-32 Assembly Code
Now that we have this monadic framework set up, how do we actually generate assembly code? The general idea is to visit each node in our AST and emit some assembly code based on the type of that node. Specifically, we generally put intermediate values in the program into the "argument registers", which in ARM assembly are the first four registers, r0 through r3. Let's look at some examples:
Booleans
To compile booleans, we place the number 1 into the first register, r0, if the value is true, and otherwise place the number 0 into r0. In our emit procedure, that looks like this:
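```haskell
-- A sketch, assuming a Boolean constructor in Expr:
emit (Boolean b) =
  if b
    then addLine "  mov r0, #1"
    else addLine "  mov r0, #0"
```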
This means that our representation for true and false is essentially identical to just using 1 and 0; we aren't distinguishing those types. Also note that for readability of the generated assembly, we manually indent instruction lines by two spaces at the start of each text literal. Now let's look at numbers.
Numbers
We only deal with integers in this toy compiler, so to compile them, we just put the value of the integer into r0:
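```haskell
emit (Number n) = addLine ("  ldr r0, =" <> Text.pack (show n))
```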
We use the pseudo-instruction ldr (short for load register) here instead of mov in case the number does not fit into the relatively small range of an immediate value. With that literal in r0, we can access it when processing a parent AST node such as Add.
Infix Operators
Compiling an infix expression like 3 + 4 is a bit more complicated. Our first thought might be to put the left value in r0 and the right value in r1, then add them. But that would fail if one of the operands were itself a nested expression like (5 + 3) + 2: emitting the code for the nested expression uses those same registers for its own intermediate values, so it would clobber the value stored in r1.
Instead, we have to make use of the stack. Our strategy will be to emit the left expression, push it onto the stack, then emit the right expression, and pull the left back into r1. Then even if we have a nested expression they won't overwrite each other in memory.
There is an extra complication to using the stack, however. When the call stack interacts with external interfaces, such as calling a libc function (which we will use to provide basic functionality such as printing characters), the stack pointer (the address of the top of the stack) must be aligned to an 8-byte boundary. For simplicity, Keleshev elects to always maintain this invariant when utilizing the stack. This means that if we want to push a single word onto the stack, we must also push some dummy value, such as the ip register, alongside it.
In the end, our code for Add looks like this:
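```haskell
emit (Add left right) = do
  emit left
  addLine "  push {r0, ip}" -- save left; ip is a dummy for 8-byte alignment
  emit right
  addLine "  pop {r1, ip}"  -- pull left back out into r1
  addLine "  add r0, r1, r0"
```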
In fact, the code to compile other infix operations like subtraction, multiplication, and equality comparison would be identical except for the last line. Therefore, we can refactor this to write a more general emitInfix function, which takes some Emit action to place after the standard code for working with the left and right expressions:
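```haskell
emitInfix :: Expr -> Expr -> Emit () -> Emit ()
emitInfix left right operation = do
  emit left
  addLine "  push {r0, ip}"
  emit right
  addLine "  pop {r1, ip}"
  operation -- the left operand is now in r1, the right in r0
```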
Then we can rewrite the emit case for Add simply as
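```haskell
emit (Add left right) = emitInfix left right (addLine "  add r0, r1, r0")
```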
The equality infix operator == works the same way, except instead of adding the two values, we compare them with the cmp instruction, returning 1 if they are equal and 0 otherwise. It looks like this:
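```haskell
-- Assuming an Equal constructor for the == operator:
emit (Equal left right) = emitInfix left right $ do
  addLine "  cmp r1, r0"
  addLine "  moveq r0, #1"
  addLine "  movne r0, #0"
```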
The moveq instruction only executes if cmp r1, r0 found the two values equal, and movne does the opposite.
If Statements
We can make use of the same kind of conditional logic we saw in the implementation of the equality operator to compile if statements.
Here, we also need the concept of labels, which are simply named blocks of assembly that we can branch to from some other part of the code. Essentially, to compile an if statement, we create a label for the false branch and a label for the code that runs after the if statement. Then we emit the condition expression and check whether it is truthy. If it is, we continue on to the consequent block; otherwise we branch to the alternate block. At the end of the consequent, we branch unconditionally to the rest of the code.
We can name the labels anything, but we need to be careful because we need them to be unique so that if there are two if statements in a program, the labels do not conflict with each other. Therefore we carry around a counter in our Context that we use to create a unique label, with this makeLabel function:
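```haskell
incrementLabelCounter :: Emit ()
incrementLabelCounter = modify (\ctx -> ctx { labelCounter = labelCounter ctx + 1 })

makeLabel :: Emit Text
makeLabel = do
  counter <- gets labelCounter
  incrementLabelCounter
  -- the ".L" prefix marks the label as local to the file in GNU as
  pure (".L" <> Text.pack (show counter))
```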
The helper function incrementLabelCounter simply adds one to the label counter and adjusts the state in the Emit monad. Now here is the full code for the If case of the emit function, where the b instruction is short for "branch":
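```haskell
emit (If condition consequent alternate) = do
  ifFalseLabel <- makeLabel
  endIfLabel <- makeLabel
  emit condition
  addLine "  cmp r0, #0"             -- is the condition falsy?
  addLine ("  beq " <> ifFalseLabel) -- if so, jump to the alternate
  emit consequent
  addLine ("  b " <> endIfLabel)     -- skip past the alternate
  addLine (ifFalseLabel <> ":")
  emit alternate
  addLine (endIfLabel <> ":")
```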
As described above, we create two unique labels, then arrange the code as necessary within these new labels.
Variable Bindings
Next up in our whirlwind compiler tour is variable bindings. In our Context we store a Map called locals where the keys are the names of the variables and the values are their offsets from the current frame pointer fp. Our plan is to store local variables in the current stack frame, and when we need to retrieve one, we look up its name in locals and reference that memory location.
The only complication is that again, we store all the variables at two-word, 8-byte boundaries in the stack. To do this, we keep track of the nextLocalOffset, which is a running count of where to put the next local variable. In the end our emit case for Var looks like this:
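```haskell
emit (Var name value) = do
  emit value
  addLine "  push {r0, ip}" -- push the value, keeping 8-byte alignment
  ctx <- get
  put ctx { locals = Map.insert name (nextLocalOffset ctx - 4) (locals ctx)
          , nextLocalOffset = nextLocalOffset ctx - 8
          }
```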
Here we just emit the value into r0, push it onto the stack with alignment, then update our locals map and set the next local offset down 8 bytes. This is because the stack actually grows down in memory from the stack pointer.
Identifiers
Now we need to implement the reverse: pulling out the value for a variable identifier. To do this, we implement a higher-order procedure called withVarLookup. This function takes the name of a variable and a function that takes its offset as the input to some Emit action. After we look up the offset of the variable, we pass it to the given function, and if the name does not exist in locals we throw an error:
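```haskell
withVarLookup :: Text -> (Int -> Emit ()) -> Emit ()
withVarLookup name withOffset = do
  ctx <- get
  case Map.lookup name (locals ctx) of
    Just offset -> withOffset offset
    Nothing     -> throwError (UndefinedVariable name)
```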
With this we can implement the emit case for Identifier, which is an AST node which holds the name of a variable:
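```haskell
emit (Identifier name) = withVarLookup name $ \offset ->
  addLine ("  ldr r0, [fp, #" <> Text.pack (show offset) <> "]")
```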
So to compile an identifier, we look up the name of the variable, then load r0 with the value of that variable at its offset from fp. The bracket syntax references a memory address, so for example [fp, #-4] refers to the memory location at the frame pointer minus 4 bytes. This is exactly what we want, with the appropriate offset plugged in.
Function Calls
Even though we haven't yet implemented function definitions, we are going to first implement function calls. A function call looks like foo(1, 2) in JavaScript and is parsed into a Call node by our parser which has two components: the name of the callee and the list of arguments.
In ARM assembly, the convention is that the first four registers are for function arguments. If we have more than four arguments, they must be put on the stack, and for simplicity this difficulty is avoided in the baseline compiler in Keleshev's book.
In our compiler, we must still use the stack, however, since if we try to move the arguments directly into the registers, they could clobber each other. So we will allocate memory on the stack, then emit each argument and store it in the appropriate slot. Then we pop the values back into the argument registers and use the bl instruction to "branch and link" to the name of the callee. In the end it looks like this:
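```haskell
-- A sketch (forM_ comes from Control.Monad):
emit (Call callee args)
  | null args = addLine ("  bl " <> callee)
  | [arg] <- args = do
      emit arg
      addLine ("  bl " <> callee)
  | length args <= 4 = do
      addLine "  sub sp, sp, #16" -- reserve four words on the stack
      forM_ (zip [0, 4 ..] args) $ \(offset, arg) -> do
        emit arg
        addLine ("  str r0, [sp, #" <> Text.pack (show (offset :: Int)) <> "]")
      addLine "  pop {r0, r1, r2, r3}" -- move the arguments into the registers
      addLine ("  bl " <> callee)
  | otherwise = throwError (Unsupported "call with more than 4 arguments")
```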
We have an optimization for functions of zero or one arguments: in those cases there is no need to use the stack. Otherwise we do as described above, and throw an Unsupported error if there are more than four arguments.
Function Definitions
Next we will actually be able to use these function calls by implementing function definitions. Defining a function has three main parts: the prologue, the function body, and the epilogue. For any function, we will have to perform the same prologue and epilogue to set up and tear down the function correctly.
Before executing the function body, we need to save the current values of two important registers: the frame pointer fp and the link register lr. We've already seen fp, which holds the address in the stack allocated for the current function call. The link register could also be called the return address, and tells us where to go back to after we call the function.
So in our function prologue, we push fp and lr onto the stack to save their current values, then set the frame pointer to the current stack pointer. Then we save the current values in the argument registers r0 through r3, since these are about to be clobbered with new values in the function body. The prologue, which we can pull into its own little helper, looks like this:
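```haskell
emitPrologue :: Emit ()
emitPrologue = do
  addLine "  push {fp, lr}"         -- save the caller's fp and return address
  addLine "  mov fp, sp"            -- the new frame starts here
  addLine "  push {r0, r1, r2, r3}" -- save the argument registers
```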
In the epilogue we essentially undo our work in the prologue. We deallocate the stack space we used in the current frame by moving fp into sp. Then, in case our function did not have a return statement, we implicitly return 0 by moving 0 into r0. Finally, we restore the previous fp and pop the link register into the program counter register, pc, which sets us back to where we were in the program before the function call. As its own helper, it looks like this:
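```haskell
emitEpilogue :: Emit ()
emitEpilogue = do
  addLine "  mov sp, fp"   -- deallocate this frame's stack space
  addLine "  mov r0, #0"   -- implicitly return 0
  addLine "  pop {fp, pc}" -- restore fp and jump back to the caller
```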
In between these we need to emit the body. To execute the body, we need to set up a new local environment loaded with the parameters to the function. We do that in a function called bindParams, which takes a list of parameters and returns a new Context with the updated map of locals and nextLocalOffset. We store each parameter in a four-byte slot below the frame pointer:
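```haskell
bindParams :: [Text] -> Emit Context
bindParams params = do
  ctx <- get
  -- after the prologue, the pushed arguments sit at fp - 16 up through fp - 4
  let offsets = zip params [4 * i - 16 | i <- [0 ..]]
  pure ctx { locals = Map.fromList offsets
           , nextLocalOffset = -20 -- the first local goes just below them
           }
```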
We actually return the Context wrapped in the Emit monad, since we want to keep the previous labelCounter preserved. That should continue to tick up between different function definitions, since otherwise we'll get repeat labels.
We need one more helper function, which we call withContext. This function takes two arguments, a context and some Emit action. All it does is save the old context, put the new context, run the action, then restore the old context. This lets us "pass" a fresh context to an action without clobbering the old one. Again, we avoid replacing the restored context wholesale to preserve that pesky labelCounter between function definitions. In the end it looks like this:
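```haskell
withContext :: Context -> Emit () -> Emit ()
withContext newContext action = do
  oldContext <- get
  put newContext
  action
  -- restore the old context, but keep the label counter ticking upwards
  counter <- gets labelCounter
  put oldContext { labelCounter = counter }
```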
With all of that in place, all we need to do is put the pieces together to compile the function body, in a little emitBody helper, which is just this:
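```haskell
emitBody :: [Text] -> Expr -> Emit ()
emitBody params body = do
  context <- bindParams params
  withContext context (emit body)
```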
Putting together the prologue, body, and epilogue, we finally arrive at the emit case for the AST node Function, which stores a function definition with its name, parameters and body:
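```haskell
emit (Function name params body)
  | length params > 4 = throwError (Unsupported "more than 4 parameters")
  | otherwise = do
      addLine ""
      addLine (".global " <> name)
      addLine (name <> ":")
      emitPrologue
      emitBody params body
      emitEpilogue
```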
Again, we don't support more than four parameters, so we throw an error in that case. Otherwise we just create a new label for the function, and under that label put the code we've discussed above.
Return
We implement a Return AST node essentially as we implemented the function definition epilogue, so all we need is
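```haskell
emit (Return term) = do
  emit term -- the return value ends up in r0
  addLine "  mov sp, fp"
  addLine "  pop {fp, pc}"
```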
Blocks
The last thing we'll look at is how to compile blocks. This is the simplest of all, since we just need to emit each statement contained in the block:
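```haskell
emit (Block statements) = mapM_ emit statements
```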
Testing
And with that, we've implemented a Turing-complete language: we can store values in memory, and we can loop easily enough with recursion. Putting everything together, we can compile a simple factorial program, something like function factorial(n) { if (n == 0) { return 1; } else { return factorial(n - 1) * n; } }, and look at the assembly the compiler generates for it. When we assemble that output, gcc will automatically link against libc, which gives us access to its standard functions like putchar.
Afterword
That's really all I have. If you want to try out the compiler yourself or dig around with the full source, you can find it on GitHub here. I have a few more things implemented in the compiler than I showed here, like while loops, assignment statements, and array literals.
If you happen to have a Raspberry Pi running a 32-bit OS, then you can run the generated assembly natively. Otherwise, you'll need to cross-compile and emulate on your machine. On Debian/Ubuntu, download the packages gcc-arm-linux-gnueabihf and qemu-user, then use the shell scripts included in the repository.
I realized while writing this post that it would essentially be a worse version of Keleshev's book, since I wanted to give a tour of the compiler that made sense but also couldn't go into nearly as much detail as he did. So if you could follow along here, you'll probably find his book more enlightening.
It's not a perfect book, but it gives a great introduction to understanding and compiling to assembly that is hard to find in a lot of other places. In the second part of the book, which I don't cover here, he looks at some more advanced concepts like garbage collection, type checking, and working with heap objects.
Anyway, if you found this interesting, follow me on Twitter at @micah_cantor or feel free to reach out via email.