Portfolio

CS & Mathematics · Cornell University · U.S. Top Secret Security Clearance

Bo Nix: Building a Compiler for Eta and Rho

Bo Nix is a compiler for Eta and its extension Rho — two small, statically-typed, C-like languages designed for Cornell’s CS 4120. Eta is the core language; Rho adds records with field lvalues, null semantics, and a module system. The compiler is written in Java, uses Gradle for builds, and targets GNU assembler-compatible x86-64.

A note on the name: Bo Nix is named after a character in an elaborate piece of in-universe fiction the team wrote as lore for the project — a child born in a prison cell to a mathematician who had just exposed a conspiracy hidden inside a Critter Theory paper. Short, simple, strong enough to carry whatever future the child would build. The compiler needed a name with the same properties.

Phase 1: Lexing Eta

The first surprise in lexing Eta is numerical. The language spec allows integers in the range [263,2631][-2^{63}, 2^{63} - 1]. Java’s long is a signed 64-bit integer covering [263,2631][-2^{63}, 2^{63} - 1], which seems to fit — except the lexer sees tokens before negation is applied by the parser. The literal 9223372036854775808 (which is 2632^{63}) is a valid token even though it can only appear as the negated constant 263-2^{63}. Treating it as a long immediately overflows.

The solution: treat the raw token as a Java unsigned long using Long.parseUnsignedLong, which covers [0,2641][0, 2^{64} - 1], far more than enough. Numbers that exceed 2632^{63} are stored fine; the parser decides whether they are valid in context (e.g., as the argument of unary minus). This keeps the lexer simple and pushes semantic concerns to the right phase.

The lexer itself is generated by JFlex. The token hierarchy uses Java’s sealed interfaces:

sealed interface Token permits
    IntLiteral, StringLiteral, CharLiteral,
    Identifier, Keyword, Operator, EOF { ... }

Sealed interfaces give you two things simultaneously: exhaustive pattern matching (the compiler enforces you handle every case) and precise per-token attribute types (a StringLiteral carries its processed string value; a Keyword carries its keyword kind; the shapes are structurally different). This eliminates a whole class of runtime cast errors.

Phase 2: Parsing and the AST

The parser is generated with JCup (Java CUP) and layered over the JFlex lexer. We define terminals from the lexer, write grammar productions for Eta/Rho constructs, and construct AST nodes directly in CUP actions. We also use CUP precedence and associativity declarations to resolve expression parsing deterministically (including the classic if/else ambiguity).

Every AST node carries a source location (file, line, column). This is essential — every error message the compiler produces traces back to a specific position in the source. Losing location information anywhere in the pipeline produces useless diagnostics.

Phase 3: Semantic Analysis and the Type Checker

The type checker validates well-formedness and assigns a type to every AST node. Eta’s type system is simple (integers, booleans, arrays, functions — no type inference), but the implementation requires careful attention to:

  • Variable scoping: functions create a new scope; loops don’t leak their loop variable
  • Return type checking: every control path in a function body must return the declared type
  • Array types: int[] and int[][] are distinct; indexing must be typed correctly

Rho adds records, which extend the type checker with a nominal record environment. Field access e.f requires looking up e’s type in the record table and checking that f is a declared field.

Phase 4: The IR and CFG

Between the type-annotated AST and final assembly, the compiler uses a custom three-address IR: each instruction has at most two source operands and one destination, mapping cleanly to x86-64 binary instructions. The IR is organized into a control flow graph (CFG): a directed graph of basic blocks, where each basic block is a maximal straight-line sequence ending in a branch or return.

CFG structure enables the classical optimization passes:

Constant propagation: if a variable is always assigned the same constant value, replace all its uses with that constant and eliminate the assignment.

Dead code elimination: any instruction whose result is never used (and has no side effects) can be removed. Combined with constant propagation, this eliminates a surprising fraction of compiler-introduced temporaries.

Unreachable block removal: if a conditional branch’s condition is statically known (after propagation), one successor is unreachable and can be removed, enabling further simplification.

The compiler emits DOT files for CFGs, so you can visualize the graph before and after each optimization pass. This was invaluable for debugging: optimization bugs are subtle, and seeing the control flow visually makes incorrect transformations immediately obvious.

Phase 5: Register Allocation

Register allocation is the problem of mapping the CFG’s unbounded set of temporaries to the x86-64’s 14 general-purpose registers (the other two are reserved for the frame pointer and stack pointer), inserting spills to memory when temporaries exceed the available registers.

Bo Nix uses graph-coloring allocation (Chaitin’s algorithm). The key insight is that two temporaries interfere if they are simultaneously live at any program point — assigning them the same register would produce incorrect code. The interference relation defines a graph; register allocation is graph coloring, where the number of colors is the number of available registers.

Liveness analysis is the prerequisite. A temporary is live at program point pp if there exists some path from pp to a use of the temporary that doesn’t pass through a redefinition. This is a backward dataflow equation:

Livein(B)=Use(B)(Liveout(B)Def(B))\text{Live}_{\text{in}}(B) = \text{Use}(B) \cup (\text{Live}_{\text{out}}(B) \setminus \text{Def}(B)) Liveout(B)=Ssucc(B)Livein(S)\text{Live}_{\text{out}}(B) = \bigcup_{S \in \text{succ}(B)} \text{Live}_{\text{in}}(S)

Solving this to a fixed point gives the liveness sets; building the interference graph from liveness is then straightforward. Graph coloring proceeds by iteratively removing low-degree nodes (degree < number of registers), coloring in reverse order. Nodes that cannot be colored are spilled — their value is stored to a stack slot and reloaded on each use.

Phase 6: Extending to Rho

Rho adds records to Eta: struct-like values with named fields. Records lower to heap memory: each record is a pointer to a contiguous block of fields, laid out according to the x86-64 System V ABI (with alignment padding between fields as needed).

Field access e.f lowers to an address computation plus a load:

addr = base_ptr + field_offset(f)
value = load(addr)

Field assignment e.f = v lowers to:

addr = base_ptr + field_offset(f)
store(addr, value)

Null semantics required adding a null pointer type and inserting null-check guards at field accesses and method calls. Eta had no notion of null; Rho makes it explicit and checkable.

The final test harness compiles programs to native binaries and validates them against a reference interpreter inside Docker. Getting ABI-compliant calling conventions right — correct stack alignment, correct register usage at call boundaries, correct handling of caller-saved vs callee-saved registers — was the most painstaking part of the backend.

What a Compiler Teaches You

A compiler is a sequence of meaning-preserving transformations, each with its own internal invariants. When something breaks, the discipline of maintaining those invariants tells you exactly which phase introduced the bug: a semantic error is in the type checker, a shape error is in IR lowering, a wrong value at runtime is in codegen or register allocation. The phases are your debugging grid.

The other lesson is that the design space for each phase is smaller than it looks. Once you commit to a CFG-based IR, the optimization passes are almost forced. Once you commit to graph coloring, the liveness analysis is required. Good architectural decisions narrow the implementation to the interesting parts.