Notes on the LHC: October 2010

Tuesday, October 19, 2010

Rough organizational overview.

The exact details are constantly changing but here's a rough overview of the LHC pipeline.

External Core.
We've designed our compiler to use GHC as its frontend. This means that GHC will handle the parsing and type-checking of the Haskell code in addition to some of the optimization (GHC particularly excels at high-level local optimizations). LHC benefits greatly by automatically supporting many of the Haskell extensions offered by GHC.
Notable characteristics: Non-strict, local functions, complex let-bindings. Pretty much just Haskell code with zero syntactic sugar.
Example snippet:
```
  base:Data.Either.$fShowEither :: ghc-prim:GHC.Types.Int =
    ghc-prim:GHC.Types.I# (11::ghc-prim:GHC.Prim.Int#);
```
Simple Core.
Since External Core isn't immediately ready to be processed into GRIN code, we first translate it to Simple Core by removing or simplifying a couple of features. The most noticeable feature of External Core is locally scoped functions, which simply do not fit in with the GRIN model. When translating to Simple Core, we hoist all local functions out to the top-level.
Notable characteristics: Non-strict, no local functions, simplified let-bindings.
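To illustrate the hoisting step, here is a minimal sketch written in plain Haskell rather than in actual Core. The names `addAllLocal`, `addAllLifted` and `addN` are made up for this example, and I am assuming the usual lambda-lifting approach where the free variables of a local function become extra parameters.

```haskell
-- Before: 'go' is a local function that captures 'n' from its environment.
addAllLocal :: Int -> [Int] -> [Int]
addAllLocal n xs = map go xs
  where
    go x = x + n

-- After: the local function has been hoisted to the top-level.
-- Its free variable 'n' is now passed explicitly as an extra argument.
addN :: Int -> Int -> Int
addN n x = x + n

addAllLifted :: Int -> [Int] -> [Int]
addAllLifted n xs = map (addN n) xs
```

Both versions compute the same result; only the scoping has changed.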
Grin Stage 1.
Let me start by introducing GRIN: GRIN (Graph Reduction Intermediate Notation) is a first-order, strict, (somewhat) functional language.
The purpose of this first stage of GRIN code is to encode the laziness explicitly. It turns out that you can translate a lazy language (like Simple Core) to a strict language (like GRIN) using only two primitives: eval and apply. The 'eval' primitive takes a closure, evaluates it if need be and returns the resulting object. The 'apply' primitive simply adds an argument to a closure. Haskell compilers such as GHC, JHC and UHC all use this model for implementing laziness.
Notable characteristics: Strict, explicit laziness, opaque closures.
Example snippet:
```
base:Foreign.C.Types.@lifted_exp@ w ws =
  do x2508 <- @eval ws
     case x2508 of
       (Cbase:GHC.Int.I32# x#)
         -> do x2510 <- unit 11
               base:GHC.Show.$wshowSignedInt x2510 x# w
```
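As a rough illustration of the eval/apply model with opaque closures, here is a toy version in Haskell. All names (`Value`, `Closure`, `unit`, `plus`, `addValues`, `example`) are hypothetical, and the sketch leaves out the update step where a real implementation would overwrite an evaluated closure with its result.

```haskell
-- An evaluated object: an integer or a function awaiting one argument.
data Value
  = IntV Int
  | FunV (Closure -> Value)

-- An opaque closure: the only thing we can do with it is run it.
newtype Closure = Closure (() -> Value)

-- Wrap an already evaluated value as a closure.
unit :: Value -> Closure
unit v = Closure (\() -> v)

-- 'eval' runs a closure and returns the resulting object.
eval :: Closure -> Value
eval (Closure run) = run ()

-- 'apply' adds an argument to a function closure. Nothing is evaluated
-- until the resulting closure is itself passed to 'eval'.
apply :: Closure -> Closure -> Closure
apply f x = Closure run
  where
    run () = case eval f of
      FunV g -> g x
      IntV _ -> error "apply: not a function closure"

-- A curried addition that forces both of its arguments.
plus :: Closure
plus = unit (FunV (\a -> FunV (\b -> addValues (eval a) (eval b))))

addValues :: Value -> Value -> Value
addValues (IntV x) (IntV y) = IntV (x + y)
addValues _        _        = error "plus: expected two integers"

-- 'plus 1 2' built as a suspended computation and forced at the very end.
example :: Int
example =
  case eval (apply (apply plus (unit (IntV 1))) (unit (IntV 2))) of
    IntV n -> n
    FunV _ -> error "example: expected an integer"
```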
Grin Stage 2.
At the time of writing, each of the mentioned compilers stops at the previous stage (or at what would be their equivalent of that stage).[1] LHC follows in the footsteps of the original GRIN compiler and applies a global control-flow analysis to eliminate/inline all eval/apply primitives. In the end, a lazy/suspended function taking, say, two arguments simply becomes a data constructor with two fields.
Notable characteristics: Strict, transparent closures.
Example snippet:
```
base:Foreign.Marshal.Utils.toBool1_caf =
  do [x2422] <- constant 0
     [x2423] <- @realWorld#
     [x2424 x2425] <- (foreign lhc_mp_from_int) x2422 x2423
     [x2426] <- constant Cinteger-gmp:GHC.Integer.Type.Integer
     unit [x2426 x2425]
```
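To make the idea of transparent closures a bit more concrete, here is a small Haskell sketch where each suspended call is an ordinary constructor with one field per argument. The functions `add` and `double`, and the `F` prefix for suspended calls (mirroring the `C` prefix on constructors in the snippets), are assumptions made for illustration.

```haskell
-- Every heap node is an ordinary constructor.
data Node
  = CInt Int        -- an evaluated integer
  | FAdd Node Node  -- suspended call 'add a b': two arguments, two fields
  | FDouble Node    -- suspended call 'double a'
  deriving (Eq, Show)

-- With the whole program known, 'eval' is just a case over constructors.
eval :: Node -> Node
eval (FAdd a b)  = add a b
eval (FDouble a) = double a
eval node        = node

add :: Node -> Node -> Node
add a b = case (eval a, eval b) of
  (CInt x, CInt y) -> CInt (x + y)
  _                -> error "add: expected evaluated integers"

double :: Node -> Node
double a = add a a

-- If the analysis shows that only 'FDouble' or 'CInt' nodes can reach a
-- given call site, the general 'eval' can be replaced by this smaller,
-- inlinable case. It assumes 'FAdd' never shows up here.
evalDoubleOrInt :: Node -> Node
evalDoubleOrInt (FDouble a) = double a
evalDoubleOrInt node        = node
```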
Grin Stage 3.
Things are starting to get fairly low-level already at stage 2. However, stage 2 is still a bit too high-level for some optimizations to be easily implemented. Stage 3 breaks the code into smaller blocks that can easily be moved, inlined and short-circuited. The code is now sufficiently low-level that it can be pretty-printed as C.
Notable characteristics: Functions are broken down to functional units. Otherwise same as stage 2.
Example snippet:
```
base:GHC.IO.Encoding.Iconv.@lifted@_lvl60swYU38 rb3 rb4 =
  do [x21578] <- @-# rb4 rb3
     case x21578 of
       0 -> constant Cghc-prim:GHC.Bool.False
       () -> constant Cghc-prim:GHC.Bool.True
```
Grin--.
Grin-- is the latest addition to the heap and not much is known about it for certain. It is even up for debate whether it belongs to the GRIN family at all since it diverges from the SSA style.
The purpose of Grin-- is to provide a vessel for expressing stack operations.
Notable characteristics: Operates on global virtual registers, enables explicit stack management.
Example snippet:
```
base:GHC.IO.Encoding.Iconv.@lifted@_lvl60swYU38:
  do x21578 := -# rb4 rb3
     case x21578 of
       0 -> do x88175 := Cghc-prim:GHC.Bool.False
               ret
       () -> do x88175 := Cghc-prim:GHC.Bool.True
                ret
```

Feel free to ask if you have any questions on the how and why of LHC.

[1] UHC does have the mechanics for lowering the eval/apply primitives but it is not enabled by default.

Saturday, October 16, 2010

Accurate garbage collection.

So, let's talk about garbage collection. Garbage collection is a very interesting topic because it is exceedingly simple in theory but very difficult in practice.

To support garbage collection, the key thing a language implementor has to do is to provide a way for the GC to find all live heap pointers (called root pointers). This sounds fairly easy to do but can get quite complicated in the presence of aggressive optimizations and register allocation. A tempting (and often used) solution would be to break encapsulation and make the optimizations aware of the GC requirements. This of course becomes harder the more advanced the optimizations are and with LHC it is pretty much impossible. Consider the following GRIN code:


```
-- 'otherFunction' returns an object of type 'Maybe Int' using two virtual registers.
-- If 'x' is 'Nothing' then 'y' is undefined.
-- If 'x' is 'Just' then 'y' is a root pointer.
someFunction
 = do x, y <- otherFunction; ....
```

The above function illustrates that it is not always straightforward to figure out if a variable contains a root pointer. Sometimes determining that requires looking at other variables.

So how might we get around this hurdle, you might ask. Well, if the code for marking roots resides in user-code instead of in the RTS then it can be as complex as it needs to be. This fits well with the GRIN ideology of expressing as much in user-code as possible.
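As a minimal sketch of what such user-level marking code might look like for the `someFunction` example, here is a Haskell model of the two virtual registers. `Tag`, `HeapPtr` and `rootsOfMaybe` are hypothetical names, and real marking code would of course operate on machine words rather than Haskell values.

```haskell
-- Contents of the first register: which constructor was returned.
data Tag = NothingTag | JustTag
  deriving (Eq, Show)

-- Hypothetical stand-in for a heap address.
newtype HeapPtr = HeapPtr Int
  deriving (Eq, Show)

-- Roots contributed by the register pair (x, y).
-- 'y' only counts as a root when 'x' says the value is a 'Just'.
rootsOfMaybe :: Tag -> Int -> [HeapPtr]
rootsOfMaybe NothingTag _ = []           -- 'y' is undefined; ignore it
rootsOfMaybe JustTag    y = [HeapPtr y]  -- 'y' points into the heap
```

The point is that the decision depends on `x`, which a generic RTS routine looking at `y` alone could not know.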

Now that we're familiar with the problem and the general concept of the solution, let's work out some of the details. Here's what happens when a GC event is triggered, described algorithmically:

1. Save registers to memory. This is to avoid clobbering the registers and to make them accessible from the GC code.
2. Save stack pointer.
3. Initiate temporary stack. Local variables from the GC code will be placed on this stack.
4. Jump to code for marking root pointers. This will peel back each stack frame until the bottom of the call graph has been reached.
5. Discard temporary stack.
6. Restore stack pointer.
7. Restore registers.
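The sequence above can be simulated in a few lines of Haskell. This is only a model under heavy assumptions: `Machine`, `Frame`, `walkStack` and `gcEvent` are invented names, each frame simply carries its own marking function, and the temporary stack is played by Haskell's own evaluation stack.

```haskell
-- Hypothetical stand-in for a heap address.
newtype HeapPtr = HeapPtr Int
  deriving (Eq, Show)

-- A stack frame together with the call-site specific code that knows
-- which of its slots hold live heap pointers.
data Frame = Frame
  { slots     :: [Int]
  , markSlots :: [Int] -> [HeapPtr]
  }

-- The parts of the machine state that a GC event has to preserve.
data Machine = Machine
  { registers :: [Int]
  , stackPtr  :: Int
  , stack     :: [Frame]  -- innermost frame first
  }

-- Peel back one frame at a time until the bottom of the stack is reached.
walkStack :: [Frame] -> [HeapPtr]
walkStack []             = []
walkStack (frame : rest) = markSlots frame (slots frame) ++ walkStack rest

-- One GC event: returns the roots that were found and the machine state
-- that is handed back to the running program.
gcEvent :: Machine -> ([HeapPtr], Machine)
gcEvent machine = (roots, resumed)
  where
    savedRegisters = registers machine  -- save registers to memory
    savedStackPtr  = stackPtr machine   -- save stack pointer
    roots          = walkStack (stack machine)  -- run the marking code
    resumed        = machine            -- restore stack pointer and registers
      { registers = savedRegisters
      , stackPtr  = savedStackPtr
      }
```

Since each frame brings its own `markSlots`, the walker itself needs no knowledge of any particular stack layout.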

Using this approach for exceptions involves stack cutting and a more advanced transfer of control, which will be discussed in a later post.

In conclusion, these are the advantages of native-code stack walking:

Allows for objects to span registers as well as stack slots.
Separates the concerns of the optimizer, the garbage collector and the code generator.
Might be a little bit faster than dynamic stack walking since the stack layout is statically encoded.

A few updates.

Not much has been put up on this blog lately but work is still going on under the hood. The most significant changes in the pipeline are proper tail-calls and a copying garbage collector.

As it stands now, LHC uses the C stack extensively but this is obviously not ideal as it makes garbage collection, exceptions and tail-calls nigh impossible to implement. Since the ideal solution of using a third-party target language isn't available (neither LLVM nor C-- supports arbitrary object models), I've decided to slowly inch closer to a native code generator for LHC. It is fortunate that I find Joao Dias' dissertation nearly as interesting as the GRIN paper.

The first step would be to make the stack layout explicit in the GRIN code. This is necessary but not sufficient for tail-calls (some register coalescing is also required; more on this later). More importantly, accurate garbage collection now becomes a possibility. The way I want to implement garbage collection (and exceptions for that matter) is through alternative return points. This is one of three methods discussed in a C-- paper by Norman Ramsey and Simon Peyton Jones for implementing exceptions. I believe this method is versatile enough for garbage collection as well.

The concept revolves around using specialized code at each call site that knows enough about the stack layout to mark root pointers and to jump to the next stack frame. I will describe the details in another blog post. An interesting point is that the garbage collectors could be written in user-code instead of in the RTS.
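For readers unfamiliar with alternative return points, here is a loose model of the idea in continuation-passing Haskell. It is a sketch only: `ReturnPoints`, `checkedDiv` and `describeDiv` are invented for illustration, and a native implementation would use code addresses associated with the call site rather than Haskell functions.

```haskell
-- The return points a call site offers to its callee.
data ReturnPoints a r = ReturnPoints
  { normalReturn    :: a -> r       -- the usual continuation
  , alternateReturn :: String -> r  -- taken instead of returning normally
  }

-- The callee picks the return point itself, so the caller never has to
-- inspect a status code after the call.
checkedDiv :: Int -> Int -> ReturnPoints Int r -> r
checkedDiv _ 0 k = alternateReturn k "divide by zero"
checkedDiv x y k = normalReturn k (x `div` y)

-- A call site that supplies both return points.
describeDiv :: Int -> Int -> String
describeDiv x y = checkedDiv x y points
  where
    points = ReturnPoints
      { normalReturn    = \q -> "result: " ++ show q
      , alternateReturn = \msg -> "failed: " ++ msg
      }
```

For garbage collection, the alternate return point would presumably be the frame-specific marking code rather than an error handler.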

So, to recap: Accurate garbage collection is just around the corner and proper tail-calls will follow on its heels. These two missing features are the reason that so many of the benchmarks fail to run for LHC.