Version 0.6.20090126 has been released. It has been more than a month since our last release and we've made a lot of progress. The code is available from Hackage and can be installed as such:
cabal install lhc -fwith-base
Here's our changelog:
Fixed type classes.
Better variable ids.
Base library reorganization.
Better support for non-Linux systems.
Removed tagging on Int and Word.
Got Control.Arrow and Control.Applicative working by improving the handling of (->) as well as fixing type classes.
More extensive testsuite.
Lots of code clean-up.
Future effort will be directed at adding Integer support in the base library, improving the efficiency of LHC and restoring control-flow analysis.
Code selection through object code optimization by Davidson and Fraser describes a compiler architecture where, instead of optimizing an AST and then generating code for a target machine, the compiler immediately expands the AST into worst-case instructions, optimizes those, and only then emits assembly. The instructions being generated and optimized are called register-transfer lists, or RTLs.
RTLs are quite simple - a solitary RTL is just a single abstract machine instruction. A sequence of RTLs might look like so:
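The figure that originally illustrated such a sequence is missing here. As a hedged sketch in Haskell (the Reg and RTL constructors below are made up for illustration and are not Davidson and Fraser's actual notation), a short RTL sequence might be modeled as:

```haskell
-- Illustrative only: a tiny RTL-like representation. Each constructor
-- is meant to correspond to exactly one target-machine instruction.
data Reg = Reg Int deriving (Show, Eq)

data RTL
  = Load  Reg Reg      -- r_dst := mem[r_addr]
  | Move  Reg Reg      -- r_dst := r_src
  | AddI  Reg Reg Int  -- r_dst := r_src + constant
  | Store Reg Reg      -- mem[r_addr] := r_src
  deriving (Show, Eq)

-- A worst-case expansion of something like '*q = *p + 1':
example :: [RTL]
example =
  [ Load  (Reg 1) (Reg 0)    -- r1 := mem[r0]
  , AddI  (Reg 2) (Reg 1) 1  -- r2 := r1 + 1
  , Store (Reg 2) (Reg 3)    -- mem[r3] := r2
  ]
```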
An RTL must satisfy a simple property, the machine invariant, which states that every RTL maps to a single instruction on the target machine.
The compilation process works like this: the input language (for example C--, GRIN, or any AST for that matter) is fed to the code expander, which takes the input AST and generates a naive sequence of RTLs representing it. No attention is paid to the quality of the generated code at this point, which keeps the code expander's job simple and easy. The code expander is, however, a machine-dependent part of the backend: it requires knowledge of the target machine so that it may establish the machine invariant.
After expansion, the code is subsequently optimized using e.g. peephole optimization, common subexpression elimination and dead code elimination. These optimizer passes are machine-independent, but every optimization must make sure it does not violate the machine invariant.
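To give a feel for what a machine-independent peephole pass over RTLs might look like, here is a small hedged sketch in Haskell; the instruction set and the rewrite rules are invented for illustration:

```haskell
-- Illustrative peephole pass: delete moves to self, and fold a move
-- followed by an add that overwrites the moved-to register into a
-- single add. Each rewrite must preserve the machine invariant.
data Reg = Reg Int deriving (Show, Eq)

data RTL
  = Move Reg Reg      -- r_dst := r_src
  | AddI Reg Reg Int  -- r_dst := r_src + constant
  deriving (Show, Eq)

peephole :: [RTL] -> [RTL]
peephole (Move d s : rest)
  | d == s = peephole rest                              -- r := r is a no-op
peephole (Move d s : AddI d' s' n : rest)
  | s' == d && d' == d = peephole (AddI d s n : rest)   -- fold the move away
peephole (i : rest) = i : peephole rest
peephole [] = []
```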
After optimization, we perform code emission, the only other machine-dependent part of the compiler.
Applied to LHC, it is quite obvious that a new backend is necessary, and I believe this compilation method is both applicable and preferable. Combined with whole-program analysis and my plans for a runtime that is as small and simple as possible, it makes the compiler easier to retarget; I also want LHC to be a cross-compiler. C is clearly not the way to go, so minimizing the burden of retargeting seems like a good strategy. My eventual goal is for you to be able to take a Haskell application, compile it with LHC, and get decent assembly code for whatever architecture you wish.
The overall structure of the new backend I am thinking of is something like this (drawn very badly using Project Draw):
We will perform as many optimizations as possible on the whole program while it is in GRIN form, leaving the backend to do the task of peephole optimization/DCE/etc. in a machine-independent way, and then emit assembly code.
As for implementing an optimization engine to perform these operations, the approach described in An Applicative Control-Flow Graph Based on Huet's Zipper seems promising as well. Garbage collection also needs definite attention before we can get LHC to compile truly serious programs; this will take a little more time to think about and discuss with David. Eventually, we may even look into using João Dias' thesis research to automatically generate compiler backends, as I believe GHC HQ wants to do.
Well, I finally figured out why the only two test cases that were working were HelloWorld and Kleisli. The compiler had been implicitly generating a lot of hard-wired instances for Int, Word, CInt, and all their numerous cousins -- but it was generating the methods too late (during conversion from the HsSyn language to the E intermediate language) for the methods to be properly associated with their classes (which FrontEnd.Class does before typechecking even properly begins). Since none of us much liked the idea of having all this hardwired into the compiler, we decided that rather than try to adjust the machinery to work with the new handling of methods, we'd rather implement the instances in the library. So, that is what we have to do for every type in the following list:
Lhc.Prim.Int
Lhc.Basics.Integer
Data.Int.Int8
Data.Int.Int16
Data.Int.Int32
Data.Int.Int64
Data.Int.IntMax
Data.Int.IntPtr
Data.Word.Word
Data.Word.Word8
Data.Word.Word16
Data.Word.Word32
Data.Word.Word64
Data.Word.WordMax
Data.Word.WordPtr
Foreign.C.Types.CChar
Foreign.C.Types.CShort
Foreign.C.Types.CInt
Foreign.C.Types.CUInt
Foreign.C.Types.CSize
Foreign.C.Types.CWchar
Foreign.C.Types.CWint
Foreign.C.Types.CTime
So bear with us if this takes a while to iron out. We have managed to get mini-base to build and many of the tests to run now, though getArgs apparently doesn't compile yet.
Function calls in Haskell are typically far more numerous than in more traditional languages. This is in part due to laziness. Being lazy means that functions in Haskell do as little as possible to return a result. So to get all the data you need, you often have to call the functions multiple times.
Consider the following snippet of code:
upto :: Int -> Int -> [Int]
upto from to =
    if from > to
       then []
       else from : upto (from+1) to
Here's what the compiled function would look like:
-- Arguments and results are (usually) kept in registers.
-- We generate a new calling convention for each function.
upto from to =
  case from `gtInt` to of
    1 -> do
      -- Return a single tag representing '[]'.
      return [CNil]
    0 -> do
      -- Allocate 5 heap cells.
      heap <- allocate 5
      -- CInt is the constructor for Int.
      heap[0] := CInt
      heap[1] := from
      -- Fupto represents a suspended call to 'upto'.
      heap[2] := Fupto
      heap[3] := from+1
      heap[4] := to
      -- Return a node as three separate pieces.
      -- &heap[0] is the head of the list and &heap[2] is the tail.
      return [CCons, &heap[0], &heap[2]]
As we can see, calling this function will only give us a single node (the node in this case is either a CNil or a CCons with two arguments). We will have to call it again to get more information out of it. For example, fully computing 'upto 1 10' requires 11 calls to 'upto' (10 CCons nodes and 1 CNil).
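This one-node-per-call protocol can be mimicked in plain Haskell. The following sketch is not LHC's intermediate code; it just models each call returning a single node, with the tail left as a suspended call:

```haskell
-- One call to 'uptoNode' produces exactly one node; the tail of a
-- CCons is a suspended call that must be forced to get the next node.
data Node = CNil | CCons Int (() -> Node)

uptoNode :: Int -> Int -> Node
uptoNode from to
  | from > to = CNil
  | otherwise = CCons from (\() -> uptoNode (from + 1) to)

-- Forcing the whole spine makes one call per node produced:
-- 'uptoNode 1 10' yields 10 CCons nodes and a final CNil, 11 calls.
countCalls :: Node -> Int
countCalls CNil        = 1
countCalls (CCons _ k) = 1 + countCalls (k ())

toList :: Node -> [Int]
toList CNil        = []
toList (CCons x k) = x : toList (k ())
```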
Looking at the steps in 'upto' shows us that it isn't doing a whole lot. All variables (even arguments and results) are in registers and the data can easily fit in the cache. We could almost say that calling this function is as fast as looping in C. Let's add a bit more code and see what happens:
main = showMyList (upto 1 10)

showMyList []     = return ()
showMyList (x:xs) = do
  print x
  showMyList xs
The same code, now compiled to our intermediate language:
main = do
  heap <- allocate 3
  heap[0] := Fupto
  heap[1] := 1
  heap[2] := 10
  showMyList &heap[0]
showMyList lst = do
  -- Read the arguments to 'upto' from the 'lst' pointer.
  Fupto from to <- fetch lst
  -- Call 'upto'. The results are kept in registers.
  -- 'a' and 'b' are undefined if 't' is CNil.
  [t, a, b] <- upto from to
  -- Inspect the node tag.
  case t of
    CNil -> return [CUnit]
    CCons -> do
      -- Call print on 'a'. This call might invoke the garbage collector.
      print a
      -- Recurse on the tail.
      showMyList b
This looks rather good. We could be proud if this was the end. However, there is one thing that we haven't considered: garbage collection. The garbage collector may run when we call 'print', and if it does then the pointer we have in 'b' will no longer be valid. A common solution is to push 'b' to a stack (which the GC can walk and modify) and reload it after 'print' has returned. However, a stack allocation costs about as much as the call to 'upto' and hence incurs an overhead of nearly 50%.
Fortunately there's a way around this. We can "simply" have the garbage collector update all registers that contain heap pointers. Doing so isn't exactly easy but it does allow us to keep pointers in registers and to avoid all unnecessary stack allocations. The details of how to accomplish this will have to wait for another time.
It seems that the LLVM crowd have mistaken my blog posts for criticism of LLVM. Let me make it clear that I have nothing but respect for LLVM.
I've previously mentioned that LLVM doesn't support zero-overhead garbage collection. Big deal. It's about the same as saying LLVM doesn't answer whether P=NP. I apparently failed to convey that this is an unsolved problem in computer science.
Solving this problem isn't as simple as putting heap pointers in registers (although it is required). The real beef lies in determining which registers are heap pointers when it isn't known statically. Determining at run-time which registers are heap pointers is intimately tied to the data model of the high-level language. Doing this well in an agnostic way is an unsolved problem. (Note that pointer tagging is generally avoided).
Several people apparently took it personally when I mentioned writing yet another NCG. Let it be clear that I'd never rewrite any of LLVM nor wish to belittle the effort it takes to write a general compiler. I was merely talking about writing a non-optimizing translator from an extremely limited IR to machine code.
So:
LLVM is great. I do not wish to criticise any part of it.
What you guys have created is impressive. I do not wish to belittle your efforts.
LLVM is not a silver bullet. It does not solve all open questions in the academic world and no one expects it to.
It seems that I didn't do a good job at explaining why LHC won't be using LLVM. In this post I will elaborate on why some people think C-- has more promise than LLVM as a substrate for lazy, functional languages.
Let me start by making one thing clear: LLVM does have support for garbage collectors. I am not disputing that. However, as Henderson has shown, so does C and every other language. The question we have to ask is not "Does this environment support garbage collection?" but rather "How efficiently does this environment support garbage collection?".
To recap, Henderson's technique involves placing root pointers (the set of pointers which can be followed to find all live data) on a shadow stack. Since we manage this stack ourselves, it shouldn't be a problem for the GC to walk it. In short, each heap allocation incurs an unnecessary stack allocation, and heap pointers are never stored in registers for long.
Now what does this mean for languages like Haskell? Well, unlike programs written in more traditional languages, a Haskell application might very well do between 10 and 20 million heap allocations per second. Writing Haskell programs is more about producing the correct data stream than it is about performing the right side-effects. It's common for functions in Haskell to manipulate data without executing any side-effects. (Think spreadsheets.) This way of computing obviously requires a very cheap method of allocation. Performing 10 million unnecessary stack allocations per second would severely hurt performance, and not having heap pointers in registers could easily be equally devastating.
So what about LLVM? Shouldn't the built-in GC support in LLVM be more efficient than any cheap hack? Well, it turns out it isn't. The conflict between garbage collection and optimizations hasn't changed, and neither has the solution: disabling or bypassing optimizations. This in turn means unnecessary stack allocations and sub-optimal use of registers.
That LLVM hasn't solved the problem of zero-overhead garbage collection isn't too surprising. Solving it while staying agnostic of the data model is an open question in computer science. It is here that C-- differs from LLVM. C-- is a research project that aims at solving difficult problems such as supporting efficient GCs and cheap concurrency. LLVM, on the other hand, is an engineering project.
In conclusion: garbage collection in LLVM incurs unacceptable overhead, and while C-- and LLVM do have some overlap, the problems they're trying to solve are quite different.
Many people have pointed out that garbage collection can be implemented in a portable fashion in uncooperative environments such as C or LLVM. Specific details of how to do this can be found in Fergus Henderson's paper "Accurate garbage collection in an uncooperative environment".
Henderson's technique is simply to put each pointer to newly allocated data on the stack. This bypasses all concerns about which variables will be put in registers and which will be spilled to the stack. However, the overhead of never using registers and performing unnecessary stack allocations can be quite significant.
Since the goal of LHC is to generate efficient executables, shoehorning our execution model into a "hostile" environment like C or LLVM isn't an option. Hopefully, writing a simplistic NCG won't be too much trouble.
Well, I've been having some trouble with the typeclass implementation. I got rid of the long-standing bug where ill-typed rules were generated in E.FromHs, and I've gotten FrontEnd.Class to generate rules for method implementations which appear in instance declarations, but I haven't figured out how to implicitly fall back to the default implementation. The one approach I actually tried was to generate method implementations like:
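The example snippet originally shown here did not survive. As a purely hypothetical reconstruction (the mangled names are guessed from the type-checker error quoted below, not actual LHC output), the generated methods might have looked something like:

```
-- for an Order (Ord) instance on some type T that omits (/=),
-- fall back to the class default implementation:
Instance@.iLhc.Order./=.T = Instance@.iLhc.Order./=.default
```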
but unfortunately, the typechecker stated that Instance@.iLhc.Order./=.default was not in the type-checking environment, and I couldn't figure out why :-(.
It might be difficult to get a sense of how feature-complete LHC is just by looking at the wiki and the mailing list. In this post I will try to answer the common questions of what LHC supports and where LHC is going.
Let me start by introducing LHC itself. The LHC Haskell Compiler (LHC for short) was born out of JHC on a cold Tuesday morning, November 18th, 2008. Her birth was necessitated by a conflict of opinion between John Meacham (the author of JHC) and myself. We simply couldn't agree on which build system would be the most appropriate: make or cabal. Since then we've made many changes that are unlikely to be merged back into JHC.
As of this writing, LHC differs from JHC in the following ways:
Cabalized build system instead of Make.
Extensive use of Haskell libraries. We have six more dependencies than JHC.
We use 'derive' instead of the unwieldy 'DrIFT' package. This, among other things, makes it possible to generate HPC information.
We support both GHC-6.8.x and GHC-6.10.x.
We're available on Hackage.
We've eliminated some usage of unsafe IO. On a project of this magnitude, getting segfaults instead of type-errors isn't acceptable.
We're very liberal in granting commit access. It's easier to undo bad patches than to make people contribute.
Although we are not JHC, we are still similar enough to share many of the same disorders. Most importantly, we don't have support for exceptions, garbage collection or proper type-classes.
The issues with type-classes are purely due to bugs in our code, and hopefully we can resolve them without too much headache. It should be noted that since Monad is a type-class, there are very few programs that LHC can compile at this stage.
Exceptions and garbage collection are more fundamental problems, though. We simply cannot implement those features efficiently in C. Using C-- is an obvious solution, but unfortunately there are no usable C-- compilers. Using LLVM[1] is out of the question for the same reasons as C. We will probably end up writing our own native code generator.
All in all, our future plans are like this:
Disable features until we can correctly compile Haskell98 programs.
Extend our testsuite to cover most of the codebase.
Address performance issues. We are about 5-10 times slower than GHC and there are no fundamental reasons for this.
Write a code generator that supports exceptions and garbage collection.
[1]: On the surface it looks like LLVM has decent support for exceptions and garbage collection. However, the documentation shows that it is essentially no better than C.
Variable identification tags can contain four different types of information. In Haskell we would write it as such:
data Id
  = Empty         -- Unused binding. Eg: '\ _ -> ...'.
  | Etherial Int  -- Internal variable. Only used when type-checking.
  | Anonymous Int -- Anonymous variable created by the compiler.
  | Named Name    -- Named variable created by the user.
However, in LHC this data structure was unrolled and packed into an Int. The encoding was as follows:
Empty     = zero
Etherial  = negative numbers
Anonymous = even, positive numbers
Named     = odd, positive numbers, used as keys in a global hash table
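As a sketch of how such a packing could work (the exact scheme LHC used may differ; the mapping below is an assumption based on the description above, and 'Named' is shown carrying its hash-table key as an Int):

```haskell
-- Hypothetical packing of Id into a single Int, following the scheme
-- above. Indices are assumed to be positive so the cases stay disjoint.
data Id
  = Empty
  | Etherial Int   -- internal variable, positive index assumed
  | Anonymous Int  -- compiler-generated, positive index assumed
  | Named Int      -- key into a global hash table
  deriving (Show, Eq)

encode :: Id -> Int
encode Empty         = 0
encode (Etherial n)  = negate n    -- negative numbers
encode (Anonymous n) = 2 * n       -- even, positive
encode (Named k)     = 2 * k + 1   -- odd, positive

decode :: Int -> Id
decode 0 = Empty
decode n
  | n < 0     = Etherial (negate n)
  | even n    = Anonymous (n `div` 2)
  | otherwise = Named ((n - 1) `div` 2)
```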
This encoding gives us very fast operations on Sets and Maps, but it also punishes mistakes with a vengeance. The increased performance is definitely not worth it, and we've been working on untangling the Ids since day one.
As of today, I'm glad to say that we've finally restored the beautiful ADT and we can now hack without fear of segfaulting.