Notes on the LHC: LLVM

Showing posts with label LLVM. Show all posts

Saturday, January 17, 2009

LLVM is great.

It seems that the LLVM crowd have mistaken my blog posts for criticism of LLVM. Let me make it clear that I have nothing but respect for LLVM.

I've previously mentioned that LLVM doesn't support zero-overhead garbage collection. Big deal. It's about the same as saying LLVM doesn't answer whether P=NP. I apparently failed in conveying that this is an unsolved problem in computer science.

Solving this problem isn't as simple as putting heap pointers in registers (although it is required). The real beef lies in determining which registers are heap pointers when it isn't known statically. Determining at run-time which registers are heap pointers is intimately tied to the data model of the high-level language. Doing this well in an agnostic way is an unsolved problem. (Note that pointer tagging is generally avoided).

Several people apparently took it personally when I mentioned writing yet another NCG. Let it be clear that I'd never rewrite any of LLVM nor wish to belittle the effort it takes to write a general compiler. I was merely talking about writing a non-optimizing translator from an extremely limited IR to machine code.

So:

LLVM is great. I do not wish to criticise any part of it.
What you guys have created is impressive. I do not wish to belittle your efforts.
LLVM is not a silver bullet. It does not solve all open questions in the academic world and no one expects it to.

Why LLVM probably won't replace C--.

It seems that I didn't do a good job at explaining why LHC won't be using LLVM. In this post I will elaborate on why some people think C-- has more promise than LLVM as a substrate for lazy, functional languages.

Let me start by making one thing clear: LLVM does have support for garbage collectors. I am not disputing that. However, as Henderson has shown, so does C and every other language. The question we have to ask is not "Does this environment support garbage collection?" but rather "How efficiently does this environment support garbage collection?".

To recap, Henderson's technique involves placing root pointers (the set of pointers which can be followed to find all live data) on a shadow stack. Since we manage this stack ourself, it shouldn't be a problem for the GC to walk it.
In short, each heap allocation incurs an unnecessary stack allocation and heap pointers are never stored in registers for long.

Now what does this mean for languages like Haskell? Well, unlike programs written in more traditional languages, a Haskell application might very well do between 10 and 20 million heap allocations per second.
Writing Haskell programs is more about producing the correct data stream than it is about performing the right side-effects. It's common for functions in Haskell to manipulate data without execuing any side-effects. (Think spreadsheets.)
This way of computing obviously requires a very cheap method of allocation. Performing 10 million unnecessary stack allocations per second would severely hurt performance, and not having heap pointers in registers could easily be equally devastating.

So what about LLVM? Shouldn't the built-in GC support in LLVM be more efficient than any cheap hack? Well, it turns out it isn't. The conflict between garbage collection and optimizations haven't changed, and neither have the solution: disabling or bypassing optimizations. This in turn means unnecessary stack allocations and sub-optimal use of registers.

That LLVM haven't solved the problem of zero-overhead garbage collection isn't too surprising. Solving this while staying agnostic of the data model is an open question in computer science.
It is here C-- differs from LLVM. C-- is a research project that aims at solving difficult problems such as supporting efficient GCs and cheap concurrency. LLVM, on the other hand, is an engineering project.

In conclusion: garbage collection in LLVM incurs unacceptable overhead, and while C-- and LLVM do have some overlap, the problems they're trying to solve are quite different.

Thursday, January 15, 2009

The case against C/LLVM.

Many people have pointed out that garbage collection can be implemented in a portable fashion in uncooperative environments such as C or LLVM. Specific details of how to do this can be found in Fergus Henderson's paper "Accurate garbage collection in an uncooperative environment".

Henderson's technique is simply to put each pointer to newly allocated data on the stack. This bypasses all concerns about which variables will be put in registers and which will be spilled to the stack. However, the overhead of never using registers and performing unnecessary stack allocations can be quite significant.

Since the goal of LHC is to generate efficient executables, shoehorning our executive model into a "hostile" environment like C or LLVM isn't an option. Writing a simplistic NCG won't be too much trouble, hopefully.