It seems that I didn't do a good job of explaining why LHC won't be using LLVM. In this post I will elaborate on why some people think C-- holds more promise than LLVM as a substrate for lazy, functional languages.
Let me start by making one thing clear: LLVM does have support for garbage collectors. I am not disputing that. However, as Henderson has shown, so do C and every other language. The question we have to ask is not "Does this environment support garbage collection?" but rather "How efficiently does this environment support garbage collection?".
To recap, Henderson's technique involves placing root pointers (the set of pointers which can be followed to find all live data) on a shadow stack. Since we manage this stack ourselves, it shouldn't be a problem for the GC to walk it.
Because the collector only ever looks at the shadow stack, every live heap pointer has to be written to it before any operation that might trigger a collection, and read back afterwards. In short, each heap allocation incurs an unnecessary stack allocation, and heap pointers are never stored in registers for long.
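For concreteness, here is a minimal sketch of such a shadow stack in C. The names (`shadow_frame`, `shadow_top`, `gc_alloc`, `make_pair`) are illustrative, not taken from Henderson's paper or any particular runtime; the point is only to show where the extra stores and loads come from.

```c
#include <stddef.h>

/* One shadow-stack frame: a link to the caller's frame plus this
   function's root slots.  The collector walks the chain starting
   at shadow_top to find every live root.                         */
struct shadow_frame {
    struct shadow_frame *prev;
    size_t nroots;
    void **roots;
};

static struct shadow_frame *shadow_top = NULL;

void *gc_alloc(size_t bytes);   /* may trigger a collection */

/* Allocating a two-field cell: both arguments must be spilled to
   the shadow frame before the allocation and read back afterwards,
   whether or not the collector actually runs.                     */
void **make_pair(void *x, void *y)
{
    void *slots[2] = { x, y };
    struct shadow_frame frame = { shadow_top, 2, slots };
    shadow_top = &frame;                 /* push the frame         */

    void **cell = gc_alloc(2 * sizeof(void *));

    cell[0] = slots[0];                  /* read back: the GC may  */
    cell[1] = slots[1];                  /* have moved x and y     */

    shadow_top = frame.prev;             /* pop the frame on exit  */
    return cell;
}

/* The collector's side: enumerate every root on the shadow stack. */
void gc_mark_roots(void (*mark)(void **root))
{
    for (struct shadow_frame *f = shadow_top; f != NULL; f = f->prev)
        for (size_t i = 0; i < f->nroots; i++)
            mark(&f->roots[i]);
}
```

Every allocating function pays for the push, the stores, the reloads and the pop, even in runs where the collector never fires.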
Now what does this mean for languages like Haskell? Well, unlike programs written in more traditional languages, a Haskell application might very well do between 10 and 20 million heap allocations per second.
Writing Haskell programs is more about producing the correct data stream than it is about performing the right side-effects. It's common for functions in Haskell to manipulate data without executing any side-effects. (Think spreadsheets.)
This way of computing obviously requires a very cheap method of allocation. Performing 10 million unnecessary stack allocations per second would severely hurt performance, and not having heap pointers in registers could easily be equally devastating.
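To give an idea of the allocator such rates presuppose, here is a rough sketch of bump-pointer allocation in C (`heap_ptr`, `heap_limit` and `gc_collect` are illustrative names): the fast path is a compare, a branch and an add, and the heap check can be amortized over several allocations in the same block.

```c
#include <stddef.h>

static char *heap_ptr;     /* next free byte in the nursery */
static char *heap_limit;   /* end of the nursery            */

void gc_collect(size_t bytes_needed);   /* refills the nursery */

/* Fast-path allocation: a compare, a branch and an add.  This is
   the kind of cost that makes 10-20 million allocations per second
   feasible; a compiler will usually hoist one heap check to cover
   several allocations in the same block.                          */
static inline void *alloc(size_t bytes)
{
    if ((size_t)(heap_limit - heap_ptr) < bytes)
        gc_collect(bytes);              /* slow path, rarely taken */
    void *obj = heap_ptr;
    heap_ptr += bytes;
    return obj;
}
```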
So what about LLVM? Shouldn't the built-in GC support in LLVM be more efficient than any cheap hack? Well, it turns out it isn't. The conflict between garbage collection and optimizations hasn't changed, and neither has the solution: disabling or bypassing optimizations. This in turn means unnecessary stack allocations and sub-optimal use of registers.
That LLVM hasn't solved the problem of zero-overhead garbage collection isn't too surprising. Solving it while staying agnostic about the data model is an open problem in computer science.
This is where C-- differs from LLVM. C-- is a research project that aims to solve difficult problems such as supporting efficient GCs and cheap concurrency. LLVM, on the other hand, is an engineering project.
In conclusion: garbage collection in LLVM incurs unacceptable overhead, and while C-- and LLVM do have some overlap, the problems they're trying to solve are quite different.
"...a Haskell application might very well do between 10 and 20 million heap allocations per second."
I'm floored. If each allocation is ~10ns, you're still talking about spending 20% of your total CPU time doing allocations. What kind of times does C-- have for allocations? And is it done in parallel? I'm totally intrigued, since I'm working on my own language project and haven't yet put together a garbage collector for it.
"Performing 10 million unnecessary stack allocations per second would severely hurt performance, and not having heap pointers in registers could easily be equally devastating."
I think it'd be great to have those kinds of optimizations in a general-purpose codegen like LLVM (or other codegens; that's just the first one that came to mind), so that other languages can also benefit. Even if it's something that isn't turned on by default, it still sounds like a boon for other coders.
I have no quarrel with LLVM. See my "LLVM is great" post.
Jonathan, ideally each allocation would be around 5 cycles or so (you're just checking if there's enough space, and if so incrementing a pointer), so more like 2ns on a 2.5GHz machine (assuming cache misses are rare).
Can you substantiate your claim that "garbage collection in LLVM incurs unacceptable overhead"?
Jonathan:
Those numbers are from the most memory intensive program I could find. I will post a more thorough analysis when I get the time.
Flying Frog Consultancy Ltd:
That line could be changed to "garbage collection in LLVM incurs overhead". We know how to do it without overhead and, since Haskell applications usually spend a lot of time in GC, we think it would be worth trying out.
Also, GRIN features such as unboxed nodes can't be encoded with LLVM. Targeting LLVM wouldn't just be another backend; it would require us to change our intermediate language.
Why would you use the shadow stack if you want fast allocation? You should make safe points at function returns (and loops, if you want to do better than GHC), and then use the stack maps when you GC. If you do an incremental collector you need to implement your own lowering method for heap updates. LLVM has support for these things already, but you'll probably have to write a little C++ to get it fully customized.
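Roughly, the idea is that the compiler emits a table per call-site safe point saying which frame slots hold heap pointers, and only the collector ever consults it; the mutator pays nothing on its fast path. A sketch in C (all names illustrative, and the ABI-specific frame walking is elided):

```c
#include <stdint.h>

/* One stack map per call-site safe point, emitted by the compiler:
   it records which slots of that frame hold heap pointers.        */
struct stack_map {
    uint32_t        num_ptr_slots;
    const uint16_t *ptr_slots;      /* frame-relative slot indices */
};

/* Provided by the runtime: map a return address to the stack map
   for that call site (e.g. binary search in a sorted table).      */
const struct stack_map *lookup_stack_map(void *return_address);

/* Only the collector runs this, at a safe point.  Walking from one
   frame to its caller is ABI-specific and not shown.              */
static void scan_frame(void **frame_base, void *return_address,
                       void (*mark)(void **root))
{
    const struct stack_map *map = lookup_stack_map(return_address);
    for (uint32_t i = 0; i < map->num_ptr_slots; i++)
        mark(&frame_base[map->ptr_slots[i]]);
}
```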
To Jonathan: The cost of memory allocation is basically just the cost of the stores. You also need a heap check and a heap increment, but you amortize those over several allocations, so it's pretty much only the cost of the stores that's left. With the right kind of cache discipline the stores can be very cheap. The store writes to the cache, and full cache lines get flushed.
You can easily have programs allocating, say, 2Gb/s.
augustss:
Are you asking why I would use a shadow stack? If so, I don't understand that question. I'm arguing that /any/ unnecessary stack use is sub-optimal. It surprises me that this claim is controversial.
Consider a small function which may invoke the garbage collector. This function doesn't use all registers, and callers know this. The caller may decide to keep heap pointers in some of those registers for the duration of the call. How is the garbage collector to know which registers contain heap pointers? With unboxed nodes, the contents of some registers even determine the type of others.
We should only push data on the stack when we run out of registers, and garbage collection, like exceptions, should not impose an overhead on the mutator.
Are you interested in further discussion via email (or IRC)?
Oh, I misunderstood your point then. I thought you were referring to using the shadow stack (since you talked about it in one paragraph). I think Fergus' shadow stack is more overhead than one wants.
On the other hand, the cost of having pointers on the stack at the function return safe point is totally different. My gut feeling is that this is much, much smaller. But I would not dare to say one way or another without experiments.
It all boils down to how many live heap pointers you have at the function call, and if those might have had to be saved anyway by the caller/callee.
What evidence do you have that the number of heap pointer registers that need saving at a function call is significant? The number of pointers that need to be put on the stack is nowhere near the number of allocations made.
augustss:
I have no additional data beyond what was published in Boquist's paper. However, my experience tells me that function calls in Haskell are small and numerous, and hence it would be beneficial to make them as cheap as possible.
I don't think that saving pointers on the stack at a function call is a significant cost. Often the pointers are on the stack already (if they were stack-based function arguments). If the pointer was in a register-based function argument then it is caller-saves anyway. So the only case you have to worry about is pointers to objects created by the current function, or returned by previous calls.
I think that saving pointers on the stack at a function call is a significant cost. Too bad neither of us has anything but our gut feeling to go on. (:
Boquist talks about proper register usage as one of the most important optimizations, and, indeed, most of his thesis is about optimal register usage and its ramifications. We try quite hard to expose more data to the register allocator (all functions return unboxed values, for example), and that would be a waste of time if we pushed everything to the stack.
There can be quite an impact from offloading to the stack and then _reading_ the values back off the stack, especially on modern x64 platforms, but just checkpointing them down to the stack is pretty cheap.
To put some numbers to the 'gut feelings': For the former I've seen a factor of 4 or so slowdown, which closely reflects the factor of 5 that Xavier Leroy saw with OCaml (see p.39 of http://pauillac.inria.fr/~xleroy/talks/zam-kazam05.pdf), but almost all of that is caused by the fact that the AMD platform docs indicate that you should wait a few cycles after writing to a location on the stack before reading the values back in.
On the other hand, just checkpointing like Lennart suggests, where you only do the writes except in the exceptional collection case, doesn't really incur much overhead at all. Usually it stays in cache and gets written out over the bus in lulls, and the access delays don't tend to affect you, both because it is only on the exceptional path and because the time it takes to start the collection is longer than the access delay.
simonmar:
What is this about arguments being on the stack already? David does not propose to pass arguments on the stack either!
Edward Kmett:
I don't understand what you mean by checkpointing. Each stack write is always followed by a stack read. The exceptional case is when the GC changes the value on the stack.
If you checkpoint on, say, every heap allocation by ensuring all live registers have been written down to a shadow stack frame, you can generally keep executing with the values straight out of the registers. Though it may require tying your hands in other ways (limiting yourself to mark and sweep, etc.)
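Roughly, the difference from a classic shadow stack is that the writes are one-way; a sketch in C (all names illustrative):

```c
#include <stddef.h>

void *gc_alloc(size_t bytes);     /* may trigger a (non-moving) collection */

/* Slots the collector scans; a real implementation would chain
   per-function frames the way a shadow stack does.               */
extern void *checkpoint_slots[2];

void **make_pair(void *x, void *y)
{
    /* Checkpoint: write the live pointers down so a collection
       inside gc_alloc can find them ...                          */
    checkpoint_slots[0] = x;
    checkpoint_slots[1] = y;

    void **cell = gc_alloc(2 * sizeof(void *));

    /* ... but keep computing with the register/local copies of x
       and y.  Nothing is read back, which is only sound because
       the collector does not move objects.                       */
    cell[0] = x;
    cell[1] = y;
    return cell;
}
```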
An alternative approach that lets you employ LLVM internally is to make invocations using llvm.invoke with DWARF-style zero-cost exceptions instead of calls. When you try to perform an action that trips a collection, you throw a stack-overflow exception and wind down the stack, generating the shadow stack for collection purposes as you go.
The major problem with that approach is that it adds memory pressure (for allocating stack frames) at exactly the point when you can least afford it, but with a bounded stack size it is a known amount of overhead. I've used this latter model in the back end of a compiler that compiled down to C++.
That said, I'm a bit of a hypocrite, because my current pet project had to reject LLVM for reasons similar to your complaints. ;)