Monday, January 19, 2009

Functions in Haskell.

Function calls in Haskell are typically far more numerous than in more traditional languages. This is in part due to laziness: a lazy function does as little work as possible before returning a result, so to get all the data you need, you often have to call the same function several times.
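For instance, here's a runnable GHC sketch using the 'upto' function shown below: asking for only the first three elements of an enormous list forces only three calls, no matter how large the range is.

```haskell
-- 'upto' as defined in the post.
upto :: Int -> Int -> [Int]
upto from to =
  if from > to
    then []
    else from : upto (from + 1) to

main :: IO ()
main = print (take 3 (upto 1 maxBound))
-- Only three cons cells are ever built; the remaining calls
-- stay as unevaluated thunks and never run.
```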

Consider the following snippet of code:

upto :: Int -> Int -> [Int]
upto from to
  = if from > to
      then []
      else from : upto (from+1) to

Here's what the compiled function would look like:

-- Arguments and results are (usually) kept in registers.
-- We generate a new calling convention for each function.
upto from to
  = case from `gtInt` to of
      1 -> do -- Return a single tag representing '[]'.
              return [CNil]
      0 -> do -- Allocate 5 heap cells.
              heap <- allocate 5
              -- CInt is the constructor for Int.
              heap[0] := CInt
              heap[1] := from
              -- Fupto represents a suspended call to 'upto'.
              heap[2] := Fupto
              heap[3] := from+1
              heap[4] := to
              -- Return a node as three separate pieces.
              -- &heap[0] is the head of the list and &heap[2] is the tail.
              return [CCons, &heap[0], &heap[2]]

As we can see, calling this function will only give us a single node (the node in this case is either a CNil or a CCons with two arguments). We will have to call it again to get more information out of it. For example, fully computing 'upto 1 10' requires 11 calls to 'upto' (10 CCons nodes and 1 CNil).
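A quick way to check this call count in ordinary GHC is to instrument 'upto' with Debug.Trace (a sketch; note that trace output goes to stderr):

```haskell
import Debug.Trace (trace)

-- Each time a suspended 'upto' application is forced,
-- the trace message fires once.
upto :: Int -> Int -> [Int]
upto from to = trace "upto forced" $
  if from > to
    then []
    else from : upto (from + 1) to

main :: IO ()
main = print (length (upto 1 10))
```

Computing the length forces the whole spine, so "upto forced" appears 11 times: once for each of the 10 CCons nodes and once for the final CNil.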

Looking at the steps in 'upto' shows us that it isn't doing a whole lot. All variables (even arguments and results) are in registers and the data can easily fit in the cache. We could almost say that calling this function is as fast as looping in C. Let's add a bit more code and see what happens:

main = showMyList (upto 1 10)

showMyList [] = return ()
showMyList (x:xs)
  = do print x
       showMyList xs

The same code, now compiled to our intermediate language:

main = do heap <- allocate 3
          heap[0] := Fupto
          heap[1] := 1
          heap[2] := 10
          showMyList &heap[0]

showMyList lst
  = do -- Read the arguments to 'upto' from the 'lst' pointer.
       Fupto from to <- fetch lst
       -- Call 'upto'. The results are kept in registers.
       -- 'a' and 'b' are undefined if 't' is CNil.
       [t, a, b] <- upto from to
       -- Inspect the node tag.
       case t of
         CNil  -> return [CUnit]
         CCons -> do -- Call print on 'a'. This call might invoke the garbage collector.
                     print a
                     -- Recurse on the tail.
                     showMyList b


This looks rather good. We could be proud if this were the end of the story. However, there is one thing we haven't considered: garbage collection. The garbage collector may run when we call 'print', and if it does, the pointer we hold in 'b' will no longer be valid.
A common solution is to push 'b' onto a stack (which the GC can walk and modify) and reload it after 'print' has returned. However, a stack allocation costs about as much as the call to 'upto' and hence incurs an overhead of nearly 50%.
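In the intermediate language, this stack-based workaround would change the CCons branch of 'showMyList' to something like the following sketch, where 'push' and 'pop' are hypothetical primitives for a stack the garbage collector can walk:

```
CCons -> do -- Save 'b' somewhere the garbage collector can find and update it.
            push b
            print a
            -- Reload the pointer; the GC may have moved the node.
            b <- pop
            showMyList b
```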

Fortunately, there's a way around this: we can "simply" have the garbage collector update all registers that contain heap pointers. Doing so isn't exactly easy, but it does allow us to keep pointers in registers and avoid all unnecessary stack allocations.
The details of how to accomplish this will have to wait for another time.

13 comments:

  1. Great post. I got one question though.

    When showMyList fetches the lst pointer, supposedly lst points to a generic thunk. But here you assumed the thunk must be a call to upto. Did you skip the details for ease of presentation, or were there other things I didn't know of?

  2. Wei Hu:

    LHC uses whole-program optimizations. If there is more than one possibility for 'lst' then we would look at the tag to determine which function to run. It would look something like this 'case tag of Fupto from to -> upto from to; Fotherfunc a b c -> otherfunc a b c'.

    Making all function calls explicit means we can use very fast calling conventions.

  3. How aggressive is the strictness analysis?

  4. As of yet, we only perform a very conservative strictness analysis. Further exploration of this is needed.

  5. How come you put (from+1) in heap[3]?
    Some global kind of strictness analysis could do this, but if it's very conservative I think you'd need to build a closure for (from+1) instead.

  6. Ooops, I take it back. The from variable is obviously evaluated. No problem there.

  7. I don't understand. How can a stack allocation be almost as expensive as the call to upto?
    The call to upto involves a call instruction, a comparison (if the from and to values are boxed, this also involves memory loads), then a heap pointer compare, 5 stores to the heap, and some return stuff.
    That's a fair number of instructions; the push and pop of b are maybe at most 10% of those instructions.
    That's assuming 'print a' doesn't need to save b because it's running out of registers. On x86 it most likely will, so saving b in showMyList will cost you nothing extra.
    Only if 'print' happens to be exceedingly simple would you lose.

  8. augustss:
    He has an AMD64 and it's giving him delusions of having enough registers for anything ;-P.

  9. augustss:

    The 'upto' call does some register fiddling + a write to main memory. A stack allocation does a bit less register fiddling + a smaller write to main memory. My point is that a stack allocation is conceptually very similar to the code we're interested in running.

    I'm betting that function calls are small and numerous, and that machines will get more and more registers over time.

  10. I don't understand which upto you're talking about. The one in your example? It does 5 memory writes.
    BTW, doesn't it need to evaluate from and to as well?
    Or does your strictness analysis figure out that it doesn't have to?

    I suggest you translate these little functions to assembly code, and try with and without saving b. That would be an experiment, rather than opinions (which I'm equally full of :).

  11. I will definitely do some experiments when the NCG is ready.

  12. I thought jhc was to use some form of memory region inference instead of garbage collection.... whatever happened to that? did it turn out to be impractical?

  13. jcpetruzza:

    Yes, it turned out to be impractical.
