RTLs are quite simple - a solitary RTL is just a single abstract machine instruction. A sequence of RTLs might look like so:
t[1] := reg[12]
t[2] := mem[8] * reg[0]
t[3] := t[1] + reg[3]
t[4] := t[2] * 10
An RTL must satisfy a simple property - the machine invariant - which states that any RTL maps to a single instruction on the target machine.
The compilation process goes like so: the input language (for example, C--, GRIN, or any AST for that matter) is fed to the code expander, which takes the input AST and generates a simple sequence of RTLs representing it. No attention is paid to the quality of the generated code at this point, which keeps the code expander's job simple and easy. The code expander, however, is a machine-dependent part of the backend: it requires knowledge of the target machine so that it can establish the machine invariant.
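To make this concrete, here is a minimal sketch in Haskell of what such a code expander could look like. The names (Operand, RTL, Expr, expand) are made up for illustration - this is not LHC's actual code, just the shape of the idea: allocate a fresh temporary for every intermediate result and emit one RTL per operation, with no cleverness whatsoever.

-- A hypothetical RTL representation: each constructor is meant to map onto
-- a single instruction of the target machine (the machine invariant).
data Operand = Temp Int        -- t[n], an abstract (virtual) register
             | Reg  Int        -- reg[n], a hardware register
             | Mem  Int        -- mem[n], a memory location
             | Imm  Int        -- an immediate constant
             deriving (Eq, Show)

data RTL = Move Operand Operand           -- dst := src
         | Add  Operand Operand Operand   -- dst := a + b
         | Mul  Operand Operand Operand   -- dst := a * b
         deriving (Eq, Show)

-- A toy source expression standing in for the real input AST (C--, GRIN, ...).
data Expr = Lit Int | Register Int | Memory Int
          | Expr :+: Expr
          | Expr :*: Expr

-- Worst-case expansion: a fresh temporary per intermediate result,
-- one RTL per operation.  The Int argument is the next unused temporary.
expand :: Int -> Expr -> (Operand, [RTL], Int)
expand n (Lit i)      = (Imm i, [], n)
expand n (Register r) = (Reg r, [], n)
expand n (Memory m)   = (Mem m, [], n)
expand n (a :+: b)    = binop Add n a b
expand n (a :*: b)    = binop Mul n a b

binop :: (Operand -> Operand -> Operand -> RTL)
      -> Int -> Expr -> Expr -> (Operand, [RTL], Int)
binop op n a b =
  let (ra, as, n')  = expand n  a
      (rb, bs, n'') = expand n' b
      t             = Temp n''
  in (t, as ++ bs ++ [op t ra rb], n'' + 1)

For example, expand 0 ((Memory 8 :*: Register 0) :+: Register 3) produces a worst-case sequence much like the one shown at the top of the post.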
After expansion, the code is optimized using, for example, peephole optimization, common subexpression elimination, and dead code elimination. These optimizer passes are machine-independent, but every optimization must make sure it does not violate the machine invariant.
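As a sketch of what one of these machine-independent passes might look like, here is a toy dead code elimination over a straight-line RTL block, reusing the hypothetical types from the sketch above. It only ever deletes whole RTLs that define a temporary nobody reads later, and never rewrites the survivors, so the machine invariant is trivially preserved.

-- Walk the block backwards, tracking which temporaries are still needed.
-- An RTL whose only effect is to define a dead temporary is dropped;
-- everything else is kept untouched.
deadCodeElim :: [RTL] -> [RTL]
deadCodeElim = reverse . go [] . reverse
  where
    go _    []     = []
    go live (r:rs) =
      case def r of
        Just t | t `notElem` live -> go live rs                  -- dead: drop it
        _                         -> r : go (uses r ++ live) rs  -- keep it

    -- the temporary an RTL defines, if its destination is a temporary
    def (Move d _)   = temp d
    def (Add  d _ _) = temp d
    def (Mul  d _ _) = temp d

    -- the temporaries an RTL reads
    uses (Move _ s)   = tempsOf [s]
    uses (Add  _ a b) = tempsOf [a, b]
    uses (Mul  _ a b) = tempsOf [a, b]

    tempsOf = concatMap (maybe [] (:[]) . temp)
    temp (Temp n) = Just n
    temp _        = Nothing

A real pass would of course have to be more careful about registers and memory that are live across block boundaries; the point is only that the pass never needs to know anything about the target.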
After optimization comes code emission, the only other machine-dependent part of the compiler.
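And a correspondingly tiny sketch of the emitter half, again with made-up names and a made-up three-address assembly syntax - the real thing would run after register allocation, so no temporaries should survive to this point:

-- Pretty-print each RTL as one instruction of an imaginary three-address
-- target.  This is the only other place that knows about the machine.
emit :: RTL -> String
emit (Move d s)  = "  mov " ++ opnd d ++ ", " ++ opnd s
emit (Add d a b) = "  add " ++ opnd d ++ ", " ++ opnd a ++ ", " ++ opnd b
emit (Mul d a b) = "  mul " ++ opnd d ++ ", " ++ opnd a ++ ", " ++ opnd b

opnd :: Operand -> String
opnd (Temp n) = "t" ++ show n        -- shouldn't survive register allocation
opnd (Reg n)  = "r" ++ show n
opnd (Mem n)  = "[" ++ show n ++ "]"
opnd (Imm n)  = show n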
As for LHC, it is quite obvious a new backend is necessary, and I believe this compilation method is both applicable and preferable. Combined with whole-program analysis and my plans for a runtime that is as small and simple as possible, it makes the compiler easier to retarget, and I want LHC to be a cross-compiler as well. It's obvious C is not the way to go, so minimizing the burden of retargeting seems like a good strategy. My eventual plan is for you to be able to take a Haskell application, compile it with LHC, and get nice assembly code for whatever architecture you wish.
The overall structure of the new backend I am thinking of is something like this (drawn very badly using Project Draw):
We will perform as many optimizations as possible on the whole program while it is in GRIN form, leaving the backend to do peephole optimization, DCE, etc. in a machine-independent way and then emit assembly code.
As for implementing an optimization engine with which to perform these operations, the approach described in "An Applicative Control-Flow Graph Based on Huet's Zipper" seems promising as well. Garbage collection also needs to be addressed before we can get LHC to compile truly serious programs; that will take a little more time to think about and discuss with David. Eventually, we may even look into using Joao Dias's thesis research to automatically generate compiler backends, as I believe GHC HQ wants to do.
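For the curious, the zipper idea is easy to sketch in a few lines of Haskell. The names here are made up and far simpler than what the paper actually builds (it zips over whole control-flow graphs, not just one block), but the flavour is the same: a purely functional cursor into a block that lets a pass inspect and edit the instruction in focus without mutation.

-- A zipper over a single basic block of RTLs: the instructions already
-- visited are kept in reverse order; the focus, if any, is the head of
-- the second list.
data BlockZipper = BlockZipper
  { before :: [RTL]   -- instructions above the focus, nearest first
  , rest   :: [RTL]   -- the focus and everything below it
  }

fromBlock :: [RTL] -> BlockZipper
fromBlock = BlockZipper []

toBlock :: BlockZipper -> [RTL]
toBlock (BlockZipper b r) = reverse b ++ r

-- Move the focus one instruction down, if possible.
next :: BlockZipper -> Maybe BlockZipper
next (BlockZipper b (r:rs)) = Just (BlockZipper (r:b) rs)
next _                      = Nothing

-- Replace the instruction at the focus, e.g. from a peephole rewrite.
replaceFocus :: RTL -> BlockZipper -> BlockZipper
replaceFocus r (BlockZipper b (_:rs)) = BlockZipper b (r:rs)
replaceFocus _ z                      = z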
Isn't this what LLVM does? Feed it simple register transfer instructions - magic happens and optimized executable code files are emitted.
@Carlton:
LLVM's intermediate representation - while it has an infinite supply of registers before register allocation, just like RTLs - is an SSA representation of a program.
The funny thing is that SSA is actually a functional intermediate form, in fact analogous to our first intermediate form, GRIN (this is covered in Boquist's thesis). Just as variables in SSA are assigned only once, so are variables in GRIN. In SSA form it is traditional to use phi nodes to join control flow; in GRIN, the only thing that can join several paths of control is a case statement, used like so:
(case v0 of
    ... -> ...; unit a1
    ... -> ...; unit a2
    ... -> ...; unit a3
); \y -> ...
Note how we bind the result of the case statement using a lambda abstraction; in SSA this would be written as
y <- phi(a1,a2,a3)
So LLVM uses an SSA intermediate form, and in a sense, so do we (see "SSA is Functional Programming" by Andrew Appel).
The Davidson-Fraser compiler architecture, however, is not concerned with the intermediate representation. It is instead an approach for taking an already existing intermediate form and converting it to executable code - the RTLs do not contain phi instructions; they are simple abstract machine instructions that are agnostic of the processor.
The reason this approach works is the built-in check: the machine invariant (note this term is from Joao Dias's paper; it isn't stated exactly as such in the original paper). The machine invariant makes the code expander machine-dependent - it must know whether it can generate RTLs of the form:
t[9] := x*t[2]
t[10] := t[9]+t[4]
Or RTLs of the form:
t[9] := x*t[2]+t[4]
Whether or not the target can do a multiply and an add in one instruction affects the output of the code expander. Its job, however, is simple, because it generates worst-case code, so it is easy to build new expanders for new architectures.
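To illustrate, here is a tiny sketch of how the invariant might show up in code, reusing the Operand type from the sketch in the post above. Everything else here (Target, hasMulAdd, the toy RTL2 type with its fused constructor) is invented for the example - the point is only that a peephole combiner may fuse the multiply and the add into one RTL exactly when the target really has such an instruction; otherwise it must leave the two-RTL worst-case form alone.

-- A made-up description of what the target can do.
data Target = Target { hasMulAdd :: Bool }

-- A toy RTL type local to this example, with an optional fused form.
data RTL2 = Mul2 Operand Operand Operand             -- t := a * b
          | Add2 Operand Operand Operand             -- t := a + b
          | MulAdd2 Operand Operand Operand Operand  -- t := a * b + c
          deriving Show

-- Fuse "t := a*b; u := t+c" into "u := a*b + c" only when the target has
-- a real multiply-add instruction (a real pass would also check that t
-- is not needed anywhere else).
combine :: Target -> [RTL2] -> [RTL2]
combine tgt (Mul2 t a b : Add2 u t' c : rs)
  | hasMulAdd tgt && t == t'
  = MulAdd2 u a b c : combine tgt rs
combine tgt (r : rs) = r : combine tgt rs
combine _   []       = []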
The optimizations built over the RTLs simply optimize them in a machine-independent way, until we finally use another lightweight machine-dependent part, the emitter, to emit the optimized RTLs as assembly code.
The two technologies are not mutually exclusive because they solve different problems, but I'm not sure if LLVM as an architecture supports compilation to separate processors in this manner.
I don't mean to speak badly of Davidson's research (I know him, he's a great guy and his research was cutting edge). However, RTLs are quite dated, and most modern optimizations are written for CFGs with instructions in SSA form. There is a reason that gcc is trying to switch away from them.
Honestly, I think your best bet is to generate your worst case code for llvm bitcode. They already have a great slew of optimizations, and have a pattern matching instruction selector for many target platforms.
More importantly, the beauty of llvm bitcode is that it is both low level, and largely platform independent. It's easy to overlook how simple design choices such as the get-element-pointer instruction allow the bitcode to be agnostic to target features such as data alignment, endianness, etc. Similarly, calling conventions are deferred until after lowering.
But, judging from talks on the llvm-dev list, you are unhappy with llvm's support for garbage collection. Let me also recommend a book "Compiling with Continuations" by Andrew Appel -- it talks about an intermediate representation featuring instruction-level continuation support, and is aimed at people writing compilers for functional languages.
If your goal is high-performance Haskell, it would seem to me that the area with the biggest payoff would be in garbage collection and compilation strategies that yield better intermediate code. It will be difficult to build a code generator for Intel, PPC, and ARM that will outperform LLVM (unless you can devote many programmer-years to it). If LLVM prevents you from implementing a better/different garbage collector, then get LLVM to adapt to what you need.
@Austin:
When you moved your lhc files from http://thoughtpolice.stringsandints.com/code/lhc/ to http://thoughtpolice.stringsandints.com/code/haskell/lhc/, you broke the image! I fixed it for you, though.
Also, was SVG really necessary?
Carlton Mills:
We /are/ planning on having an LLVM backend. However, if EHC is any indication, a specialized NCG could be significantly faster. Also, see the previous posts on why fixing LLVM isn't as straightforward as one might think.
RTLs aren't completely exhausted. For example.
(Note that the authors of that paper are now both working on C--.)
@Samuel: Project Draw inserted a bad logo; SVG was the only way I could cut it out. I'll probably try to just use the Haskell diagrams library from now on. :)
@Pseudonym: yes, the zipper paper is very nice, and Joao's dissertation concerns the same topic - building automatic code generators from RTL-based intermediate languages. I believe both of these papers in fact concern their compiler, Quick C--. :)
@Carlton & everybody: Since this is my first post I'll express my thoughts on LLVM - it's complete win. :) The haskell llvm bindings in particular are awesome (if incomplete,) and I'm very much interested in the project. I haven't read the work the people hacking on EHC have done (they have a paper behind Grin -> LLVM,) but in fun experiments, it's excellent that I can write a DSL, generate a bitcode file & optimize it, and just get standalone ASM for a multitude of architectures. This is in fact exactly what I'm describing in my post!
Right now, however, learning about LLVM and utilizing it fully just isn't at the top of my priority list; I'm squeezed for time as it is, to be honest. I could probably get something working given a little time - there seems to be quite a bit of interest in this, so it might be worthwhile, but we'll have to see. If anybody wants to work on this, or already has a branch of HEAD with even a little stuff there, get in contact (IRC is easiest!)