Ur-Scheme: A self-hosting compiler for a subset of R5RS Scheme to x86 asm
Ur-Scheme is a compiler from a small subset of R5RS Scheme to x86
assembly language. It can compile itself. It is free software,
licensed under the GNU GPLv3+. It might be useful as a base for a
more practical implementation (or a more compact one), or it might be
enjoyable to read, but it probably isn't that useful in its current
form.
Ur-Scheme is:
- Impractical. It hardly implements anything beyond what's in
R5RS, and it omits quite a bit of the stuff that is in R5RS; for
example, files, macros, vectors, eval, quasiquotation, call/cc, and
floating-point ("inexact numbers"). Only things that were needed for
the compiler to compile itself and run some unit tests are
included.
- Small. The implementation is about 1400 lines of Scheme, plus 700 lines of
comments and blank lines. Compiled with itself on x86 Linux and
stripped, the executable is 128kiB. On my 700MHz laptop from the
previous millennium, compiling it from scratch with MzScheme and gas
takes under ten seconds.
- Safe. Programs compiled with it do not crash or corrupt
their memory unless there is a bug in the compiler (er, except if they
run out of stack or heap). All types and array indices are
dynamically checked at run-time.
- Reasonably fast. It generates reasonably fast code
— when compiled with itself, it runs 35% faster than when it's
compiled with Chicken, and faster than when it's interpreted by any of
Guile, MzScheme, SCM, Elk, or Chicken's interpreter. It also runs
reasonably fast; on my laptop, it compiles about a thousand lines
of Scheme per second, which is not fast in absolute terms (on the same
hardware, gas assembles the resulting 50 000-line assembly file
in just over half a second, which is about 100 000 lines of
assembly per second) but seems to be faster than a lot of Scheme
compilers. I imagine that it would run faster if compiled with Ikarus
or Stalin, but I haven't tried Ikarus and can't get Stalin to
work.
- Portable. Last time I checked, it could compile itself
successfully in Guile, MzScheme (PLT Scheme), SCM, Elk, and Chicken,
although not Bigloo, Stalin, RScheme, or TinyScheme. Retargeting it
to a different processor wouldn't be a huge pain (after all, it's only
1400 lines of code) but wouldn't be that trivial either.
- Unit-tested. It comes with a fairly comprehensive set of
unit tests, and additionally, it compiles itself well enough that the
compiled version produces the same assembly-language output,
byte-for-byte, as when it's running interpreted in some other Scheme
— at least when it's compiling itself.
Downloading
If you want to get it, even though it's impractical, you can use
Darcs to snarf the source repository:
darcs get http://pobox.com/~kragens/sw/urscheme/
Or you can download the source
tarball of Ur-Scheme version 0.
If you have MzScheme installed, you can build it by just typing
"make".
Limitations
- The following are missing: macros, eval, quasiquotation,
first-class continuations, vectors, numerical constants with
radix prefixes, ports, two-argument "if", "let*", "letrec",
non-top-level defines ("internal definitions"), "do", promises,
non-integer numbers, bignums, load, apply,
case-insensitivity, multiple-value returns.
- Pairs are immutable, so there are no set-car! and set-cdr!.
Despite this, eqv? is not the same as equal? for lists.
- If you use some funky characters in identifiers, you'll get
syntactically invalid assembly-language output.
- Procedures are allowed to take fixed numbers of arguments, or any
arbitrary number of arguments, but the normal at-least-N syntax
(lambda (a b . rest) ...) is not supported.
- Rebinding standard procedures may break your program.
- However, some standard procedures are treated as special forms by
the compiler, so rebinding them will rarely have any effect.
- Integer arithmetic silently overflows.
- Most arithmetic operators are omitted; only +, -, quotient, and
remainder are present.
- Very many standard library procedures are missing. Only the
following 55 procedures are actually present: procedure? string?
make-string string-set! string-ref string-length car cdr cons
pair? symbol? symbol->string string->symbol display newline eq?
current-input-port read-char integer? remainder quotient <
eof-object? char? integer->char char->integer list length assq
memq memv append not string-append char-whitespace? char<?
char<=? char-between? char-alphabetic? = char=? eqv? equal?
string=? null? boolean? number? for-each map reverse string->list
list->string number->string string->number write.
- make-string only takes one argument.
- read-char takes a port argument and ignores it; write, display,
and newline do not take port arguments. current-input-port
returns nil.
- map and for-each only take two arguments.
- number->string only takes one argument.
- string->symbol will be slow with enough symbols.
Extensions
These are not in R5RS.
- (display-stderr foo) is like display, but uses stderr.
- (exit 37) makes the program exit with exit code 37.
- (error foo bar baz) aborts the program and sends to stderr
"error: " followed by foo, bar, and baz.
- 1+ and 1- are as in Common Lisp.
- (char->string c) is like (make-string 1 c).
- (escape string) returns a list of character strings which, when
concatenated, form a string representing the original string by
escaping all the backslashes, quotes, or newlines in the string
by preceding them with a backslash.
- #\tab is the tab character. I wouldn't think to mention it but
apparently Stalin doesn't support this.
Bugs
Output is unbuffered, which makes it slow.
I still don't have a garbage collector, and programs crash when
they run out of memory.
The values returned by output procedures are invalid.
+ and - aren't first-class procedures.
Origins
In February 2008, I wanted to write a metacompiler for Bicicleta, but I was
intimidated because I'd never written a compiler before, and nobody
had ever written a compiler either in Bicicleta or for
Bicicleta. So I thought I'd pick a language that other people knew a
lot about writing compilers for and that wasn't too hairy, write a
compiler for it, and then use what I'd learned to write the Bicicleta
compiler.
It took me about 18 days from the time I started on the project to
the time that the compiler could actually compile itself, which was a
lot longer than I expected.
I learned a bunch from doing it. Here are some of the main things
I learned:
- Metainterpreters are a better way to bootstrap.
Interpreters are a lot simpler than compilers, especially if they
don't have to run fast, and especially if you can write them in a
language like Scheme that gives you garbage collection and closures
for free. The restriction that the compiler had to be correct both in
R5RS Scheme and in the language that it could compile was really a
pain. For example, although I could add new syntactic forms to the
language that it could compile, and I could add new syntactic forms
with R5RS macro definitions, I couldn't simplify the compiler by
adding new syntactic forms, because the compiler can't compile R5RS
macro definitions. (There's a portable implementation of R5RS macros
out there, but it's about twice the size of the entire Ur-Scheme
compiler.) Similarly, I was stuck with a bunch of the boneheaded
design decisions of the Scheme built-in types — no auto-growing
mutable containers, separate types for strings and characters, and the
difficulty of doing any string processing without arithmetic and side
effects, for example.
- Start with the simplest thing that could possibly work. I
keep on learning this every year. In this case, the really expensive
thing was that I wanted normal function calls to be fast, so I used
normal C-style stack frames for their arguments. This seems to have
paid off in speed, but it meant I spent four days and about 200 lines
of code implementing lexical closures of unlimited extent. (Really!
According to my change log, from February 13 to February 16, I
basically didn't do anything else except add macros.) If I'd just
allocated all my call frames on the heap, the result would have been
slow, but I would have gotten done a lot sooner.
- Tail-recursion makes your code hard to read. More
traditional control structures, such as explicit loops, are both
terser and clearer. This is the largest Scheme program I've written.
Future Work
First, of course, there are the bugs to fix, especially including
the absence of a garbage collector.
If this compiler has any merit at all, it is in its small size and
comprehensibility. Darius
Bacon's brilliant 385-line "ichbins" self-compiling Lisp-to-C
compiler is much better at that, being less than one-fifth of the
size. So one direction of evolution is to figure out what can be
stripped out of it. ichbins has no arithmetic, no closures, no
separate string or symbol type, and only one side effect.
There's probably a lot that could be made clearer, as well.
Another direction is to try to improve the speed and size of
its output code a bit. For example:
- Lambda-lifting would keep let-expressions from consing
storage for all of the arguments of the outer procedure, and if you
were slightly more ambitious, you could even optimize away the
procedure call entirely.
- A number of the built-in procedures, especially some of the most
common ones, are rather small and would benefit from inlining.
Also, the existing inlining would benefit from a preliminary scanning
pass to see which globals are never mutated and can therefore
be inlined safely without breaking Scheme's semantics. This could
also remove the type-check, the two fetches, and the computed call
instruction, on the vast majority of procedure calls, and the
argument-count check in the vast majority of procedure prologues.
- Especially conditional expressions could benefit from
inlining into their parent if. Something like half of the
conditional expressions in the compiler are either (if (null? ...)
...) or (if (eq? ...) ...) or their inverses, and in
both cases the actual test is a single machine instruction; even the
call is six instructions, and is followed by two instructions to turn
the result-in-a-register back to a flag you can jump on. The
implementation of case is particularly inefficient because it
actually calls memv.
- A little bit of peephole optimization usually goes a long
way, but I'm not sure it would in this case; we could get a little bit
of dead-code removal (e.g. procedure epilogues preceded by an
unconditional jump) and there's the occasional pop %eax; push %eax;
movl foo, %eax sequence.
- The procedure prologues and epilogues are pretty large,
and they could probably be shrunk quite a bit, especially for
non-variadic procedures. It might be possible to arrange the stack
frame so that a simple leave; ret $12 sequence, or something
similarly simple, could replace the current eleven bytes of crap, at
least for non-variadic procedures. Tail calls are particularly
horrific; here's a three-argument tail call:
804977f: 50 push %eax
8049780: a1 20 89 06 08 mov 0x8068920,%eax
8049785: 8d 5c 24 08 lea 0x8(%esp),%ebx
8049789: 8b 55 fc mov 0xfffffffc(%ebp),%edx
804978c: 8b 65 f8 mov 0xfffffff8(%ebp),%esp
804978f: 8b 6d f4 mov 0xfffffff4(%ebp),%ebp
8049792: ff 33 pushl (%ebx)
8049794: ff 73 fc pushl 0xfffffffc(%ebx)
8049797: ff 73 f8 pushl 0xfffffff8(%ebx)
804979a: 52 push %edx
804979b: e8 fb e8 ff ff call 804809b <ensure_procedure>
80497a0: 8b 58 04 mov 0x4(%eax),%ebx
80497a3: ba 03 00 00 00 mov $0x3,%edx
80497a8: ff e3 jmp *%ebx
That's 43 bytes of code! In this case, as is typical, the function
being called (char-between?) is never redefined; it takes a
fixed number of arguments and is never called with any other number of
arguments; and it is not a closure. These observations would allow
the following abbreviated version:
50 push %eax
8d 5c 24 08 lea 0x8(%esp),%ebx
8b 55 fc mov 0xfffffffc(%ebp),%edx
8b 65 f8 mov 0xfffffff8(%ebp),%esp
8b 6d f4 mov 0xfffffff4(%ebp),%ebp
ff 33 pushl (%ebx)
ff 73 fc pushl 0xfffffffc(%ebx)
ff 73 f8 pushl 0xfffffff8(%ebx)
52 push %edx
e9 xx xx xx xx jmp _char_betweenP_4
That's only 28 bytes, a substantial improvement, but still ugly.
If we could skip the prologue of tail-called non-closure procedures
and just jump directly to the body, then in cases where the caller has
at least as many arguments as the callee, we could also avoid copying
the saved %esp, %ebp, and return address to a new location on the
stack. In that case we could simply do something like this:
8d 5c 24 08 lea 0x8(%esp),%ebx
8b 65 f8 mov 0xfffffff8(%ebp),%esp
ff 33 pushl (%ebx)
ff 73 fc pushl 0xfffffffc(%ebx)
50 push %eax
e9 xx xx xx xx jmp _char_betweenP_4
That's only 18 bytes. Still not that pretty, but no longer
disgusting.
- Storing symbols in a skip list or hash table would
probably help a lot with programs like compilers that use
string->symbol a lot.
Another direction is to make it more practical. For
example, improved error-reporting, a profiler, an FFI, and access to
files and sockets would be handy.
Another direction is to make it faster. Its assembly output
is largely comprised of fixed sequences of instructions stuck
together, but those instructions are recreated through many layers of
calling every time.