A disassembly of the 64-byte version of Klappquadrat
====================================================

<link rel="stylesheet" href="../style.css" />

Klappquadrat (`klapquad.com`) is a 31-byte MS-DOS display hack (a
category known as the "32b intro") written by an entity known as T$.
Packaged in `klapquad.zip` is also a 64-byte version `quad_64b.com`
that makes even nicer graphics.  Here's more or less how it looks in
DOSBox:

<img src="quad_64b.png" /> (video formats: <a
href="quad_64b.ogg">250kilobyte Ogg Theora</a> or <a
href="quad_64b.avi">4 megabyte ZMBV avi</a>)

Very Basics: The Source
-----------------------

T$ was kind enough to include the assembly source to the 31-byte
version:

    ;Klappquadrat 32b
    ;32 byte intro source by T$
    ;Greets to mados, cthulhu, spacey and neo

    org 100h

      mov    al,13h
      int    10h
      lds    ax,[bx]

    schleife:

    mov ax,di
    xor dx,dx
    mov bx,320
    div bx
    ;dx=x, ax=y

    add ax,cx
    add dx,cx
    and ax,dx
    shr ax,cl

    xor [di],al 

    inc di
    jnz schleife

    inc cx
    jmp short schleife

The 64-byte version looks fairly similar:

    kragen@thrifty:~/pkgs/klapquad$ objdump -D -b binary -m i8086 -M intel quad_64b.com

    quad_64b.com:     file format binary

    Disassembly of section .data:

    0000000000000000 <.data>:
       0:   b0 13                   mov    al,0x13
       2:   cd 10                   int    0x10
       4:   c5 07                   lds    ax,DWORD PTR [bx]
       6:   ba c9 03                mov    dx,0x3c9
       9:   b5 03                   mov    ch,0x3
       b:   66 c1 c8 08             ror    eax,0x8
       f:   3c 3f                   cmp    al,0x3f
      11:   72 04                   jb     0x17
      13:   b0 3f                   mov    al,0x3f
      15:   fe c4                   inc    ah
      17:   f6 c1 03                test   cl,0x3
      1a:   74 01                   je     0x1d
      1c:   ee                      out    [dx],al
      1d:   e2 ec                   loop   0xb
      1f:   89 f8                   mov    ax,di
      21:   31 d2                   xor    dx,dx
      23:   bb 40 01                mov    bx,0x140
      26:   f7 f3                   div    bx
      28:   01 c8                   add    ax,cx
      2a:   01 ca                   add    dx,cx
      2c:   21 d0                   and    ax,dx
      2e:   c1 e8 03                shr    ax,0x3
      31:   d3 e8                   shr    ax,cl
      33:   00 05                   add    BYTE PTR [di],al
      35:   47                      inc    di
      36:   75 e7                   jne    0x1f
      38:   41                      inc    cx
      39:   e4 60                   in     al,0x60
      3b:   fe c8                   dec    al
      3d:   75 e0                   jne    0x1f
      3f:   c3                      ret

The rest of this file is about 3000 words of my commentary on these 31
instructions.  I might be wrong about some things, because I'm pretty
ignorant about assembly language and MS-DOS, and this code is a bit
clever.  I'm mostly doing this to learn some of the tricks in the
code.

The Setup
---------

       0:   b0 13                   mov    al,0x13
       2:   cd 10                   int    0x10
       4:   c5 07                   lds    ax,DWORD PTR [bx]

This part is in the original source; it's a pretty standard way to
start out a small graphics intro, like something in the 32b or 64b
categories.  [fr-016][] starts the same way, but with a `les`
instead of a `lds`.  `ax` starts out as 0, and in `int 10h`, `ah`
selects the service.  Service 0 is setting the video mode; you specify
the video mode in `al`.  So the first two instructions set the video
mode to mode 13h, which is the very handy 320x200x256 "MCGA" mode
supported by almost all SuperVGA cards.

[fr-016]: fr-016.html "My analysis of Farbrausch's fr-016: bytes"

See [my notes on fr-016][fr-016] for how the `lds` thing works,
although `fr-016` uses `les` instead of `lds`.  Setting the `ds`
register like this makes the program code more or less inaccessible
for data operations.

The Palette Setup Loop
----------------------

       6:   ba c9 03                mov    dx,0x3c9
       9:   b5 03                   mov    ch,0x3

Note that there's a `loop` instruction below that jumps to `0xb`, so
the above two instructions are setup for a loop.  Here's the body of
the loop:

       b:   66 c1 c8 08             ror    eax,0x8
       f:   3c 3f                   cmp    al,0x3f
      11:   72 04                   jb     0x17
      13:   b0 3f                   mov    al,0x3f
      15:   fe c4                   inc    ah
      17:   f6 c1 03                test   cl,0x3
      1a:   74 01                   je     0x1d
      1c:   ee                      out    [dx],al
      1d:   e2 ec                   loop   0xb

Loops generally need to have some kind of side effect in them to be
useful, and in this case, it looks like the purpose of the loop is the
`out` instruction, which writes the byte `al` to the port in `dx`;
`dx` got set in the loop setup code above to 3c9h, and isn't modified
inside the loop, so it's always writing bytes to this same port.

It turns out that this is the port you write bytes to in order to set
up the VGA palette, so this loop is there to set up the palette.  This
25-byte loop (including the two-instruction setup) is also the major
difference between the 64-byte version and the 31-byte version whose
source is above.

The normal sequence, according to
<http://www.brackeen.com/vga/bitmaps.html#5>, is that you write the
palette index to port 3c8h, then the six-bit red, green, and blue
values in sequence to port 3c9h.  It says you can load the whole
palette by first writing a zero to 3c8h, and then writing all 256
palette entries in sequence to 3c9h.  So I hypothesize that in general
you can write any number of palette entries in sequence this way, and
the (emulated) card happens to default to setting palette entry zero
at bootup.

Interestingly, this part of the program seems to work differently in
FreeDOS under QEMU than in DOSBox; the palette I get in FreeDOS is the
black, red, orange, yellow, white palette you can see in the
screenshot above, while the one I see in QEMU ranges from green
through yellow to white, with no black.  I get a somewhat similar
effect, actually, if I run this program twice inside DOSBox, but it's
cyan instead of green.

This suggests that the problem is the initial state of the `eax`
register, and indeed if I insert an `xor eax, eax` at the beginning
of the program, it displays more correctly in FreeDOS in QEMU, and can
run more than once without screwing up the colors in DOSBox.

So, anyway, how does this loop produce this sequence of colors?

Presumably `ax` starts out as whatever the BIOS video mode routine
leaves in it.  Each time through the loop, we rotate it by 8 bits, so
every four times through the loop, it will be rotated back to its
original position.  The two bottom bits of the loop counter are tested
by the `test` instruction, and one time out of four, we skip the `out`
instruction, so one of the four bytes in `eax` is invisible.

This loop seems to set all 256 palette entries.  At first that was
what I expected it to do, since that would be 3 * 256 color
components, and we set `cx` to 3 * 256 by setting `ch` to 3.  But then
I thought that actually `cx` gets decremented by the `loop`
instruction even when we didn't output a color component, so I thought
it would only set the first 192 colors.  But if I increase `ch` to 4,
then the first many colors all get set to white, I guess because it
wrapped around.

So, anyway, what were those colors?  Here's the code that computes them.

       b:   66 c1 c8 08             ror    eax,0x8
       f:   3c 3f                   cmp    al,0x3f
      11:   72 04                   jb     0x17
      13:   b0 3f                   mov    al,0x3f
      15:   fe c4                   inc    ah

So first it rotates `eax` by a byte; then it checks to see if the low
byte is greater than or equal to 0x3f, which is 63, the largest
six-bit value.  If so, it thresholds it to 0x3f and increments `ah`,
the next byte up.  So when red reaches its max, green starts to
increment, when green reaches its max, blue starts to increment; when
blue reaches its max, the invisible byte starts to increment; and when
the invisible byte reaches its max, red starts to increment.  So, if
we start with some large number in the invisible byte, some small
number in the red byte, and zeroes elsewhere, this will give us the
dark red, light red, orange, yellow, white sequence that we see.

I seem to remember that this is what's called the "Dutch palette",
because so many demos from the Netherlands used it for a while.

I was surprised to learn that you could use `eax` in 16-bit real mode
like this.  Apparently the operand size prefix 0x66 works in 16-bit
mode to give you 32-bit operands just as it works in 32-bit mode to
give you 16-bit operands.

So the remaining mysteries to me at this point were:

1. Why do all 256 palette entries get set, and not just 192 of them?

2. Where does the large number in the invisible byte (I guessed it's
   `ah` on entry to the loop, but I was wrong) come from?  It must be
   something the BIOS `int 10h` call is sticking there, I incorrectly
   thought, but what is it and what does it mean?

To answer these questions, I resorted to probing QEMU with GDB.
First, in QEMU's console, while running `quad_64b.com`:

    (qemu) info registers
    EAX=3f3f0000 EBX=...
    ...
    ES =22e4 ...
    CS =22e4 ...

So I can set a breakpoint at 0x22f40 and get it to stop on program
entry.  GDB is kind of dumb when controlling QEMU in real mode; it can
set breakpoints, and it does stop when its breakpoints get hit, but it
can't figure out that it stopped because the breakpoint was hit, so
you have to manually remove the breakpoint if you want to continue,
and `stepi` doesn't work.  Anyway, so I restarted QEMU and, before
restarting `quad_64b`, I said:

    (qemu) gdbserver
    (qemu)

And then I ran GDB:

    kragen@thrifty:~/devel/circles$ gdb
    GNU gdb 6.4.90-debian
    ...
    (gdb) target remote localhost:1234
    Remote debugging using localhost:1234
    0x0000e830 in ?? ()
    (gdb) set architecture i8086
    The target architecture is assumed to be i8086
    (gdb) b *0x22f40
    Breakpoint 1 at 0x22f40
    (gdb) c
    Continuing.

And then I started the program:

    A:\> quad_64b

And GDB woke up:

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x00000100 in ?? ()
    (gdb) x/40i $cs*16+$eip
    0x22f40:	mov    $0x13,%al
    0x22f42:	int    $0x10
    0x22f44:	lds    (%bx),%ax
    0x22f46:	mov    $0x3c9,%dx
    0x22f49:	mov    $0x3,%ch
    0x22f4b:	ror    $0x8,%eax
    ...

So I set breakpoints after the return from the interrupt, and at the
top of the loop body, and deleted the breakpoint where I was:

    (gdb) b *0x22f44
    Breakpoint 2 at 0x22f44
    (gdb) b *0x22f4b
    Breakpoint 3 at 0x22f4b
    (gdb) delete 1
    (gdb) c
    Continuing.

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x00000104 in ?? ()
    (gdb) info registers
    eax            0x400020 4194336
    ecx            0xff     255
    ...

So there are two interesting things.  First, there is a byte that's
more than 0x3f in `eax`, and it's the green byte, which explains the
green color I get when testing in QEMU with FreeDOS.  Third, `ecx` is
0xff, not 0; this explains why all of the palette entries get set,
instead of only three-quarters of them.  (Also, it means the invisible
byte, the one to be skipped, will be the one in `al` at the end of the
fourth, eighth, etc., iterations of the loop, not the first, fifth,
etc.  This is taken into account in my naming of the bytes above.)

Unfortunately, around this point I screwed up in GDB and had to start
QEMU again and reattach GDB, and then redelete some old breakpoints.

    (gdb) c
    Continuing.
    Watchdog has expired.  Target detached.
    (gdb) target remote localhost:1234
    Remote debugging using localhost:1234
    0x0000026f in ?? ()
    (gdb) display/2i $cs*16+$eip
    2: x/2i $cs * 16 + $eip
    0x96f:  mov    1(%bx),%al
    0x972:  mov    13(%bx),%ah
    (gdb) b *0x22f40
    Note: breakpoint 5 also set at pc 0x22f40.
    Breakpoint 6 at 0x22f40
    (gdb) delete 5
    (gdb) c
    Continuing.

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x00000100 in ?? ()
    2: x/2i $cs * 16 + $eip
    0x22f40:        mov    $0x13,%al
    0x22f42:        int    $0x10
    (gdb) info registers
    eax            0x400000 4194304
    ecx            0xff     255
    ...
    (gdb) p $eax = 0x0
    $6 = 0
    (gdb) b *0x22f46
    Breakpoint 7 at 0x22f46
    (gdb) delete 6
    (gdb) c
    Continuing.

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x00000106 in ?? ()
    2: x/2i $cs * 16 + $eip
    0x22f46:        mov    $0x3c9,%dx
    0x22f49:        mov    $0x3,%ch
    (gdb) info registers
    eax            0x20cd   8397
    ecx            0xff     255
    ...
    (gdb) 

So `ecx` is 255 upon program entry, and it's not the BIOS that's
setting %eax after all; it's the `lds` instruction, which loads the
`int 20h` instruction from the beginning of the PSP into it!  So red
starts out at 0x20, and the invisible byte starts out at 0xcd, the
`int` opcode, which is above 0x3f, so that's how the incrementation
gets started.

The Main Loop
-------------

Here's the main loop, minus the exit test at the end, which just jumps
back to the beginning:

      1f:   89 f8                   mov    ax,di
      21:   31 d2                   xor    dx,dx
      23:   bb 40 01                mov    bx,0x140
      26:   f7 f3                   div    bx
      28:   01 c8                   add    ax,cx
      2a:   01 ca                   add    dx,cx
      2c:   21 d0                   and    ax,dx
      2e:   c1 e8 03                shr    ax,0x3
      31:   d3 e8                   shr    ax,cl
      33:   00 05                   add    BYTE PTR [di],al
      35:   47                      inc    di
      36:   75 e7                   jne    0x1f
      38:   41                      inc    cx

So the main thing to notice here is that there's only one memory
access here, and it's through the `di` register; and `di` changes only
by being incremented.  So we're writing to each pixel on the screen
sequentially (remember, `ds` is set to the video segment a000h early
on.).

This loop is almost the same as the one in the 31-byte version.  There
are two differences:

- the operation on each pixel is an `add` rather than an `xor`;
- there's an additional `shr` by a constant 3.

Let's take a look at the last bit first:

      35:   47                      inc    di
      36:   75 e7                   jne    0x1f
      38:   41                      inc    cx

On entry to this loop, `cx` is zero; we know this because we just fell
out of the bottom of a `loop` instruction, which decrements `cx` and
jumps unless it decremented it to zero.  But every time `di` hits zero
(that is, wraps around to the top of the screen) we increment `cx`
again.  There are no other places in the loop that change `cx`.  So it
forms a kind of frame counter.

The first little bit:

      1f:   89 f8                   mov    ax,di
      21:   31 d2                   xor    dx,dx
      23:   bb 40 01                mov    bx,0x140
      26:   f7 f3                   div    bx

has a helpful comment in the source:

    ;dx=x, ax=y

0x140 is 320, the number of pixels in a row on the mode 13h screen,
and since there's one byte per pixel, also the number of bytes.  So if
`di` points at a pixel, dividing `di` by 320 should give us the
Y-coordinate as the quotient and the X-coordinate as the remainder;
and that's what the comment means.

The next few instructions are where the black magic happens:

      28:   01 c8                   add    ax,cx
      2a:   01 ca                   add    dx,cx
      2c:   21 d0                   and    ax,dx
      2e:   c1 e8 03                shr    ax,0x3
      31:   d3 e8                   shr    ax,cl
      33:   00 05                   add    BYTE PTR [di],al

The first two `add` instructions essentially shift the coordinates
up and to the left by one pixel every frame.  If you take them out,
you get a much flatter-looking picture, one that looks like this:

<img src="klapquad-noshift.png" />

The three-bit `shr` is also kind of optional.  It throws away the
bottom three bits of coordinate-derived data, which means that the
screen is divided into a bunch of 8x8 tiles, and there are no
differences introduced inside those tiles (although the per-frame
shift can introduce some).  Without the `add`s and this extra `shr`,
you get something that looks mostly like a bunch of particularly
colorful Sierpinski triangles.

The extra `shr` also slows things down a bit; without it, in some
pixels (e.g. (255, 255)), you can get all the way to palette value 255
in a single frame!

So these lines are kind of the heart of the hack:

      2c:   21 d0                   and    ax,dx
      31:   d3 e8                   shr    ax,cl
      33:   00 05                   add    BYTE PTR [di],al

With them in there, you get most of the visual and temporal features
of the original.

You'll notice that the screenshots all appear to be made up of nested
2x2 squares in which the bottom right quadrant of each square is
brighter than the other three quadrants.  That's kind of what you'd
expect if you `and` your coordinates together, right?  The lower right
quadrant of each 2x2 square is the one where both of the coordinate
bits distinguishing the quadrants of that square are 1.

That's pretty much the pattern you get if you just do this:

      2c:   21 d0                   and    ax,dx
      33:   00 05                   mov    BYTE PTR [di],al

Which looks like this, and doesn't animate:

<img src="klapquad-simplest.png" />

I don't know how to completely explain the difference between that and
the previous picture.  Obviously the frame counter is crucial in
actually getting an animation, and during the long period of time when
`cl` is counting up from 8 to 256, the animation essentially pauses;
but I don't quite know how to explain the difference it makes.

That's the main part of my understanding of Klappquadrat I'm still not
yet happy with.

The Exit Test
-------------

      39:   e4 60                   in     al,0x60
      3b:   fe c8                   dec    al
      3d:   75 e0                   jne    0x1f
      3f:   c3                      ret

According to the source for Dirojed, another 32-byte intro, these four
instructions are the "standard ESC check".  I guess port 60h reads as
1 if somebody's hitting the Esc key and something else otherwise; so
decrementing `al` sets the zero flag iff Esc was pressed, and if that
was the case, we fall through to the `ret`, which pops a zero off the
stack, returns to the `int 20h` at the beginning of the PSP, and
terminates the program.  About half the time, this crashes FreeDOS in
QEMU; I don't know why.

The 31-byte version just has an unconditional `jmp short schliefe`
here instead of this ESC check, so there's no way to exit to DOS.

My `gas` Version
----------------

I hand-translated the `objdump` disassembly output into equivalent
`gas` input to facilitate experimenting with changes to the program.
It produces a byte-for-byte identical executable until you uncomment
some of the lines that make it a little more robust.  It looks like
this:

            ## a copy of T$'s 64-byte "Klappquadrat"
            ## compile as follows:
            ## as -R klapquad.s -o klapquad.o
            ## objcopy -O binary klapquad.o klapquad.com.0
            ## dd if=klapquad.com.0 of=klapquad.com bs=256 skip=1
            .code16
            .org 0x100

            ## this makes it work right in FreeDOS in QEMU
            ## and when run multiple times
            #xor %eax, %eax

            movb $0x13, %al
            int $0x10
            #mov $(vidseg-2), %bx    # makes it work right in FreeDOS in QEMU
            lds (%bx), %ax

            mov $0x3c9, %dx
            mov $3, %ch             # orig. 3
    set_palette_loop:       
            ror $8, %eax
            cmp $0x3f, %al
            jb dont_threshold
            mov $0x3f, %al
            inc %ah
    dont_threshold:
            test $3, %cl
            je dont_outb
            outb %al, (%dx)
    dont_outb:
            loop set_palette_loop
    schleife:       
            mov %di, %ax
            xor %dx, %dx
            mov $0x140, %bx
            div %bx
            add %cx, %ax
            add %cx, %dx
            and %dx, %ax
            shr $3, %ax
            shr %cl, %ax
            addb %al, (%di)
            inc %di
            jne schleife
            inc %cx

            in $0x60, %al
            dec %al
            jne schleife
            ret
    #vidseg: .short 0xa000

My Python version
-----------------

There seem to be a limited number of people who can appreciate the
assembly version, even with the explanation, and it is a little hard
to modify; so I wrote this Python version, which I am placing in the
public domain.  It probably isn't easier to run (you need to install
Python, Numeric, SDL, and Pygame instead of any one of DOSBox,
Microsoft Windows, or FreeDOS, and QEMU, and the MS-DOS interfaces are
probably stabler) but it's sure easier to read if you know Python and
not assembly.

    #!/usr/bin/python
    """Recreation of T$'s Klappquadrat intro in Python with Pygame and Numeric.

    This is a recreation of the 64-byte version, and I think 31
    instructions.  By contrast, this is 44 lines of code, about 1600
    characters.  On the other hand, you can change this from 320x200x256
    to, say, 640x480x512 by changing the screensize= and ncolors= lines to
    say (640, 480) and 512.

    Kragen Javier Sitaker wrote this recreation, but T$ is to credit for
    the original intro.

    """

    import pygame, sys
    from Numeric import zeros, subtract, array, arange, where, take, shape, indices

    screensize = (320, 200)
    ncolors = 256

    def colors(masks, levels):
        "Compute a grayscale pixel from bit masks and a floating-point level [0,1)"
        return sum([int(mask * level) & mask for mask, level in zip(masks, levels)])

    def clamp(a, b, c):
        "Threshold b between lower limit a and upper limit c."
        d = where(a < b, b, a)
        return where(d < c, d, c)

    def redraw(screen, buf, palette, frames):
        x, y = indices(screensize)
        # this 256 is not ncolors; it's a timing/pacing thing
        buf += ((x + frames) & (y + frames)) >> (frames % 256) >> 3
        buf %= ncolors
        pygame.surfarray.blit_array(screen, take(palette, buf))

    def main(argv):
        pygame.init()
        screen = pygame.display.set_mode(screensize, pygame.FULLSCREEN)

        buf = zeros(screensize)
        fiery_rgb_integers = clamp(0, subtract.outer(arange(ncolors) + ncolors/8,
                                                     ((array([0, 1, 2]) * ncolors)
                                                      / 4)),
                                   ncolors / 4)
        masks = screen.get_masks()[:3]
        # I'm not sure this palette is exactly right; it only goes to 63
        # in the original...
        palette = array([colors(masks, levels/float(ncolors/4))
                         for levels in fiery_rgb_integers])

        frames = 0
        while 1:
            ev = pygame.event.poll()
            if ev.type == pygame.NOEVENT:
                frames += 1
                redraw(screen, buf, palette, frames)
                pygame.display.flip()
            elif ev.type == pygame.KEYDOWN: break
            elif ev.type == pygame.QUIT: break

    if __name__ == '__main__': main(sys.argv)