In this post, I want to talk about how I implemented 3D graphics rendering in my Tiny Tapeout demo Underflow Cubed, computing the pixels only a fraction of a scan line before they need to be displayed. This resulted in an algorithm that I'd like to call Ray Interval Tracing, which I will describe below along with a number of additional optimizations that were needed to keep down the size.

The demo took part in the TTSKY26a Demo Scene Competition. Entries must be implemented purely in silicon. You get a limited area for your design, depending on the category (1/2/4 tiles). I was in the 4 tile category. Demos should produce output as RGB222 VGA video (using the Tiny VGA Pmod) and a single digital output for sound (e g, using PWM).

The constraints are pretty tight:

It's not a lot of silicon, and memory takes more space than gates.
Memory is basically available in the form of Flip Flops (FFs), each can hold a single bit. A tile can fit maybe 500 FFs if you arrange them compactly and don't put in anything else. (The demo ended up using 6024 standard cells, of which 526 were FFs).
A new pixel needs to be fed to the VGA signal every 2 clock cycles:
The demo is clocked at 50.4 MHz, which is pretty close to the limit, and outputs in 640x480 60 fps VGA mode with a pixel clock of 25.2 MHz (the slowest pixel clock of any VGA mode).

This is not how 3D rendering is usually done! Normally, you have at least two frame buffers. This allows you to render one polygon at a time, and if you need a little bit of extra time, you can just show the previous frame until the next one is finished.

But what can you do if you want to have some 3D graphics when you have too little memory even for a line buffer? How can you render 3D graphics one scan line at a time with only a few cycles/pixel, and still draw something interesting? That is what I set out to explore when I was making the demo. In this write-up, I want to try to explain what I did, and how I got it to fit into the allotted space.

In case you are not familiar with the term, "racing the beam" refers to computing graphics output just ahead of when the pixels go into the video signal. That is pretty much what we have to do here. We need to find a good way to trade some more computations for less memory. As you will see, there was still space for a little bit of buffering; otherwise, much more computation would have been needed!

Outline:
I'll start by describing the scene structure that we will be using, followed by the geometrical tests needed to traverse it. Then I'll show how we can transform a per pixel ray tracing algorithm so that it can share computations between many pixels in the same scan line. I'll conclude the first part by talking a bit about how it all fits together to render a scene. The algorithm itself was essential to being able to fit a 3D renderer into the demo, but it's far from enough by itself. The second part of the post describes various optimizations connected to the algorithm that were needed to squeeze the demo into 4 tiles for the TT demo competition, many of them crucial including some not so obvious. I'll conclude with a few words about the algorithm's limitations and my development process.

(Note: If you're reading this on a smartphone, you might want to turn it on the side. The code examples seem to be broken in portrait mode.)

Scene structure

I wanted to make a Tiny Tapeout demo with some kind of 3D graphics. But how? I already did some 3D graphics using heightfield rendering in my first Tiny Tapeout demo, Sequential Shadows. The standard algorithm renders the landscape one column of pixels at a time, from bottom to top. It was relatively straightforward to adapt this to TT: I had to calculate the heightfield data on the fly, add some tweaks to make it fast enough, and make the viewer flip their monitor on the side, but that was pretty much it. But this time, I wanted something a bit more free-form, that would also allow you to rotate the camera freely.

The scene structure is inspired by the PC game Descent from 1995, a very early game to allow navigation of a full 3D environment with arbitrary camera rotation. In the game, you fly around in and explore an underground network of tunnels. As I understand it, scenes in the game are composed of convex rooms connected together. Each inside face in each room can be either solid, or a portal that connects to a matching face in another room.

I simplified the scene structure further:

Each room is an axis aligned box
Each of a room's six inside faces can be either solid, or contain a rectangular portal that connects to another room

Here's an example of the scene structure, seen from above (xz plane):

Imagine that each rectangle is a box with a different height and position along the y axis. I haven't drawn any boxes connecting in the y direction since it would be hard to illustrate clearly, but the scene structure supports that too, and it is used in the demo.

This scene structure has some important benefits: It can be explored locally starting from the camera position and moving further away from it, with depth increasing as you move into each new room. No overdraw is needed, since the first hit that you find for any pixel has the smallest depth. Also, you only have to consider very little geometry at any given time (the current room).

Descent uses a frame buffer based recursive rendering algorithm that renders the wall polygons in each room clipped to the intersection of the portal polygons that the room is seen through. Since all portal polygons are convex, their 3D projections become convex as well, as do intersections of the projections.

How do we use the scene structure for scan line based rendering? It seems overkill to try to reproduce the Descent algorithm for each scan line. Let's start from a ray tracing formulation instead. That should help us figure out what is the least amount of work needed to render a single pixel. I'll show you later how we can reuse work between pixels in the same scan line.

We can describe ray tracing through the scene with the following pseudocode, where dr is the direction vector for the current ray (the starting point is fixed at the camera position):

    # Trace the ray through room, in direction dr
    function trace_1(room, dr): 
        axis, sign = <find which inside face of room was hit by the ray>

        if <the ray hit a solid wall>: return <wall hit>

         next_room = next(room, axis, sign)
        if <the ray goes through the portal between room and next_room>:
             return trace_1(next_room, dr) # recurse
        else:
            return <wall hit>

Geometry tests

In this section, I will describe how to evaluate the geometrical tests in trace_1. The important properties that the algorithm depends on are:

The geometric tests in the trace_2 function below take the form of a sequence of binary decisions (represented as calls to the ang_leq function)
The outcome of each decision can switch at most once per scan line, since it evaluates the sign of a function that is linear in the ray direction vector dr, which is linear in the pixel index

If that is enough geometry for you, you can skip ahead to take a quick look at the trace_2 function at the end of this section, and then proceed from there. Otherwise, let's dig a little deeper into the geometry!

How do we

<find which inside face of room was hit by the ray>?
check if <the ray goes through the portal between room and next_room>?

Both tests can be formulated by looking at a few edges of the current/next room box to check which side of each edge that the ray passes.

For the first test, we know that the ray is passing through the current room. From the point of view of the current ray, the inside of the room looks like

The ray is a point that falls somewhere inside the hexagonal outline of the box, the question is in which of the three faces. We will flip the coordinate axes so that all elements of the ray direction dr vector are nonnegative.

The three faces meet at the far corner of the room, and the three edges between the faces (let's call them the x/y/z edge) pass through this corner. We can start by checking which side of the z edge that the ray passes (for instance):

This splits the area into two halves, with the yz face in one half, the xz face in the other, and parts of the xz face in both. Depending on which side the ray is in, we need to check it against either the x or y edge to see whether it hit the xy face or the other face in that half. The process can be summarized in the decision tree

The xy, xz, and yz faces will be referred to as having the face axis z, y, and x, respectively.

For the second test, we know which face the ray hits in the current room. We want to check whether it hits the portal area, which is the subrectangle of the current face that overlaps with the next room. The view as seen from that ray is

We need to check whether the ray passes inside each of the four edges that delineate the portal rectangle (two x and two z edges in this case). If the ray passes outside any of the edges, we hit a wall, and we are finished. Checking against alternating x and z edges (ordered 1-4 below) splits the box face into five areas:

The area in the center is the portal, the others are solid walls.

Edge checks

How can we implement the edge checks used above? Each edge check takes

an edge, described by a box corner and an axis aligned edge direction, and
the current ray direction vector dr,

and checks which side of the edge that the ray passes.

Given three vectors: the ray direction vector dr, the corner position corner, and the edge direction e_edge, the calculation can be done by checking the sign of the 3x3 determinant of the three vectors. Since e_edge is one of the coordinate axes, the computation reduces to a 2x2 subdeterminant of the other two vectors:

    det(e_edge, corner, dr) = corner[i1]*dr[i2] - corner[i2]*dr[i1]

where

    i1 = (axis + 1) mod 3
    i2 = (axis + 2) mod 3

and axis is 0, 1, 2, for e_edge along the positive x, y, z axis respectively.

This is the cheapest way I have found to evaluate this check (but see below for some further optimizations). There is no need for floating point math: if corner and dr have integer or fixed point elements, so will the determinant.

We see that the determinant is linear in the ray direction vector dr (which is linear in the pixel index), which provides the key property stated above: The outcome of each decision is the sign of the determinant, and it can switch at most once per scan line. We also see that each decision needs two multiplies and an addition. My implementation can evaluate one decision per cycle, so the hardware needs to be able to do two multiplications per cycle, which requires a considerable amount of the silicon area. We see that the choice to work with axis aligned boxes results in a major simplification, since a general 3x3 determinant requires 9 multiplications instead of 2 (some with additional precision) and a number of additions.

Pseudocode

Let's define the function ang_leq, which evaluates the edge decision described above:

    function ang_leq(axis, corner, dr):
        i1 = (axis + 1) mod 3
        i2 = (axis + 2) mod 3

        # Arbitrarily consider 0 as positive
        return corner[i1]*dr[i2] - corner[i2]*dr[i1] >= 0

We can now rewrite the trace_1 function above with concrete code for the geometric checks. The details are not important unless you want to try to implement the algorithm yourself, but note that it is essentially a sequence of calls to the ang_leq decision function. The result is

    function trace_2(room, dr, dr_signs):
        # Which face did we hit?
        # ======================
        # get far corner
        corner = room.pos .* dr_signs + room.half_size

        if ang_leq(2, corner, dr):
            axis = ang_leq(0, corner, dr) ? 2 : 1
        else:
            axis = ang_leq(1, corner, dr) ? 0 : 2

        sign = dr_signs[axis]

        # Get the other room connected to this face
        # =========================================
        new_room = get_connected_room(room, axis, sign)
        if new_room == None:
            return wall_hit_color(axis, sign)

        # Did we hit the portal?
        # ======================
        # Check the far corner in the common plane (of the new room)
        # ----------------------------------------------------------
        signs = [1, 1, 1]; signs[axis] = -1
        corner = new_room.pos .* dr_signs + new_room.half_size .* signs

        if !ang_leq((axis+1) mod 3, corner, dr):
            return wall_hit_color(axis, sign)
        if ang_leq((axis+2) mod 3, corner, dr):
            return wall_hit_color(axis, sign)

        # Check the near corner of the new room, which is in the common plane
        # -------------------------------------------------------------------
        corner = new_room.pos .* dr_signs - new_room.half_size

        if ang_leq((axis+1) mod 3, corner, dr):
            return wall_hit_color(axis, sign)
        if !ang_leq((axis+2) mod 3, corner, dr):
            return wall_hit_color(axis, sign)

        # Recurse inside the new room
        # ===========================
        return trace_2(new_room, dr, dr_signs)

where .* is elementwise multiplication, and the actual ray direction vector is dr .* dr_signs. Note that dr_signs is used to flip the world coordinate system (room.pos .* dr_signs) so that we can keep the convention that dr has nonnegative elements.

We see that six ang_leq decisions are needed to traverse one room through trace_2. Some of the checks for hitting the portal could be skipped if we know that the portal stretches all the way to the edge of the face in the current room in that direction.

With this formulation, we can implement a per pixel ray tracer, which I did to check that the results are as expected.

Abstracting the decision tree

We know that any ray starting in one of the rooms of the scene will terminate on a solid wall (non-portal) piece of some inside face of some room. To find which wall was hit, we can traverse the decision tree described by a call to the trace_2 function. Each decision in the tree is determined by a call to the ang_leq decision function.

To traverse the decision tree in hardware, we need a way to represent where we are in the tree at any given time. In trace_2, that trace state is determined by the current room, where we are in the code, and the values of axis and dr_signs.

Let's formulate a new trace function that abstracts out the details of traversing the decision tree by operating directly on the trace state:

function trace_3(trace_state, dr): if wall_hit(trace_state): return wall_hit_color(trace_state, d) # Which side of the current edge did we end up on? d = decision(trace_state, dr) trace_state = next_trace_state(trace_state, d) return trace_3(trace_state, dr)

The decision function evaluates the next decision using an appropriate call to ang_leq, and the next_trace_state function updates the trace state to reflect the decision taken. It will typically take 6 decisions and 6 consecutive calls to next_trace_state to traverse each room, which our previous trace_2 function did in one go.

The decision tree that we need to traverse is encoded by the functions decision, wall_hit, and next_trace_state together. The code for them is a somewhat more convoluted way to represent the original trace_2 function (so I will not show it). Each function call represents work that can typically be performed in a single cycle. The abstractions that these functions provide will be very helpful to let us develop the algorithm further in the next section.

Subdivision

So far, we looked at what is needed to trace one ray at a time. Let us now ask how many rays we can trace together.

Looking at trace_3 and starting at the camera position, how does the execution differ depending on dr? All rays start at the root of the decision tree. At any step, the path splits in two, depending on the outcome of the decision function at that step. Neighboring pixels will usually take very similar paths through the tree. Progressing through the tree, collections of rays that take the same path through the tree will gradually split into smaller collections as new decisions are encountered. Each split corresponds to a geometric edge crossing the scan line.

The rays that need to be traced for a given frame form a pyramid shaped cone, with the apex at the camera position and passing through a rectangle that represents the screen's projection surface:

Since we don't have enough memory to keep track of things between scan lines anyway, let's just look at all the rays for a given scan line:

We get one dr(i) vector for each pixel index i. Together, the ray directions can be described using two vectors, dr_0 and ddr:

dr(i) = dr_0 + i * ddr

where dr(i) goes linearly from dr_0 to dr(i_max) = dr_end.

As we have seen, for a given trace_state, the decision function in trace_3 can change at most once per scan line. The following trace function exploits this property to work efficiently on a whole interval (i1, i2) of rays along the same scan line, from dr(i1) to dr(i2):

# Trace all rays for the pixels i, i1 <= i <= i2 function trace_4(trace_state, (i1, i2)): if wall_hit(trace_state): emit_wall_hit(trace_state, d, (i1, i2)) return else: d1 = decision(trace_state, i1) d2 = decision(trace_state, i2) if d1 == d2: # Both endpoints agree # ==> all pixels in between must agree as well trace_state = next_trace_state(trace_state, d1) trace_4(trace_state, (i1, i2)) else: # The endpoints don't agree; # bisect to find the switching point i1_last = bisect_decision(trace_state, (i1, i2)) trace_4(next_trace_state(trace_state, d1), (i1, i1_last)) #A trace_4(next_trace_state(trace_state, d2), (i1_last+1, i2)) #B

Since each decision function can switch at most once in the scan line, it is enough to check it at the end points of the current interval to see if the decision is the same along the whole interval. And if the two end points don't agree, there is a single switching point that can be found by bisection.

The trace_4 function starts with an initial interval of pixels/rays. As it traverses the decision tree, it will find edges that split the rays into smaller collections (intervals). Wall hits are delivered as intervals from left to right, with color depending on the trace state when the wall hit occurred. After each wall hit, the function returns to continue on the next interval to the right, unless the scan line is finished.

The trace_4 function is recursive, calling itself in three places. Before the recursive call at #A, the implementation needs to save enough state so that it can continue with the call at #B after returning from the first call. The other two calls are tail calls. No information needs to be saved before making them, since there is nothing to do afterwards (except returning according to another #B call saved previously).

Putting it all together

According to the VGA standard, each scan line starts with a horizontal blanking period (hblank), followed by an active period when the visible pixels need to be fed to the VGA signal. When hblank starts, the first step is to calculate the view for the new scan line, in the form of the dr_0 and ddr vectors. Then, subdivision is started at the top level of the decision tree, with (i1, i2) spanning the whole scan line. One decision is evaluated per cycle, except in some cases such as when returning.

The algorithm as described assumes that the sign of each element of dr is fixed, but each sign can change once along a scan line. To handle this, there is an initial decision before the first ang_leq check that checks the signs of the three dr elements, causing the algorithm to subdivide the scan line into intervals with constant dr signs.

When the algorithm encounters a wall hit, the color and length in pixels of the wall segment that was hit is emitted into a 10 entry interval FIFO. This gives the subdivision algorithm some space to stay ahead of the display position. If the FIFO would overflow, subdivision is paused until there is space in it. Each time a pixel needs to be fed to the VGA output, the last interval in the FIFO is shrunk by one pixel, eventually consuming it.

If the FIFO becomes empty, the beam has caught up with the rendering algorithm, and the color of the current pixel is unknown. The subdivision then tries to catch up by giving up on the current interval (unless it's too big, to avoid giving up on the whole rest of the scan line at once).

Optimizations

The basic algorithm is a pretty good starting point for saving memory, since it operates on one scan line at a time, and uses intervals instead of raw pixel values. Here, I want to describe some of the further optimizations that I needed to make to squeeze the demo into 4 Tiny Tapeout tiles. The focus is on saving memory since it needs a lot of area, but there are some optimizations for saving computations as well.

The first four optimizations are absolutely essential to being able to fit the demo into the available space. I'm especially happy with the stack underflow trick, but also the power of two dr stepping, since it's not obvious that they should work as well as they do. The remaining optimizations were also very useful, and are sorted in rough order of how complex they are to explain (ending in the appendix!).

Some of the optimizations are mostly about being careful about what you really need to store. Others are more about arithmetic tricks and finding where you can reduce precision without noticeable impact on the results.

Keeping down the size of the state

The trace state is basically

current room id
coordinates of one corner of the current room (whichever corner we are currently checking edges for)
current depth in the decision tree: number of rooms deep, decision level within the room (6 decision levels per room)
current face axis

The subdivider state is basically

state machine state: evaluating d1, d2, bisecting, or returning?
(i1, i2) interval for subdivision, and corresponding values for bisection
dr vectors for i1 and i2
dr vector

The state is updated in place as needed to traverse the scene, saving enough information to restore it when returning. There is additional state in the stacks, which we get to next.

Keeping down the size of the stack entries

Two stacks are used to store the information needed to continue execution after returning:

The interval stack is popped once for each return from a wall hit. It stores which depth of the decision tree to return to, and the value of i2 to use for #B (i1_last is known when returning).
The face axis stack is popped once for each return from a room. Two bits per return are enough to indicate which direction we came from, so that the state can be updated accordingly.

Stack underflow: keeping down the stack size

The stacks can get quite deep, especially the interval stack. Instead of reserving storage for the maximum stack depth needed, the interval stack is stored in a ring buffer. When too many entries are pushed, the oldest entries are overwritten. When trying to pop an entry that was overwritten, the subdivision is instead restarted from the top of the decision tree, starting at the first pixel that hasn't been emitted to the FIFO yet.

It seems that once the algorithm reaches a certain depth of the interval stack, it tends to stay close to that depth for some time. I was able to reduce the interval stack to just 4 elements, without degrading the visual results in a noticeable way. This is a huge optimization since the interval stack needs to take up a significant part of the silicon area otherwise!

Stepping the dr vector

Subdivision needs to be able to jump back and forth between evaluating the current decision at both i1 and i2, so one dr vector dr1, dr2 is stored for each. During bisection, the algorithm needs to evaluate the same decision at many intermediate points between i1 and i2.

When doing binary search on an interval (i1, i2), we need to choose a bisection point. The common choice is to use the mid point i_mid = (i1 + i2)/2, which minimizes the maximum number of bisection steps that are needed. In our case, we need to calculate a new ray direction vector dr_mid by adding or subtracting ddr * delta_i from dr1 or dr1 for some integer delta_i. It would be much easier if delta_i could be restricted to be a power of two, and it turns out that it can!

After calculating delta_i_opt = (i2 - i1)/2, I find the closest power of two delta_i = ± 2^n and use i_bisect = i2 - delta_i instead. This turns out to reduce the efficiency of binary search only slightly. The new dr value is stored in dr2 by adding or subtracting ddr << n.

This is a very important optimization: If the implementation had needed to compute ddr * delta_i for any integer delta_i, the multiplication would have been about as costly as the one needed in the decision function.

Since bisection updates dr2, it already holds the correct value for the #A call in trace_4 when bisection is finished. The logic for stepping dr values is only implemented for dr2. dr1 only needs to be updated when returning, which is done by updating dr2 according to i2 += 1, and then copying dr1 = dr2.

When returning, i2 can increase by any amount, not necessarily a power of two. dr2 is then updated in step with i2 until i2 matches its new target value. In each cycle, the power of two step is used that will reduce the error the most. This way, any step delta_i2 can be accomplished in more or less log4(delta_i2) cycles, similar to radix 4 Booth multiplication. The updates are done in parallel with popping the stack and evaluating the decision for dr1 after the return.

There is an additional optimization to avoid storing fractional bits for the dr vectors, which relies on a modified multiplication algorithm. See the appendix for details.

Camera view calculations

The camera rotation is calculated by computing values for the dr_0 and ddr vectors before starting the rendering of each scan line. The calculations reuse the logic that is used for stepping dr during subdivision, storing intermediate results in dr1, dr1, and ddr. Multiplication is done in a similar way to when updating the dr vector at a return, so that, e g, multiplication by a number in the range ± 512 can be done in 5 cycles or so.

Sine/cosine functions are calculated by approximating cos(pi/2*t) with 1 - t^2 = (1 - t)*(1 + t), for abs(t) <= 1. By using the form (1 - t)*(1 + t), products such as sin(phi1) * cos(phi2) can be computed by multiplying one 1 ± t factor at a time. It does require an extra bit in the intermediate results, though.

The camera calculations are reasonably fast (about a quarter of the hblank time) since they can work on three vector elements in parallel. This is a good thing, since the subdivision algorithm needs as much of hblank as possible to get a head start.

Interval FIFO

Each entry in the interval FIFO has a length and a 2 bit color index, which is translated to RGB222 for VGA output at the FIFO's output. Most of the size is taken up by the length bits.

To save some space, most of the interval FIFO is organized as a shift register. This avoids the need for circuitry for random-access writes, giving a more compact representation.

The first entry (head) is special since it can merge incoming intervals of the same color.
The last entry (tail) is special since it can reduce the current interval length by one each time a pixel is consumed (the head needs to do this as well, since it can be the only entry).

The entries in between are stored in a shift register. The entries in the shift register are limited to a length of 64 pixels. If the head has more than 64 pixels, it is split so that 64 pixels are pushed into the shift register and the rest are kept in the head. Since the interval FIFO is most needed when there are many short intervals, this saves some space and doesn't seem to degrade the visual results very much.

Some parts of the demo use dithering to be able to represent more shades through the RGB222 VGA output. The dithering uses alternating horizontal stripes of color, so that it can be performed before feeding data into the interval FIFO. This allows the FIFO entries to stay at 2 bits for the color index, and allows the FIFO to merge same-colored intervals more often.

Subdeterminant calculation

The subdeterminant calculation / decision function

    function ang_leq(axis, corner, dr):
        i1 = (axis + 1) mod 3
        i2 = (axis + 2) mod 3

        return corner[i1]*dr[i2] - corner[i2]*dr[i1] >= 0

needs to be able to compute two multiplications per cycle. We want to keep the factors as small as possible, to reduce the size of the multipliers.

The dr vector needs 10 bits per element so that each of the 640 pixels in a scan line can have its own value, otherwise the resolution of the output image would appear to be lower. The most significant bit is the sign bit. The calculations use only the absolute values of dr, so 9 bits are enough for the dr[i1] and dr[i2] factors in the multiplications above.

Since the dr elements passed to ang_leq are known to be nonnegative, we can determine the ang_leq result without multiplication if one of the corner elements is negative: (we assume that both corner elements will not be negative at the same time)

    function ang_leq(axis, corner, dr):
        i1 = (axis + 1) mod 3
        i2 = (axis + 2) mod 3

        if (corner[i1] < 0): return false
        if (corner[i2] < 0): return true

        return corner[i1]*dr[i2] - corner[i2]*dr[i1] >= 0

This means that the corner factors can also be unsigned in the multiplication, saving a bit in each.

I started with quite small worlds, where 8 bits for the corner factors seemed to be enough to resolve details. When I made my worlds bigger, more bits would be needed to cover longer distances, but we don't need as much detail at those distances. Instead of adding more bits to the corner factors, I added a step that shifts corner right until it fits into 8 bits. This loses some precision that was probably not needed at that distance anyway, but behaves identically otherwise, since the determinant is linear in the corner vector.

The resolution needed for the corner elements depends on the resolution of the geometry that we want to render, but more importantly, it depends on the needed resolution of camera movement through the world. In the end, I wanted to move pretty slowly through many scenes, and while the demo renders everything at 60 fps, the slow movement speed combined with 8 bit resolution for the corner factors meant that the movement would only appear to update at 30 fps. To make the motion smoother, I added an interlace feature, where every other scan line is one step ahead in the movement. This creates smoother motion, but polygon edges become more jagged. (There is an option to turn this off in the demo.)

With the above optimizations, the determinant calculation was down to two 9x8 bit multiplications per cycle (and a subtraction/comparison). That is still a lot of silicon. To save a little more space, I used an approximate multiplication a*b that disregards products of bits in a and b that are worth less than 32. This did not seem to add any major graphical artifacts.

Limitations

The scene format limits which kinds of scenes can be represented. One extension could be to allow several portals in the same box face. Obviously, if the scene is too complex, the algorithm will not keep up. In the current setup, it is hard to deal with open spaces, as can be seen in the blue cubes scene in the demo. The open space is composed of many boxes connected in a grid, which forces the algorithm to subdivide many edges that will never be visible.

Development

Initial algorithm experiments were done in a high level language (Julia). This includes trying different variations of the subdivision algorithm, and replacing recursion with explicit stacks. I then developed the verilog implementation of the algorithm together with a reference model and testing code in C++.

As you have seen above, the verilog code includes a lot of optimization tricks, so the C++ model that just describes the algorithm is at a considerably higher level of abstraction. The testing code uses Verilator to run the C++ model in parallel to verilog code, checking that they produce the same result each cycle (except cycles spent returning, which were not modelled in detail). This eventually let me run through the first scene of the demo, checking the results against the C++ model. I also did some profiling with the Julia and C++ models along the way, trying out different variations and optimizations to see how they would influence computation speed and visual artifacts.

The C++ model was incredibly helpful for getting the verilog code to work correctly. At the first mismatch, the testing code prints a lot of diagnostic information about values of interest, and which ones were found to mismatch. Some detective work was usually needed to figure out the cause of each mismatch and how to fix it, but I can't imagine how I would have been able to develop a working verilog implementation without this approach.

After the first scene was working, I didn't have to change much in the rendering algorithm, and it was more hacking to implement the different scenes in the demo. (The synth/music player took a bit of engineering as well.)

Wrap-up

It took quite a few things to get right to get this demo working. My devlog for the project is now more than 14000 lines (approximately the same as the number of frames in the demo!), and that doesn't include the music/synth part. But I'm pretty happy with how it turned out! The limitations on Tiny Tapeout demos can seem quite severe, with the need to compute the graphics more or less racing the beam due to the severe memory limitations. But it turns out that you can do more within those limitations than I thought (at least if you have 4 tiles to work with).

You can check out the the algorithm in action in the Underflow Cubed video.

Appendix: Avoiding storage of fractional bits for the dr vector

In this appendix, I will show how the dr vector can be updated incrementally with good precision, without actually storing any fractional bits for the elements. This saves 24 FFs, almost 5% of the total.

As discussed above, the ray direction vector dr for a given pixel uses 10 bits per element, which is enough for visual detail when rendering a single pixel. As the pixel index i goes from 0 to 639 over the course of a scan line, dr should be interpolated from dr_0 at i=0 to dr_end at i=639. This could be done as described above,

    dr(i) = dr_0 + (i * ddr)

but ddr would be very small, on the order of the value of the least significant bit in dr. To avoid catastrophic roundoff errors, a number of fractional bits would have to be added to dr and ddr.

My original implementation used 8 fractional bits in both dr and ddr. There is no getting around the fractional bits in ddr. They make up almost the whole value, and are needed to represent the change between dr_0 and dr_end. But do we really need to store the fractional bits of dr?

I found a way to avoid storing fractional bits for dr, which works pretty well with the power of two stepping described above. We want to approximate

    dr(i) = dr_0 + ((i * ddr) >> 8)

where both dr and ddr are integer valued. Specifically, we want to be able to calculate dr(i ± 2^n) relatively cheaply from dr(i) for different values of n.

Consider calculating (a*i)/2^8 where a and i are (for example) 10-bit integers. We can express a and i in terms of their bits as

    a = 2^0*a0 + 2^1*a1 + 2^2*a2 + 2^3*a3 + 2^4*a^4 + 2^5*a5 + 2^6*a6 + 2^7*a7 + 2^8*a8 + 2^9*a9
    i = 2^0*i0 + 2^1*i1 + 2^2*i2 + 2^3*i3 + 2^4*i^4 + 2^5*i5 + 2^6*i6 + 2^7*i7 + 2^8*i8 + 2^9*i9

and then multiply them together and expand the product, yielding 10x10 = 100 terms that need to be added up. The weight of each term depends on the index of the a and i bit used:

          a0    a1    a2    a3    a4    a5    a6    a7    a8    a9
    i0    2^-8  2^-7  2^-6  2^-5  2^-4  2^-3  2^-2  2^-1  2^0   2^1
    i1    2^-7  2^-6  2^-5  2^-4  2^-3  2^-2  2^-1  2^0   2^1   2^2
    i2    2^-6  2^-5  2^-4  2^-3  2^-2  2^-1  2^0   2^1   2^2   2^3
    i3    2^-5  2^-4  2^-3  2^-2  2^-1  2^0   2^1   2^2   2^3   2^4
    i4    2^-4  2^-3  2^-2  2^-1  2^0   2^1   2^2   2^3   2^4   2^5
    i5    2^-3  2^-2  2^-1  2^0   2^1   2^2   2^3   2^4   2^5   2^6
    i6    2^-2  2^-1  2^0   2^1   2^2   2^3   2^4   2^5   2^6   2^7
    i7    2^-1  2^0   2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8
    i8    2^0   2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8   2^9
    i9    2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8   2^9   2^10

This is pretty much how a regular single cycle binary multiplier works, except that we have designated the lowest 8 bits of the result as fractional bits.

My modified multiplication algorithm jumper_mul(a, i) changes the weights to

          a0    a1    a2    a3    a4    a5    a6    a7    a8    a9
    i0    0     0     0     0     0     0     0     1     1     2^1
    i1    0     0     0     0     0     0     1     1     2^1   2^2
    i2    0     0     0     0     0     1     1     2^1   2^2   2^3
    i3    0     0     0     0     1     1     2^1   2^2   2^3   2^4
    i4    0     0     0     1     1     2^1   2^2   2^3   2^4   2^5
    i5    0     0     1     1     2^1   2^2   2^3   2^4   2^5   2^6
    i6    0     1     1     2^1   2^2   2^3   2^4   2^5   2^6   2^7
    i7    1     1     2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8
    i8    1     2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8   2^9
    i9    2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8   2^9   2^10

We see that

the integer weights have been kept as is,
the weights smaller than 2^-1 have been set to zero, and
the weights equal to 2^-1 have been set to one.

This will give a small roundoff error compared to using the original table, but it doesn't seem to give any bad visual artifacts. This particular arrangement turns out to allow us to update dr in an efficient way when stepping i by a power of two.

Let jumper_mul(a, i) be the approximation of (a*i)>>8 calculated using the modified table of weights above. Assume that we know jumper_mul(a, i) and want to calculate jumper_mul(a, i+1). Looking at which bits of i changed, we get a number of cases:

    i       i+1
      *0      *1
     *01     *10
    *011    *100
    ...

where all numbers are in binary and * represents any sequence of bits. In the first case, we just set i0 to one, so we need to add the term described by the i0 row in the table to jumper_mul(a, i):

          a0    a1    a2    a3    a4    a5    a6    a7    a8    a9
    i0    0     0     0     0     0     0     0     1     1     2^1

that is, add (a >> 8), plus a7.

If i has the two lowest bits 0b01, the difference is the i1 row minus the i0 row:

a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 i1-i0 0 0 0 0 0 0 1 0 1 2^1

that is, add (a >> 8), plus a6.

If i has the three lowest bits 0b011, the difference is the i2 row minus the i0 and i1 rows:

a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 i2-i0-i1 0 0 0 0 0 1 0 0 1 2^1

that is, add (a >> 8), plus a5, and so on.

The result is that to increase i to i+1, we increase jumper_mul(a, i) by a >> 8, but we also add the value of one of the lower bits of a. Half of the time, that bit is a7 (the first bit shifted out in a >> 8), a quarter of the time it is a6, an eighth of the time it is a5, and so on. This way, the long term difference jumper_mul(a, i2) - jumper_mul(a, i1) is approximated well over many steps going from i1 to i2. Going from i+1 to i is similar.

We can update from jumper_mul(a, i) to jumper_mul(a, i ± 2^n) for other powers of two as well. The calculations are similar to the ones above. E g, when going from i to i+4, the step is a >> 6, adding a5 half of the time, a4 a quarter of the time, etc.

Search This Blog

Possible Realities

Tracing the Beam