Featured Porting a Wolfenstein-type engine to the MEGA65

This post is about something that I have had in mind for a long time: A couple of years ago (or possibly more), someone pointed out to me that the DMAgic controller could in theory be used to paint textures, since texture painting really is just a matter of copying a texture onto the screen. Well, and scaling it appropriately. And dealing with the screen memory layout, so that each successive write goes to the next pixel. Luckily, its possible to do both using some quite simple tricks.

First, let's look at the pixel writing first. First up, remember that the MEGA65's VIC-IV has no frame buffer. Rather, just like the C64's VIC-II, the screen is usually made up of 8x8 character cells. The main difference between bitmap mode and text mode is whether you can choose which cell appears where, or whether they appear in a fixed arrangement. This is actually a strength of the VIC family of video controllers for a Wolfenstein type engine, where we need to draw vertical stripes of pixels. Let me explain:

First, lets consider a toy screen with only 4 columns and 4 rows:


This is the normal arrangement in bitmap mode, with ascending characters going across the screen. So to find the address of the next pixel down in a column we have to work out whether we are crossing a character boundary, and if so, add enough to the address to skip the rest of the characters on the row. But if we use text mode, and put our pixels in the character data, we can arrange the screen differently, striping down the rows instead of across the columns, like this:


Now, if we want to advance one pixel down the screen, we always add the same amount, i.e., the number of bytes to skip down one row in a character. On the VIC-II, that is always one byte, as one bit defines each pixel. However, for a nice looking engine, we would like to have more pixel colours. So for this engine, I am using the VIC-IV's "full-colour text mode". You can read more about it in the MEGA65 Programmer's Reference Guide, but the bottom line is that each pixel of a character is now represented by a byte instead of a bit. This means a row of character data consisting of 8 pixels now takes 8 bytes instead of 1 byte (8 bits). So to go from one pixel to the next, we need to tell the DMA controller to add 8 to the address every time when it is generating the address to write to.

That's the writing side. So lets think about the other half: Reading from the texture, and scaling it. The first observation is that the scaling should be done on the reading side, as we always want to write to each destination pixel exactly once. Thus the solution is to change how the source address is calculated for each new pixel.

If the texture is being drawn at 1:1 scale, then its simple: We can just read each successive byte of the texture data, and write it out with our x8 address generation when writing to the screen.

If the texture needs to appear bigger on the screen than it is in memory, then we need to increment the source address only sometimes, so that the same pixel values get repeated.

Conversely, to draw the texture smaller on the screen than it is in memory, then we need to increment the source address by more than 1 on average, so that we skip some pixels. (Ideally we would calculate some average of the pixel values, but that's too much work for our poor 8-bit machine, and certainly too much to ask the DMA controller to do for us).

Our solution to this problem is to allow the DMA controller to increment either source and/or destination address by between 1/256th of a byte and 255 bytes each time. This way we can skip 8 bytes at a time on the writing side, and quite smoothly scale the texture we are reading on the reading side. There will be a very small fraction of a pixel error at the end sometimes, but this will be a consistent error, and thus won't be noticeable on screen.

What is best about this, is that we can scale and draw a texture of any resolution at any zoom on any height screen, at a constant time cost of 2 cycles per pixel drawn. For a 320x200 screen, that means about 128,000 cycles. Given that the MEGA65 has at least 675,000 cycles per frame in NTSC (or 810,000 in PAL), this means that we could in theory draw the screen at full frame-rate, with some margin to spare.

This compares very favourably with any other possible method, which presumably would have to involve at least one load and store operation (and probably a lot more), resulting in a cost well north of 9 cycles per pixel drawn. I saw the difference first-hand when I was porting the engine, as I didn't initially have the DMA painting implemented. Without the help of DMA, it would take several seconds to draw a single frame, i.e, about 200 PAL frames to draw the screen. With the DMA painting implemented, I was able to achieve a frame rate of up to 10 frames per second, i.e., about 5 PAL frames to draw each frame. This means that our DMA method is about 40x faster than without! That's on top of the 40x speed up thanks to the CPU speed.

So for this particular case, we are able to outperform the C64 by a factor of about 1,600x! That's the power of custom-chips. Interestingly we are also a lot faster than the best Amiga Wolfenstein-type engines I have found. This is to be expected because the full-colour text mode effectively gives us a "chunky" video mode -- but with the extra advantage that we can arrange the screen the way we did, so that the striding between addresses is constant, and thus faster than even a PC of the era could achieve.

But we are getting ahead of ourselves here -- we started talking about how to do the texture painting, and are now talking about a complete engine, without me showing you the steps in between...

So let's talk about Wolfenstein type-engines and how they work. They were the first successful 1st person "3d" engines that allowed the user to look in any direction. They did this by "casting" light "rays" in various directions, and working out where they hit, and then drawing that thing. They also had to work out how far away the thing the light ray hit is, so that they could work out how big to draw it, to simulate the impression of distance. One light ray was cast per column of pixels on the screen, thus the interest in drawing those columns quickly.

The algorithms and optimisations that they make to draw quickly are quite interesting, and there is plenty of information on them out there, as well as example engines. I tried a couple before settling on one that was designed for fast fixed-point calculation (as compared to floating point), since the MEGA65 doesn't have a floating point math unit. I then forked it on github, and set about making work on the MEGA65.

Looking back at the commit log, it took less than 4 hours to adapt it to draw a display on the MEGA65, including using the DMA drawing method described above. It took a further couple of days to fix a few bits and pieces, implement movement around the map using the cursor keys, including bumping into walls and the like.

The sky texture which adds a great feeling of realism and helps to orient the player in terms of which direction they are facing turned out to be super easy to do: Because the sky is "vertical" we can just have a loop texture, and draw it in each column of the screen, down to where the textured wall should appear. The floor on the other hand is a pain, because in this view, the ground is horizontal, and so drawing it in the same way as the sky looks quire weird when you walk forwards or backwards. When you stay still it looks pretty good, though. To help make it look better, I rotate it, but its still not perfect, and there are bugs in the way I am doing it on top of that, where it scrolls in the opposite direction it should when you walk in certain directions. But its a start.

The keyboard input was quite interesting to get right, as for intuitive movement, I wanted to know whether particular keys were being held down, not just pressed. This meant that the MEGA65's hardware accelerated keyboard scanner was only of limited use, because it doesn't indicate when keys are released. However it does provide separate access to the keyboard matrix in a convenient way, similar to how the C64 KERNAL scans the keyboard, but in a separate register pair, so that you can scan joysticks and keyboard simultaneously. As the MEGA65 keyboard is fully diode protected, it is possible to test the state of every single key independently. Thus I implemented both WASD and cursor key control. For good measure I also implemented joystick in port 2. I also implemented 1351 or amiga mouse in port 1 to control the direction you point.

As the sample map that the engine I used was a bit boring, I decided to replace it with a random maze generator. I spent more time on that than intended, until I eventually found a good and complete pseudo-code description that allowed me to implement a working maze generator. The result is in mazegen.c in the repo.

As I wanted to support mazes of various sizes, I decided to use a slab of memory above 64K to hold the information about each maze cell, and make a routine to look up values from that using the lpeek() function in the MEGA65 libc. However, I found that the drawing speed of the engine went through the floor from about 10 FPS to about 2 or 3. This is because checking if a particular block is empty or not occurs right in the heart of the ray casting algorithm. Thus doing a complicated memory fetch was slowing it down.

Worse, as CC65 is horribly slow at passing arguments to functions, that fetch was taking even longer. So I implemented a little bitmap that had just whether a given cell was empty or not in it, and put that at $C000, so that a simple memory read could be used. I wrapped that in a C Pre-processor macro so that there would be no function calls, and the frame-rate improved quite a bit, up to maybe 6 or 7 frames per second.

Along the way I also changed a few routines that were performing multiplications to use the MEGA65's hardware multiplier. But the core routines are still in fairly boring C. I suspect that with some hand-tweaking, it should be possible to make it quite a bit faster than it is now.

Once I had the core engine working, I started thinking about how I would do titles and other text that needs to appear in the game. I could draw them over the rendered 3D display, but that would cause a lot of problems. At the moment, because the rendering is so fast, it looks fine to just render from left-to-right continuously, without using double-buffering. This is great, because I don't have a easily have a spare 64KB of RAM to dedicate to the double buffer (I could ditch the ROM to gain 128KB, but that's a pretty desperate measure). But its not so good if we want to draw text over the top, as we would have to do a second layer of rendering, which would slow things down, and risk making a glitchy display.

Instead I used one of the VIC-IV features which is the ability to draw on a raster more than once. That is, we can basically ask the VIC-IV to do in hardware that which would be a pain to do in software. This works because the VIC-IV has a single raster line buffer into which it renders. If there is enough raster time, its possible to tell the VIC-IV to go back over the raster, and render some more stuff. If you set the appropriate flag, any background pixels in the material being drawn over the top are not drawn, effectively keying the new foreground content over the existing material from the first pass of rendering.

Don't worry if that sounds horribly complicated. Just think about it as a kind of "text sprite": We can put an extra layer (or more!) of text over the top of whatever is being displayed, and have the VIC-IV composite it for us in hardware.

This is the feature that I used to make it possible to put text-based titling over the 3D display. This is used in the intro screen, as can be seen in the video I made of the core of the game that I already have working:

[External Media: https://youtu.be/ANaxmJfJ0VM]

This works by telling the VIC-IV that each line of the screen has 80 characters, the first 40 of which are the ones arranged in the vertical stripes for efficient texture painting. The 41st is the "GOTO" token that tells the VIC-IV to go back to near the left edge, and start drawing the 42nd to 80th over the top of the first forty. In reality there are 2 bytes per character, for a total of 160 bytes per line, beause we need to use 16-bit text mode to access this feature (and to have more than 256 characters on the screen, which is needed for the 3D display which uses 1000 different "characters" to make our optimised bitmap layout).

Thus we need to tell the VIC-IV that each line has 160 chars. We do this by writing 160 into $D058, and 0 into $D059 (because there are less than 160 chars per line. We then put 80 into $D05E to tell the VIC-IV to draw 80 characters on each line. The ability to set these two values independently is necessary for 16-bit text mode. But it is also powerful for making screens bigger than the display area, that can then be hardware smooth-scrolled. This would be very useful for making a side-scroller, for example, as the entire map could be setup as a single giant screen, of which only part is visible at a time.

For the curious, the routines with "overlaytext" in their names in main.c have the magic. Basically it is a lot of address calculation to work out where to put a character to make it appear in the right place. Otherwise, there is the nice routine overlaytext_line_x_position(). This one is interesting, because it makes it simple to set the horizontal position where the compositing of the overlay onto the rasterline will occur. That is, it can be used as a kind of hardware smooth-scroll on a line-by-line basis. This is what is used to slide the title screen text off the right hand side when you start the game.

These features are all explained in more detail in the MEGA65 Book at GitHub - MEGA65/mega65-user-guide: MEGA65 User Guide

As well as general VIC-IV and DMA hacking, we needed random numbers for the game. So I wrote routines for generating random numbers. This includes routines that try to gather thermal entropy from the FPGA to produce true random numbers (although we still have to test that this is reliable). That process is slow, and sometimes you want reproduceable sequences of numbers, say, for consistently producing the same random maze. So I also created srand() for seeding a pseudo random number generator, and rand8(), rand16() and rand32() for obtaining 1, 2 or 4 byte random numbers. Each takes an optional argument that can be used to obtain numbers in a limited range. It works by multiplying the random number by the supplied argument using the hardware multiplier, and returning the upper bytes of the result. The result is nice and fast.

You might have noticed that in the video the game was started from C64 mode. Another little piece of work we have done is to create a wrapper programme that can be used to make any C64-mode oriented programme launchable directly from C65 mode or C64 mode. This is interesting in its own right, so I'll blog about it separately when I get a moment. But suffice to say that the current version of MEGA MAZE can be started from either C64 or C65 mode using the same programme. By naming this AUTOBOOT.C65 on a D81, we can even auto-boot the MEGA65 to run it, if it is on a disk in the floppy drive (or on the D81 image that is mounted by default).

Finally, to get music playing in the game was a bit interesting, because there isn't really space left in the first 64KB of RAM for a SID file. There is fortunately space for an interrupt handler. So I wrote a tiny little interrupt handler that banks in some of the 2nd 64KB of RAM that contains the SID file, calls the play routine, and then puts everything back before resulting the interrupt handler. It looks like this:

uint8_t music_irq[27]={
  0xa9,0xc0,      // LDA #$c0
  0xa2,0x11,      // LDX #$11
  0xa0,0x00,      // LDY #$00
  0xa3,0x00,      // LDZ #$00
  0x5c,           // MAP
  0xa9,0x00,      // LDA #$00
  0x20,0x03,0x10, // JSR $1003
  0xa9,0x00,      // LDA #$00
  0xaa,           // TAX
  0xa8,           // TAY
  0x4b,           // TAZ
  0x5c,           // MAP
  0xea,           // NOP
  0xee,0x19,0xd0, // INC $D019 to ack raster interrupt
  0x4c,0x31,0xea // JMP $EA31

I put it in a C array like this, because I copy it into place in $0340, rather than trying to have an interrupt handler call a C function, which is known to be messy in CC65. The routine itself is very simple: First, set the bottom 8KB slab to be banked, and the bank address to be +$1C000. This causes a SID file stored at $1D000 to be mapped at $1000 (since $1000 + $1C000 = $1D000). It then calls the play routine for the SID file, and then clears the mapping, acknowledges the raster interrupt, and resumes the normal interrupt handler.

There is a trick, though: This little routine has to also be installed at $1C340, so that when the first MAP instruction executes, the rest of the routine is visible in the new memory mapping. The routines that handle all this are in music.c of the raycaster repo. Those routines also handle parsing the play routine address out of the SID file, and patching the above handler to use the correct address:

void music_start(void)

  asm ("sei");


  // Set IRQ handler to music routine

  // Disable CIA interrupt, and enable raster interrupt
  asm ("cli");

I realise that this has all been a bit of a whirl-wind tour, rather than deeply worked examples of the various features. But hopefully it is enough to get you thinking about some of what is possible on the MEGA65, and some example code you can look at, as well as pointers to where it is documented in the GitHub - MEGA65/mega65-user-guide: MEGA65 User Guide repository.