Upgrading minimig with AGA capabilities, Part 2

So, to continue from Part 1, the rest of additions in Denise.

Denise bitplane updates

First, of  course, the bitplane count needs to be upgraded from ECS’ 6 bitplanes to AGA’s 8 bitplanes, together with everything dependent on it, like collision detection for 8 bitplanes, two additional bitplane output buffer registers, playfield support, etc. Quite a lot of code changes in this step, but not hard to do – basically, I just followed the existing code and used common sense and the AGA.guide document.

The tricky part is the 64bit bitplane shifter and accompanying logic. This requires upgrading the existing 16bit output shifter to 64bits, together with new scroller implementation. I spent quite some time on this, and I know it doesn’t work correctly yet, as some games scroll very strangely – moving a few pixels in one direction and then ‘jumping’ back a lot of pixels and so on. The scroller will definitely need to be fixed sometime in the future. The scroller / shifter implementation seems to be dependent on selected resolution *and* the fetch mode, I just don’t see how it fits together. There’s also some undocumented ‘extra_delay’ involved here, I just don’t understand it yet nor see where it comes from.

Denise sprite updates

Sprites are somewhat easier, once you realize how the 64bit fetches fit in. The sprite output data gets extended from four bits to eight bits by adding four new OSPRM/ESPRM bits to MSB of the output. The trick is that the 64bit fetches are only for the sprite data, not for the sprite control word – I figured that out after reading some demo coder description of how the sprite control & data need to be laid out in memory for wide sprites to work. Still missing is the part of sprite scandoubling that can ‘double’ the sprite horizontally, I haven’t figured that out yet.

Agnus bitplane DMA updates

This is the part I spent the most time on. Agnus contains a bitplane DMA sequencer, that does different DMA sequences according to resolution and selected number of bitplanes. To support AGA’s eight bitplanes and two additional fetch modes, the sequencer had to be extended with the new sequences. I studied this a lot, used many pieces of paper drawing diagrams and sequences which could work. In the end, I decided for the simplest approach with the least changes required – as I did for every other AGA change, and it seemed to work. It works very simply now: the sequencer has five ‘programs’, which includes all resolutions (lores, hires & shres) and all three fetch modes. All odd planes come first, followed by even planes, with plane 0 being always the last, since fetching plane 0 starts the parallel-to-serial converters in Denise (and enables sprites!). Another trick is that two of the sequences have free cycles following bitplane DMA cycles. The sequence length is also increased from ECS’ eight cycles to AGA’s 32 cycle sequences.

I made a program visualizing the five different encodings of the sequencer:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main()
{
unsigned int ddfseq;

printf("old sequencer:\n");
printf("ddfseq shres hires lores\n");
for (ddfseq=0; ddfseq<8; ddfseq++) {
unsigned int ddfseq_neg = (~ddfseq);
unsigned int shres = (((ddfseq_neg&1)?1:0)<<0);
unsigned int hires = (((ddfseq_neg&1)?1:0)<<1) | (((ddfseq_neg&2)?1:0)<<0);
unsigned int lores = (((ddfseq_neg&1)?1:0)<<2) | (((ddfseq_neg&2)?1:0)<<1) | (((ddfseq_neg&4)?1:0)<<0);
printf("%01d %01d %01d %01d\n", ddfseq, shres, hires, lores);
}
printf("\n");

printf("new sequencer:\n");
printf(" mode 1 = 2-fetch sequence (SHRES FMode = 0)\n");
printf(" mode 2 = 4-fetch sequence (HRES FMode = 0, SHRES FMode = 1)\n");
printf(" mode 3 = 8-fetch sequence (LRES FMode = 0, HRES FMode = 1, SHRES, FMode = 3)\n");
printf(" mode 4 = 8-fetch sequence followed by 8 free cycles (LRES FMode = 1, HRES FMode = 3)\n");
printf(" mode 5 = 8-fetch sequence followed by 24 free cycles (LRES FMode = 3)\n");
printf("ddfseq 01 02 03 04 05\n");
for (ddfseq=0; ddfseq<32; ddfseq++) {
unsigned int ddfseq_neg = (~ddfseq);
unsigned int m1 = (((ddfseq_neg&1)?1:0)<<0);
unsigned int m2 = (((ddfseq_neg&1)?1:0)<<1) | (((ddfseq_neg&2)?1:0)<<0);
unsigned int m3 = (((ddfseq_neg&1)?1:0)<<2) | (((ddfseq_neg&2)?1:0)<<1) | (((ddfseq_neg&4)?1:0)<<0);
unsigned int m4 = (((ddfseq &8)?1:0)<<3) | (((ddfseq_neg&1)?1:0)<<2) | (((ddfseq_neg&2)?1:0)<<1) | (((ddfseq_neg&4)?1:0)<<0);
unsigned int m5 = (((ddfseq &16)?1:0)<<4) | (((ddfseq &8)?1:0)<<3) | (((ddfseq_neg&1)?1:0)<<2) | (((ddfseq_neg&2)?1:0)<<1) | (((ddfseq_neg&4)?1:0)<<0);
printf("%02d %02d %02d %02d %02d %02d\n", ddfseq, m1, m2, m3, m4, m5);
}

exit(EXIT_SUCCESS);
}

I made one mistake in the bitplane DMAs implementation that took me a while to fix – I forgot to change how much the bitplane pointers advanced according to fetch mode. I made numerous tests in AsmOne trying the 64bit bitplane fetches, and they simply didn’t work, but after many reviews of the code I saw my mistake – the code needs to advance the bitplane pointers by 1 for 16bit fetches, 2 for 32bit fetches and 4 four 64bit fetches. Duh 😉 After fixing the bitplane modulos, the bitplane DMAs were ready.

Agnus sprite updates

After figuring out the bitplane DMA updates, the sprite DMA was pretty easy – just advance the sprite data pointers by an appropriate amount, at that is it!

Aaand …. done (almost)!

One of the last thing I changed was the Denise ID register, updating it with the proper AGA ID value. Once you do that and start Workbench, SetPatch will detect an AGA Amiga and try to enable 64bit bitplane fetch mode. Let’s just say the result wasn’t very good at first, but after lots of small bits fixed here and there, I got a perfect-looking desktop with 4x fetch mode enabled! Even some games worked, although most of them were a little slow, since the CPU bandwidth to the chipRAM and kickstart is still as slow as on ECS Amigas (the CPU on AGA Amigas can access chipRAM through a 32bit bus, also kickstart uses two ROM ICs, making kickstart accessible through a 32bit bus also).

That concludes the current minimig AGA implementation. There are still some missing features, like bitplane & sprite scandoubling, plus extending the CPU speed when accessing chipRAM & kickstart, like mentioned above. All in all, I spent around two weeks working on this, a week of afternoon work for upgrading the minimig design to a single 28MHz clock and a week when I was on sick leave for most of the AGA stuff. The bitplane DMAs and the bitplane output shifters were definitely the most tricky parts to get right.

And for any reader that managed to read through this very long two-post rambling – congratulations for your persistence! I’ll invite you to a beer if we ever happen to meet!

If you want to take a look at the minimig code, my repository is here: https://github.com/rkrajnc/minimig-mist

Upgrading minimig with AGA capabilities, Part 1

So now that the first beta version of minimig-AGA for the MiST board is out, I thought I’d write a few words describing how I went about implementing this. In short – it was easier than I expected, but with a few very tricky parts.

A little background

As you may or may not know, Jakub Bednarski (yaqube) already upgraded the minimig design with AGA capabilities a while ago, but the code was never published, and MikeJ, who supposedly has his code isn’t very clear when and if the code will be released. So a month or two ago I got tired of waiting and decided to try to do it myself. At that time, I was just finishing my work on the minimig for the Altera DE1 board and was planning to continue minimig development on the very nice MiST board, which was provided by Till Harbaum.

Since there is no hardware reference manuals for the AGA chipset as there is for the OCS/ECS variants, I mostly relied on the AGA.guide document, which contains all of the new registers and bits, some common sense and lots of guessing when I implemented this. That, and the idea that the designers only made incremental changes, which were mostly compatible with the existing ECS design.

Memory bandwidth and fetch boundaries

First things first – does minimig with SDRAM have enough memory bandwidth to provide 64bit data to the custom chips fast enough? This is the most important thing, because if there is not enough bandwidth, there is just no workaround and the minimig-AGA would be done.

On most, if not all minimig designs that use SDRAM memory, the memory clock is a integer multiple of the minimig core speed and is equal to ~114MHz. The minimig core requires that a 16bit chunk of data is available each ~7MHz clock tick, with AGA upgrading that to a 64-bit chunk. Since the SDRAM controller implementation on minimig works at ~118MHz and can deliver a burst of four 16-bit data packets each clock, the bandwidth is sufficient for AGA 64-bit transfers. Since a 64-bit transfer happens in four ~118MHz clock cycles (which equals one ~28MHz clock cycle), there is just a question of timing – does part of the transfer happen to late – that is, is the minimig core already latching the data before the whole 64-bits are read from SDRAM? It turns out it doesn’t, everything fits just fine (although I wasn’t 100% sure about that at the time).

Another thing that worried me was the four fetch modes of AGA. One of them is the old ‘normal CAS’ single 16bit transfer, that mode was already covered, as that is the same transfer used in OCS/ECS implementation. Than there is the high-speed ‘double CAS’ 2x32bit (64bit) transfer, for which I determined there is enough bandwidth. But, there are two 32bit fetch modes, one implemented as normal CAS 1x32bit transfer and, curiously, a double CAS 2x16bit transfer. This one had me wondering – why did the designers implement two 32bit transfer modes, does the 2x16bit mode allow different alignments than the 1x32bit transfer? I think a little explanation is in order.

SDRAM memory allows different kinds of bursts, they can either be single transfers, a burst of four, a burst of eight or a whole row burst. A burst is configured at SDRAM initialization and defines how many words (= pieces of SDRAM data size, not necessarily 16bit). Another thing defined is how the access to the burst is treated in regards to the address sent to SDRAM. Usually, wrap burst is selected, which means that if the address sent to the SDRAM is xxx1, the 4-sequence of data returned will be with this addresses: xxx1-xxx2-xxx3-xxx0. As you can see, the addresses wrap around the size of the burst, and in a single read / write access, you cannot cross the burst boundary, which means that if the start address is xxx3, you will not be getting address xxx4 next, but address xxx0. The order of addresses is dependent on the selected burst size.

To return to the problem mentioned above, I wondered what if the 2x16bit transfer in contrast to the 1x32bit one allowed 32bit fetches to be aligned on a 16-bit boundary? That would mean that in some cases, the SDRAM would return wrong data, reading address xxx3, xxx0 instead of xxx3, xxx4. Luckily, I haven’t come across such case and it doesn’t seem to be used much with AGA software, as a great majority of software I tested just uses 64bit fetches to maximize available bandwidth for CPU and DMAs other than sprites and bitplanes.

Single clock design

Right after confirming that the SDRAM bandwidth is sufficient, I started the perilous task of converting the whole minimig core to use a single ~28MHz clock. The original minimig design used two clocks, one ~7MHz one and 4x one, ~28MHz, together with some special clock enable signals. Since it’s always better to use a single clock in a design if possible, I decided to convert the minimig code to a single ~28MHz clock and add a ~7MHz clock enable where needed. At the time I was also thinking that a lot more of the AGA code will require the ~28MHz clock instead of the ~7MHz one, but that didn’t turn out to be completely true. This is some really annoying work, with chances to make lots of mistakes if one is not careful, with bugs that could potentially go undiscovered for a long time. To minimize the chances for mistakes, I also split the source code where multiple Verilog modules were in the same file, which is good practice anyway. Unfortunately, that makes my sources somewhat less ‘compatible’ with the original minimig code, but I thought that was a sacrifice worth making.

Starting from the back – upgrading Amber to handle 24bit colors

To remain compatible with the ECS minimig design as long as possible, I figured it is best to start from the end and work my way to the beginning. That means starting with Amber, which is the last module of the video signal in the minimig design. Amber works quite like the real Amber in Amiga 3000 – it scandoubles the video signal, doubling the lines and so converts the 50Hz (or 60Hz for NTSC), 15kHz signal into a more modern-monitor-friendly 50Hz/30kHz. The Amber module also contains logic for smoothing the video signal in horizontal and vertical directions and can produce scanlines. The changes to this module were very simple – extending the data width from 12bits of color data, as used on OCS/ECS Amigas, to AGA’s 24-bit (8bits per color) width. Since the MiST board only has 18bit color output, dither was also added to reduce the 24bit data to 18bits. The dither is quite nice, using spatial and temporal dithering, mixed with random threshold dithering to produce a quite nice image with little to no banding visible. There are some problems with linearity, as the video DAC on the MiST board is made from resistors, which are very hard to match for 6bits of data. I can see some bands on my board, but Till reported a completely fine picture. Alastair also recommended using bigger drive strength for the FPGA’s VGA pins, which showed a nice improvement in picture quality.

Extending the color lookup tables

Next (or more correctly, previous) module on the signal path is Denise with its CLUTs. Since Amiga uses indexed colors, that means the bitplane data is used to lookup the RGB color values in a table. For OCS/ECS Amigas, the CLUT is a 32-entry, 12bit table (32x12bit), which means that each pixel can select one of the 32 entries in the CLUT, and the CLUT data can define one of 4096 possible colors. For AGA, the CLUT was upgraded to 256-entry, 25-bit. In the ECS minimig design, the CLUT was implemented as an asynchronous memory, which in simple terms means that reads are ‘instantaneous’ – the data from the memory is produced in the same cycle as it is addressed. Most (all?) Altera FPGAs don’t have async memory implemented, which means asynchronous memory is implemented as a bunch of registers (which can be read in the same cycle), together with a large demux. This wouldn’t work very well for AGA’s 256×25 bit CLUTs, as that would mean that 6400 registers (at that means 6400 LEs) would be used. For a reference, that is almost three times the size of the TG68 CPU core. So the CLUTs for AGA had to be replaced with a synchronous memory block, which the Cyclone III FPGA device, as used on the MiST board, has plenty of. Unfortunately, that means that the color data is now arriving one clock too late, but luckily, that doesn’t matter much, as this is almost at the end of the video data path, and almost no signals depend on it. As it turns out, there was extra logic handling EHB and HAM modes in the ECS minimig design, which was implemented as registers, and it was simple enough ‘stealing’ the clock delay from this logic, and implementing it as a combinatorial block. Different approach, same delay – problem solved (it is far from being solved, but at least the basics work)!

Other Denise bits & pieces

Leaving the upgraded bitplane and sprite logic for now, there’s a bunch of new bits and registers for the Denise in the AGA version, but the most important ones are for handling writing to and reading from the extended color table. To remain backward-compatible (and because of lack of space in the register map), the AGA designers decided to implement access to the AGA’s 256-entry CLUT through a 32-location window. This allowed for backward-compatibility, as the same register addresses used for accessing CLUT on the ECS chipset are re-used for the new one. The other CLUT entries are accessed with the three BANK bits in BPLCON3, extending the old 5bit CLUT address to 8bits. Since the data width is also extended in AGA, there is an additional bit LOCT in BPLCON3 which defines in which half of the CLUT the data is written to. If set to 0, both upper and lower part of the CLUT is written with the same data, which is a nice way to remain backward-compatible and provide full-range color output (duplicating 12bit data two times is *exactly* what is needed to extend 12bit data to a 24bit output when you want full-range output). If set to 1, only the lower half of the CLUT is written, providing 24bit color to AGA-aware software.

There is also a new register BPLCON4, providing eight BPLAM bits, which are XOR-ed with the bitplane data before entering the CLUT, allowing fast ‘page-flipping’ of colors. The other two bits are ESPRM and OSPRM, providing MSB bits for even and odd sprites respectively.

Continued in Part 2.

HRTmon & Updating NMI vector

One of the features added on the 68010+ Motorola CPUs is a special register called VBR – Vector Base Register, which allows the OS or user program to move the CPU’s vector table away from 68000‘s fixed address $00000 to some other address. On the Amiga, this was usually used to move the vector table to fastRAM, enabling somewhat faster interrupt processing. One of the utilities that allowed this was VBRMove.

I usually have VBRMove installed on my minimig (you can never have enough speed!), and when I implemented HRTmon for the minimig-de1 project, this posed a problem, as once you move the vector table, the HRTmon handling code can no longer detect that access to the NMI vector (vector table offset $7c) has happened. HRTmon code uses a special hook that detects the start of the NMI, which on minimig-de1 is mapped to a button (F11) on the keyboard. In the event of a NMI button being pressed, the Verilog cart code waits for the CPU to access RAM location $7c and instead of allowing the CPU to read the specified RAM location, overrides decoding the address as a RAM access and returns the address of the HRTmon ROM entry point to the CPU. This way no program can disable HRTmon by writing over the NMI vector location, unless of course it does something really mean, like pointing the stack to an odd address.

Since the cart verilog code is inside the minimig design, it can only see the CPU accessing NMI vector if it resides in chipRAM, because the TG68K CPU core has a special bus for fastRAM that completely bypasses the minimig core. I should really fix this someday by moving the NMI detection code in the TG68K core wrapper, but for now I wrote a simple utility that can be used if the vector table resides in fastRAM.

It works by putting the CPU in the Superuser state, reading the VBR, and updating the NMI vector location (VBR+$7c) to point to HRTmon entry, as used in minimig. It than enters user mode and exits. Simple!

; SetNMI.s
; 2013, rok.krajnc@gmail.com
; gets VBR and updates NMI vector to address of HRTmon entry point

execbase = 4
superstate = -150
userstate = -156
NMI_vec = $7c
HRTmon_entry = $00a0000c

EnterSuper:
 move.l execbase,a6
 jsr superstate(a6)
 move.l d0,SaveSP
SetNMI:
 movec vbr,d0
 add.l #NMI_vec,d0
 move.l d0,a6
 move.l #HRTmon_entry,(a6)
EnterUser:
 move.l execbase,a6
 move.l SaveSP,d0
 jsr userstate(a6)
Return:
 rts
SaveSP: blk.l 1

Compiled program attached bellow: