Upgrading minimig with AGA capabilities, Part 1

So now that the first beta version of minimig-AGA for the MiST board is out, I thought I’d write a few words describing how I went about implementing this. In short – it was easier than I expected, but with a few very tricky parts.

A little background

As you may or may not know, Jakub Bednarski (yaqube) already upgraded the minimig design with AGA capabilities a while ago, but the code was never published, and MikeJ, who supposedly has his code, hasn’t been very clear about when, or if, the code will be released. So a month or two ago I got tired of waiting and decided to try to do it myself. At that time, I was just finishing my work on the minimig for the Altera DE1 board and was planning to continue minimig development on the very nice MiST board, which was provided by Till Harbaum.

Since there are no hardware reference manuals for the AGA chipset as there are for the OCS/ECS variants, I mostly relied on the AGA.guide document, which lists all of the new registers and bits, plus some common sense and lots of guessing. That, and the idea that the designers only made incremental changes, which were mostly compatible with the existing ECS design.

Memory bandwidth and fetch boundaries

First things first – does minimig with SDRAM have enough memory bandwidth to provide 64bit data to the custom chips fast enough? This is the most important thing, because if there is not enough bandwidth, there is just no workaround and the minimig-AGA would be done.

On most, if not all, minimig designs that use SDRAM, the memory clock is an integer multiple of the minimig core clock and is equal to ~114MHz. The minimig core requires a 16-bit chunk of data to be available on each ~7MHz clock tick, and AGA raises that to a 64-bit chunk. Since the SDRAM controller on minimig runs at ~114MHz and can deliver a burst of four 16-bit words, one per clock, the bandwidth is sufficient for AGA 64-bit transfers. A 64-bit transfer takes four ~114MHz clock cycles (which equals one ~28MHz clock cycle), so the only remaining question is one of timing: does part of the transfer happen too late, that is, is the minimig core already latching the data before the whole 64 bits have been read from SDRAM? It turns out it isn’t; everything fits just fine (although I wasn’t 100% sure about that at the time).
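The arithmetic behind that conclusion can be sketched in a few lines of Python. The exact clock value (28.375MHz) and the constant names are assumptions for illustration, and the sketch deliberately ignores SDRAM command overhead such as row activation and CAS latency; it only shows that the data phase of a four-word burst fits in one ~28MHz cycle.

```python
# Back-of-the-envelope timing check. The 28.375 MHz figure is an assumed
# value for the "~28MHz" core clock; all names here are illustrative.
CORE_CLK_MHZ = 28.375               # assumed ~28MHz minimig clock
SDRAM_CLK_MHZ = 4 * CORE_CLK_MHZ    # integer multiple of the core clock
SLOT_CLK_MHZ = CORE_CLK_MHZ / 4     # ~7MHz chipset tick

WORD_BITS = 16                      # SDRAM data width used by minimig
BURST_LEN = 4                       # words per burst


def cycles_to_ns(cycles, clk_mhz):
    """Duration of `cycles` clock periods at `clk_mhz`, in nanoseconds."""
    return cycles * 1000.0 / clk_mhz


burst_bits = BURST_LEN * WORD_BITS                    # bits per burst
burst_ns = cycles_to_ns(BURST_LEN, SDRAM_CLK_MHZ)     # data phase only
core_cycle_ns = cycles_to_ns(1, CORE_CLK_MHZ)         # one ~28MHz cycle
slot_ns = cycles_to_ns(1, SLOT_CLK_MHZ)               # one ~7MHz tick

print(f"burst: {burst_bits} bits in {burst_ns:.2f} ns")
print(f"one ~28MHz cycle: {core_cycle_ns:.2f} ns, one ~7MHz tick: {slot_ns:.2f} ns")
```

Under these assumptions the data phase of one burst occupies a quarter of a ~7MHz tick, which leaves room for the command overhead that the sketch leaves out.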

Another thing that worried me was the four fetch modes of AGA. One of them is the old ‘normal CAS’ single 16-bit transfer; that mode was already covered, as it is the same transfer used in the OCS/ECS implementation. Then there is the high-speed ‘double CAS’ 2x32-bit (64-bit) transfer, for which I had determined there is enough bandwidth. But there are also two 32-bit fetch modes, one implemented as a normal CAS 1x32-bit transfer and, curiously, one as a double CAS 2x16-bit transfer. This had me wondering: why did the designers implement two 32-bit transfer modes? Does the 2x16-bit mode allow different alignments than the 1x32-bit transfer? I think a little explanation is in order.

SDRAM allows different kinds of bursts: single transfers, a burst of four, a burst of eight or a whole-row burst. The burst is configured at SDRAM initialization and defines how many words (= units of the SDRAM data width, not necessarily 16 bits) are transferred per access. Another thing defined is how the burst is ordered relative to the address sent to the SDRAM. Usually, wrap burst is selected, which means that if the address sent to the SDRAM is xxx1, a burst of four returns data from these addresses: xxx1-xxx2-xxx3-xxx0. As you can see, the addresses wrap around the size of the burst, and in a single read / write access you cannot cross the burst boundary, which means that if the start address is xxx3, you will not be getting address xxx4 next, but address xxx0. The order of addresses is dependent on the selected burst size.
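The wrapping behaviour is easy to model. Here is a minimal Python sketch, assuming a power-of-two burst length and sequential (not interleaved) burst ordering; the function name is made up for illustration.

```python
def wrap_burst_addresses(start, burst_len=4):
    """Word addresses visited by a wrapping sequential burst.

    Assumes burst_len is a power of two. The upper address bits stay
    fixed; only the low log2(burst_len) bits count up and wrap around.
    """
    mask = burst_len - 1
    base = start & ~mask                 # aligned start of the burst block
    return [base | ((start + i) & mask) for i in range(burst_len)]


# A burst starting at ...1 wraps back to ...0 at the end:
print([hex(a) for a in wrap_burst_addresses(0x101)])
# A burst starting at ...3 never reaches ...4:
print([hex(a) for a in wrap_burst_addresses(0x103)])
```

The second call is the problematic case: a fetch that starts on the last word of a burst block gets the first word of the same block next, not the first word of the following block.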

To return to the problem mentioned above: I wondered, what if the 2x16-bit transfer, in contrast to the 1x32-bit one, allowed 32-bit fetches to be aligned on a 16-bit boundary? That would mean that in some cases the SDRAM would return wrong data, reading addresses xxx3, xxx0 instead of xxx3, xxx4. Luckily, I haven’t come across such a case, and it doesn’t seem to be used much by AGA software, as the great majority of software I tested just uses 64-bit fetches to maximize the bandwidth available to the CPU and to DMA channels other than sprites and bitplanes.

Single clock design

Right after confirming that the SDRAM bandwidth is sufficient, I started the perilous task of converting the whole minimig core to use a single ~28MHz clock. The original minimig design used two clocks, a ~7MHz one and a 4x faster ~28MHz one, together with some special clock enable signals. Since it’s always better to use a single clock in a design if possible, I decided to convert the minimig code to a single ~28MHz clock and add a ~7MHz clock enable where needed. At the time I was also thinking that a lot more of the AGA code would require the ~28MHz clock instead of the ~7MHz one, but that didn’t turn out to be completely true. This is really annoying work, with plenty of chances to make mistakes if one is not careful, and with bugs that could potentially go undiscovered for a long time. To minimize the chances for mistakes, I also split the source files that contained multiple Verilog modules, which is good practice anyway. Unfortunately, that makes my sources somewhat less ‘compatible’ with the original minimig code, but I thought that was a sacrifice worth making.

Starting from the back – upgrading Amber to handle 24bit colors

To remain compatible with the ECS minimig design as long as possible, I figured it was best to start from the end and work my way to the beginning. That means starting with Amber, which is the last module on the video signal path in the minimig design. Amber works much like the real Amber in the Amiga 3000: it scandoubles the video signal, doubling the lines, and so converts the 50Hz (or 60Hz for NTSC), 15kHz signal into a more modern-monitor-friendly 50Hz/30kHz. The Amber module also contains logic for smoothing the video signal in the horizontal and vertical directions and can produce scanlines. The changes to this module were very simple: extending the data width from 12 bits of color data, as used on OCS/ECS Amigas, to AGA’s 24-bit (8 bits per color) width. Since the MiST board only has 18-bit color output, dither was also added to reduce the 24-bit data to 18 bits. The dither combines spatial and temporal dithering with random threshold dithering and produces a quite nice image with little to no visible banding. There are some problems with linearity, as the video DAC on the MiST board is made from resistors, which are very hard to match for 6 bits of data. I can see some bands on my board, but Till reported a completely fine picture. Alastair also recommended using a higher drive strength for the FPGA’s VGA pins, which showed a nice improvement in picture quality.
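As a rough illustration of the spatial part only, here is a minimal Python sketch of a 2x2 ordered dither that reduces one 8-bit color channel to 6 bits. It is not the dither used in the minimig-AGA core, which also mixes in temporal and random thresholds; the threshold matrix and the function name are assumptions made for this example.

```python
# 2x2 ordered-dither thresholds covering the two bits that get dropped.
# This matrix is an illustrative choice, not taken from the minimig sources.
BAYER_2X2 = ((0, 2),
             (3, 1))


def dither_8_to_6(value, x, y):
    """Reduce an 8-bit channel value to 6 bits at pixel position (x, y).

    The position-dependent threshold is added before truncating, so a
    flat 8-bit level is rendered as a mix of two neighbouring 6-bit
    levels whose average approximates the original value.
    """
    threshold = BAYER_2X2[y & 1][x & 1]
    return min((value + threshold) >> 2, 63)   # clamp to the 6-bit range


# A flat area of level 130 (32.5 in 6-bit units) becomes a 32/33 pattern:
block = [[dither_8_to_6(130, x, y) for x in range(2)] for y in range(2)]
print(block)
```

Averaged over the 2x2 block, the output levels add back up to the original 8-bit value, which is what hides the banding.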

Extending the color lookup tables

Next (or more correctly, previous) module on the signal path is Denise with its CLUTs. Since the Amiga uses indexed colors, the bitplane data is used to look up the RGB color values in a table. For OCS/ECS Amigas, the CLUT is a 32-entry, 12-bit table (32x12 bits), which means that each pixel can select one of the 32 entries in the CLUT, and the CLUT data can define one of 4096 possible colors. For AGA, the CLUT was upgraded to 256 entries of 25 bits each. In the ECS minimig design, the CLUT was implemented as an asynchronous memory, which in simple terms means that reads are ‘instantaneous’: the data from the memory is produced in the same cycle as it is addressed. Most (all?) Altera FPGAs don’t have async memory blocks, which means asynchronous memory is implemented as a bunch of registers (which can be read in the same cycle), together with a large multiplexer. This wouldn’t work very well for AGA’s 256×25-bit CLUTs, as that would mean that 6400 registers (and that means 6400 LEs) would be used. For reference, that is almost three times the size of the TG68 CPU core. So the CLUTs for AGA had to be replaced with synchronous memory blocks, which the Cyclone III FPGA device used on the MiST board has plenty of. Unfortunately, that means that the color data now arrives one clock too late, but luckily that doesn’t matter much, as this is almost at the end of the video data path and almost no signals depend on it. As it turns out, there was extra logic handling EHB and HAM modes in the ECS minimig design, which was implemented as registers, and it was simple enough to ‘steal’ the clock delay from this logic by implementing it as a combinatorial block. Different approach, same delay: problem solved (it is far from being solved, but at least the basics work)!
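The one-clock difference between the two memory styles can be shown with a toy Python model. The class names are made up for illustration and have nothing to do with the actual Verilog sources; the model only captures read latency, not writes or FPGA resource usage.

```python
class AsyncClut:
    """Register-based table: data is available in the addressing cycle."""

    def __init__(self, contents):
        self.mem = list(contents)

    def read(self, addr):
        return self.mem[addr]


class SyncClut:
    """Block-RAM style table: data appears one clock after the address."""

    def __init__(self, contents):
        self.mem = list(contents)
        self.q = 0                      # registered output, 0 after reset

    def clock(self, addr):
        """Simulate one clock cycle in which `addr` is presented.

        Returns what downstream logic sees during this cycle, which is
        the data for the address presented in the previous cycle.
        """
        visible = self.q
        self.q = self.mem[addr]         # becomes visible next cycle
        return visible


table = [i * 3 for i in range(256)]     # arbitrary test contents
addresses = [5, 6, 7]

async_out = [AsyncClut(table).read(a) for a in addresses]
sync = SyncClut(table)
sync_out = [sync.clock(a) for a in addresses]

print(async_out)    # data lines up with the addresses
print(sync_out)     # same data, shifted by one cycle

# Register cost of the asynchronous approach, per the sizes in the text:
print(32 * 12, "registers for ECS,", 256 * 25, "for AGA")
```

Any logic that follows the table has to absorb that one-cycle shift somewhere, which is what moving the EHB/HAM logic from registers to a combinatorial block achieves.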

Other Denise bits & pieces

Leaving the upgraded bitplane and sprite logic for now, there’s a bunch of new bits and registers for Denise in the AGA version, but the most important ones handle writing to and reading from the extended color table. Because of the lack of space in the register map, the AGA designers decided to implement access to AGA’s 256-entry CLUT through a 32-location window. This also keeps things backward-compatible, as the same register addresses used for accessing the CLUT on the ECS chipset are re-used for the new one. The other CLUT entries are accessed with the three BANK bits in BPLCON3, extending the old 5-bit CLUT address to 8 bits. Since the data width is also extended in AGA, there is an additional bit, LOCT, in BPLCON3, which defines which half of the CLUT data is written. If it is set to 0, both the upper and lower parts of the CLUT entry are written with the same data, which is a nice way to remain backward-compatible and provide full-range color output (duplicating the 12-bit data is *exactly* what is needed to extend 12-bit data to a 24-bit output when you want full-range output). If it is set to 1, only the lower half of the CLUT entry is written, providing 24-bit color to AGA-aware software.
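The windowed access and the LOCT behaviour described above can be sketched as a small Python model. This is an illustration built from the description in this post, not a transcription of the Verilog; the class and parameter names are assumptions, BANK and LOCT are passed in as plain arguments instead of being decoded from BPLCON3, and the 25th bit of each entry is ignored.

```python
class AgaClutModel:
    """Toy model of a 256-entry, 24-bit color table (0xRRGGBB per entry)."""

    def __init__(self):
        self.rgb = [0] * 256

    def write(self, reg_index, data12, bank, loct):
        """Write 12-bit color data 0xRGB through the 32-register window.

        reg_index: 0..31, position inside the window
        bank:      0..7, value of the three BANK bits
        loct:      0 writes high and low nibbles, 1 writes low nibbles only
        """
        index = ((bank & 0x7) << 5) | (reg_index & 0x1F)   # 3 + 5 = 8 bits
        r = (data12 >> 8) & 0xF
        g = (data12 >> 4) & 0xF
        b = data12 & 0xF
        if loct:
            # Keep the high nibbles, replace only the low nibbles.
            high = self.rgb[index] & 0xF0F0F0
            self.rgb[index] = high | (r << 16) | (g << 8) | b
        else:
            # Duplicate each nibble so 0xF maps to 0xFF (full range).
            self.rgb[index] = ((r << 20) | (r << 16) |
                               (g << 12) | (g << 8) |
                               (b << 4) | b)
        return index


clut = AgaClutModel()
idx = clut.write(1, 0xF80, bank=2, loct=0)   # ECS-style write
print(idx, hex(clut.rgb[idx]))
clut.write(1, 0x123, bank=2, loct=1)         # refine the low nibbles
print(hex(clut.rgb[idx]))
```

An ECS program that never touches LOCT gets full-range 24-bit colors from its 12-bit writes, while AGA-aware software follows each write with a second one, LOCT set, to fill in the low nibbles.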

There is also a new register, BPLCON4, providing eight BPLAM bits, which are XOR-ed with the bitplane data before entering the CLUT, allowing fast ‘page-flipping’ of colors. The other two fields are ESPRM and OSPRM, four bits each, providing the upper CLUT address bits for even and odd sprites respectively.
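In Python terms, the two mechanisms amount to the following. This is a minimal sketch with made-up function names; it assumes an 8-bit bitplane index, a 4-bit sprite color, and 4-bit ESPRM/OSPRM fields supplying the upper four address bits, as the field layout in AGA.guide suggests.

```python
def bitplane_clut_index(pixel, bplam):
    """CLUT address for a bitplane pixel: 8-bit pixel value XOR BPLAM."""
    return (pixel ^ bplam) & 0xFF


def sprite_clut_index(sprite_color, sprm):
    """CLUT address for a sprite pixel.

    sprite_color: 4-bit color produced by the sprite logic
    sprm:         4-bit ESPRM (even sprites) or OSPRM (odd sprites) value,
                  used as the upper four address bits
    """
    return ((sprm & 0xF) << 4) | (sprite_color & 0xF)


# Changing BPLAM from 0x00 to 0x20 moves every pixel to a different
# set of CLUT entries without touching bitplane data:
print(hex(bitplane_clut_index(0x05, 0x00)), hex(bitplane_clut_index(0x05, 0x20)))
# Sprite color 3 with an upper-bits field of 0x1 selects entry 0x13:
print(hex(sprite_clut_index(0x3, 0x1)))
```

Because XOR is its own inverse, writing the same BPLAM value back restores the original mapping, which is what makes it usable for quick palette switching.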

Continued in Part 2.
