Upgrading minimig with AGA capabilities, Part 1

So now that the first beta version of minimig-AGA for the MiST board is out, I thought I’d write a few words describing how I went about implementing this. In short – it was easier than I expected, but with a few very tricky parts.

A little background

As you may or may not know, Jakub Bednarski (yaqube) already upgraded the minimig design with AGA capabilities a while ago, but the code was never published, and MikeJ, who supposedly has his code, isn’t very clear about when and if the code will be released. So a month or two ago I got tired of waiting and decided to try to do it myself. At that time, I was just finishing my work on the minimig for the Altera DE1 board and was planning to continue minimig development on the very nice MiST board, which was provided by Till Harbaum.

Since there are no hardware reference manuals for the AGA chipset as there are for the OCS/ECS variants, I mostly relied on the AGA.guide document, which lists all of the new registers and bits, plus some common sense and lots of guessing when I implemented this. That, and the assumption that the designers only made incremental changes, which were mostly compatible with the existing ECS design.

Memory bandwidth and fetch boundaries

First things first – does minimig with SDRAM have enough memory bandwidth to provide 64bit data to the custom chips fast enough? This is the most important thing, because if there is not enough bandwidth, there is just no workaround and the minimig-AGA would be done.

On most, if not all, minimig designs that use SDRAM memory, the memory clock is an integer multiple of the minimig core speed and is equal to ~114MHz. The minimig core requires that a 16bit chunk of data is available each ~7MHz clock tick, with AGA upgrading that to a 64-bit chunk. Since the SDRAM controller implementation on minimig works at ~114MHz and can deliver a burst of four 16-bit words, one per clock, the bandwidth is sufficient for AGA 64-bit transfers. Since a 64-bit transfer happens in four ~114MHz clock cycles (which equals one ~28MHz clock cycle), there is just a question of timing – does part of the transfer happen too late, that is, is the minimig core already latching the data before the whole 64 bits are read from SDRAM? It turns out it doesn’t; everything fits just fine (although I wasn’t 100% sure about that at the time).
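The arithmetic behind this can be sketched in a few lines of Python. This is only a back-of-the-envelope check, assuming the standard PAL master clock of 28.37516MHz and an SDRAM clock of exactly four times that; the real PLL settings on a given board may differ, and SDRAM command latencies are ignored.

```python
# Back-of-the-envelope check of the burst timing (assumed clock values).
MASTER_HZ = 28_375_160        # PAL master clock, the "~28MHz" clock
CORE_HZ = MASTER_HZ / 4       # "~7MHz" minimig core tick
SDRAM_HZ = MASTER_HZ * 4      # "~114MHz" SDRAM clock (assumed 4x multiple)

BURST_WORDS = 4               # one 16-bit word per SDRAM clock in a burst
WORD_BITS = 16

burst_bits = BURST_WORDS * WORD_BITS
burst_time_ns = BURST_WORDS / SDRAM_HZ * 1e9      # data phase only
master_period_ns = 1 / MASTER_HZ * 1e9
sdram_clocks_per_core_tick = SDRAM_HZ / CORE_HZ

print(f"burst delivers {burst_bits} bits in {burst_time_ns:.2f} ns")
print(f"one ~28MHz period is {master_period_ns:.2f} ns")
print(f"SDRAM clocks per ~7MHz tick: {sdram_clocks_per_core_tick:.0f}")
```

The data phase of a four-word burst takes exactly one ~28MHz period, and there are sixteen SDRAM clocks in every ~7MHz tick, which leaves room for the row activate and read latency that this sketch does not model.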

Another thing that worried me was the four fetch modes of AGA. One of them is the old ‘normal CAS’ single 16bit transfer; that mode was already covered, as it is the same transfer used in the OCS/ECS implementation. Then there is the high-speed ‘double CAS’ 2x32bit (64bit) transfer, for which I had determined there is enough bandwidth. But there are also two 32bit fetch modes, one implemented as a normal CAS 1x32bit transfer and, curiously, one as a double CAS 2x16bit transfer. This one had me wondering: why did the designers implement two 32bit transfer modes? Does the 2x16bit mode allow different alignments than the 1x32bit transfer? I think a little explanation is in order.

SDRAM memory allows different kinds of bursts: single transfers, a burst of four, a burst of eight or a whole row burst. The burst is configured at SDRAM initialization and defines how many words (= pieces of SDRAM data size, not necessarily 16bit) are transferred in a single access. Another thing defined at initialization is how the burst treats the address sent to the SDRAM. Usually, wrap burst is selected, which means that if the address sent to the SDRAM is xxx1, the sequence of four words returned will come from these addresses: xxx1-xxx2-xxx3-xxx0. As you can see, the addresses wrap around the size of the burst, and in a single read / write access you cannot cross the burst boundary, which means that if the start address is xxx3, you will not be getting address xxx4 next, but address xxx0. The order of addresses depends on the selected burst size.
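The wrapping behaviour is easy to model. The following is a minimal Python sketch of a sequential wrap burst (the function names are made up for illustration, and the burst length is assumed to be a power of two); it also shows how a two-word fetch starting at xxx3 differs from a plain linear fetch.

```python
def wrap_burst(start, burst_len=4):
    """Word addresses delivered by a sequential wrap burst starting at `start`.

    burst_len must be a power of two (1, 2, 4 or 8 on typical SDRAM).
    """
    mask = burst_len - 1
    base = start & ~mask                       # burst-aligned block holding `start`
    return [base | ((start + i) & mask) for i in range(burst_len)]


def linear_fetch(start, count):
    """What a fetch that ignores burst boundaries would expect to read."""
    return [start + i for i in range(count)]


print(wrap_burst(1))          # [1, 2, 3, 0]
print(wrap_burst(3)[:2])      # [3, 0]  <- what the SDRAM returns
print(linear_fetch(3, 2))     # [3, 4]  <- what an unaligned 2x16bit fetch would need
```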

To return to the problem mentioned above, I wondered: what if the 2x16bit transfer, in contrast to the 1x32bit one, allowed 32bit fetches to be aligned on a 16-bit boundary? That would mean that in some cases the SDRAM would return wrong data, reading addresses xxx3, xxx0 instead of xxx3, xxx4. Luckily, I haven’t come across such a case, and it doesn’t seem to be used much by AGA software, as the great majority of software I tested just uses 64bit fetches to maximize the bandwidth available to the CPU and to DMAs other than sprites and bitplanes.

Single clock design

Right after confirming that the SDRAM bandwidth is sufficient, I started the perilous task of converting the whole minimig core to use a single ~28MHz clock. The original minimig design used two clocks, a ~7MHz one and a 4x faster ~28MHz one, together with some special clock enable signals. Since it’s always better to use a single clock in a design if possible, I decided to convert the minimig code to a single ~28MHz clock and add a ~7MHz clock enable where needed. At the time I was also thinking that a lot more of the AGA code would require the ~28MHz clock instead of the ~7MHz one, but that didn’t turn out to be completely true. This is really annoying work, with plenty of chances to make mistakes if one is not careful, and with bugs that could potentially go undiscovered for a long time. To minimize the chances for mistakes, I also split the source files that contained multiple Verilog modules, which is good practice anyway. Unfortunately, that makes my sources somewhat less ‘compatible’ with the original minimig code, but I thought that was a sacrifice worth making.
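To illustrate the idea of a clock enable, here is a tiny behavioural model in Python (not Verilog, and not taken from the minimig sources; the class and function names are invented for this sketch). Every register is clocked by the fast clock, but a ‘slow’ register only takes new data on ticks where the enable is high, which here is once every four fast ticks.

```python
class EnabledRegister:
    """A register clocked by the fast clock that only updates when enabled."""

    def __init__(self):
        self.q = 0

    def tick(self, d, enable):
        if enable:
            self.q = d


def slow_enable(n_ticks, divide=4):
    """One enable flag per fast-clock tick, high once every `divide` ticks."""
    for tick in range(n_ticks):
        yield tick % divide == divide - 1


reg = EnabledRegister()
history = []
for tick, enable in enumerate(slow_enable(12)):
    reg.tick(d=tick, enable=enable)    # input changes every fast tick
    history.append(reg.q)              # output only changes every fourth tick

print(history)
```

The register behaves as if it were clocked at a quarter of the rate, yet the design contains only one clock, which keeps timing analysis simple.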

Starting from the back – upgrading Amber to handle 24bit colors

To remain compatible with the ECS minimig design as long as possible, I figured it was best to start from the end and work my way to the beginning. That means starting with Amber, which is the last module on the video signal path in the minimig design. Amber works much like the real Amber in the Amiga 3000 – it scandoubles the video signal, doubling the lines, and so converts the 50Hz (or 60Hz for NTSC), 15kHz signal into a more modern-monitor-friendly 50Hz/30kHz one. The Amber module also contains logic for smoothing the video signal in the horizontal and vertical directions and can produce scanlines. The changes to this module were very simple – extending the data width from 12bits of color data, as used on OCS/ECS Amigas, to AGA’s 24-bit (8bits per color) width. Since the MiST board only has 18bit color output, dither was also added to reduce the 24bit data to 18bits. The dither combines spatial and temporal dithering with random threshold dithering and produces a pleasing image with little to no visible banding. There are some problems with linearity, as the video DAC on the MiST board is made from resistors, which are very hard to match for 6bits of data. I can see some bands on my board, but Till reported a completely fine picture. Alastair also recommended using a higher drive strength for the FPGA’s VGA pins, which gave a nice improvement in picture quality.
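As an illustration of the spatial part only, here is a minimal Python sketch of ordered dithering from 8 to 6 bits per channel. It is not the dither used in the minimig-AGA sources (which also mixes in temporal and random thresholds); the 2x2 threshold matrix and the function name are assumptions chosen to keep the example small.

```python
# 2x2 ordered-dither thresholds, one per pixel position inside a 2x2 block.
BAYER_2X2 = ((0, 2),
             (3, 1))


def dither_8_to_6(value, x, y):
    """Reduce one 8-bit channel value to 6 bits using a position-dependent threshold."""
    threshold = BAYER_2X2[y & 1][x & 1]
    return min((value + threshold) >> 2, 63)    # clamp so 255 does not overflow


# A flat area of value 130 (= 4*32 + 2) becomes a mix of 32s and 33s whose
# average over a 2x2 block is 32.5, i.e. exactly 130/4.
block = [dither_8_to_6(130, x, y) for y in range(2) for x in range(2)]
print(block, sum(block) / 4)
```

The two dropped bits are not simply thrown away: they decide in how many of the four positions the 6-bit value is rounded up, so the average brightness of a small area is preserved.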

Extending the color lookup tables

The next (or, more correctly, the previous) module on the signal path is Denise with its CLUTs. Since the Amiga uses indexed colors, the bitplane data is used to look up the RGB color values in a table. For OCS/ECS Amigas, the CLUT is a 32-entry, 12bit table (32x12bit), which means that each pixel can select one of the 32 entries in the CLUT, and the CLUT data can define one of 4096 possible colors. For AGA, the CLUT was upgraded to 256-entry, 25-bit. In the ECS minimig design, the CLUT was implemented as an asynchronous memory, which in simple terms means that reads are ‘instantaneous’ – the data from the memory is produced in the same cycle as it is addressed. Most (all?) Altera FPGAs don’t have async memory blocks, which means asynchronous memory is implemented as a bunch of registers (which can be read in the same cycle), together with a large read multiplexer. This wouldn’t work very well for AGA’s 256×25 bit CLUTs, as that would mean that 6400 registers (and that means 6400 LEs) would be used. For reference, that is almost three times the size of the TG68 CPU core. So the CLUTs for AGA had to be replaced with synchronous memory blocks, which the Cyclone III FPGA device, as used on the MiST board, has plenty of. Unfortunately, that means that the color data now arrives one clock too late, but luckily, that doesn’t matter much, as this is almost at the end of the video data path, and almost no signals depend on it. As it turns out, there was extra logic handling EHB and HAM modes in the ECS minimig design, which was implemented as registers, and it was simple enough to ‘steal’ the clock delay from this logic by implementing it as a combinatorial block. Different approach, same delay – problem solved (it is far from being solved, but at least the basics work)!

Other Denise bits & pieces

Leaving the upgraded bitplane and sprite logic for now, there’s a bunch of new bits and registers for Denise in the AGA version, but the most important ones handle writing to and reading from the extended color table. Because of the lack of space in the register map, the AGA designers decided to implement access to AGA’s 256-entry CLUT through a 32-location window. This also keeps things backward-compatible, as the same register addresses used for accessing the CLUT on the ECS chipset are re-used for the new one. The other CLUT entries are accessed with the three BANK bits in BPLCON3, extending the old 5bit CLUT address to 8bits. Since the data width is also extended in AGA, there is an additional bit, LOCT, in BPLCON3, which defines which half of each CLUT entry the data is written to. If set to 0, both the upper and the lower part of the CLUT entry are written with the same data, which is a nice way to remain backward-compatible and provide full-range color output (duplicating the 12bit data is *exactly* what is needed to extend 12bit data to a 24bit output when you want full-range output). If set to 1, only the lower half of the CLUT entry is written, providing 24bit color to AGA-aware software.
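The write path described above can be modelled in a few lines. This is a behavioural Python sketch under the assumptions stated in the text, not the Verilog from the minimig-AGA sources; the class name is invented, and the 25th bit of each entry is left out.

```python
class AgaPalette:
    """Behavioural model of CLUT writes through the 32-register window."""

    def __init__(self):
        # 256 entries, 8 bits each for R, G, B (the 25th bit is not modelled)
        self.rgb = [[0, 0, 0] for _ in range(256)]

    def write(self, reg, value12, bank=0, loct=0):
        """Write a 12bit $0RGB value to color register `reg` (0..31)."""
        index = ((bank & 7) << 5) | (reg & 31)        # 3 BANK bits + 5bit register
        nibbles = ((value12 >> 8) & 15, (value12 >> 4) & 15, value12 & 15)
        for channel, nibble in enumerate(nibbles):
            if loct:
                # LOCT=1: replace only the low nibble, keep the high one
                old = self.rgb[index][channel]
                self.rgb[index][channel] = (old & 0xF0) | nibble
            else:
                # LOCT=0: same nibble in both halves, so $F becomes $FF
                self.rgb[index][channel] = (nibble << 4) | nibble
        return index


pal = AgaPalette()
pal.write(1, 0xFFF)                        # ECS-style write: full-range white
idx = pal.write(1, 0xABC, bank=2)          # high nibbles of entry 65
pal.write(1, 0x123, bank=2, loct=1)        # low nibbles of entry 65
print(pal.rgb[1], idx, [hex(c) for c in pal.rgb[idx]])
```

An ECS program that never touches BANK or LOCT therefore still gets the full 0..255 range on each channel, while AGA-aware software writes each color twice to set all 24 bits.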

There is also a new register, BPLCON4, providing eight BPLAM bits, which are XOR-ed with the bitplane data before it enters the CLUT, allowing fast ‘page-flipping’ of colors. The other two fields are ESPRM and OSPRM, providing the MSBs of the color index for even and odd sprites respectively.
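A short Python sketch of the BPLAM idea follows (the function name is made up for illustration): the same bitplane data selects a different group of CLUT entries as soon as the XOR mask changes, without touching the bitplanes or the palette.

```python
def clut_index(pixel, bplam):
    """CLUT entry selected by a bitplane pixel value after the BPLAM XOR."""
    return (pixel ^ bplam) & 0xFF


pixels = [0, 1, 2, 3]                                  # a two-bitplane image
page_a = [clut_index(p, 0x00) for p in pixels]         # entries 0..3
page_b = [clut_index(p, 0x10) for p in pixels]         # entries 16..19
print(page_a, page_b)
```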

Continued in Part 2.

HRTmon & Updating NMI vector

One of the features added on the 68010+ Motorola CPUs is a special register called VBR – Vector Base Register, which allows the OS or a user program to move the CPU’s vector table away from the 68000’s fixed address $00000 to some other address. On the Amiga, this was usually used to move the vector table to fastRAM, enabling somewhat faster interrupt processing. One of the utilities that allowed this was VBRMove.

I usually have VBRMove installed on my minimig (you can never have enough speed!), and when I implemented HRTmon for the minimig-de1 project, this posed a problem, as once you move the vector table, the HRTmon handling code can no longer detect that an access to the NMI vector (vector table offset $7c) has happened. The HRTmon code uses a special hook that detects the start of the NMI, which on minimig-de1 is mapped to a key (F11) on the keyboard. When the NMI key is pressed, the Verilog cart code waits for the CPU to access RAM location $7c and, instead of allowing the CPU to read that RAM location, overrides the decoding of the address as a RAM access and returns the address of the HRTmon ROM entry point to the CPU. This way no program can disable HRTmon by writing over the NMI vector location, unless of course it does something really mean, like pointing the stack to an odd address.

Since the cart Verilog code is inside the minimig design, it can only see the CPU accessing the NMI vector if the vector resides in chipRAM, because the TG68K CPU core has a special bus for fastRAM that completely bypasses the minimig core. I should really fix this someday by moving the NMI detection code into the TG68K core wrapper, but for now I wrote a simple utility that can be used if the vector table resides in fastRAM.

It works by putting the CPU in the supervisor state, reading the VBR, and updating the NMI vector location (VBR+$7c) to point to the HRTmon entry, as used in minimig. It then returns to user mode and exits. Simple!

; SetNMI.s
; 2013, rok.krajnc@gmail.com
; gets VBR and updates NMI vector to address of HRTmon entry point

execbase = 4
superstate = -150
userstate = -156
NMI_vec = $7c
HRTmon_entry = $00a0000c

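EnterSuper:
 move.l execbase,a6 ; a6 = ExecBase (read from address 4)
 jsr superstate(a6) ; exec SuperState(), old stack pointer returned in d0
 move.l d0,SaveSP ; keep it for UserState()
SetNMI:
 movec vbr,d0 ; read VBR (68010+ only, privileged)
 add.l #NMI_vec,d0 ; d0 = address of the NMI (level 7) vector
 move.l d0,a6
 move.l #HRTmon_entry,(a6) ; point the vector at the HRTmon entry
EnterUser:
 move.l execbase,a6
 move.l SaveSP,d0 ; saved stack pointer for UserState()
 jsr userstate(a6) ; exec UserState(), back to user mode
Return:
 rts
SaveSP: blk.l 1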

Compiled program attached below:

The BoingBall, Part 2

So, we have rendered the frames of our BoingBall, as described in Part 1. What you want to do next is check whether the animation is seamless – that is, whether it loops without glitches. The best way is to convert the frames into a .gif or .avi animation (in my experience GIF works better) with your favorite photo / video editor. You will also need to reduce the number of colors in your frames; I used just two colors (white & red) + background. I needed a few tries to get this right:

BoingBall animation

Once the animation looks good, you first need to decide in which Amiga resolution the animation will be shown – either hires or lores. The difference is that in hires the pixels are not square, but are twice as high as they are wide, so you need to ‘squash’ the ball vertically:

BoingBall_128x64

Now you need to somehow convert the animation data into a form that can be used on the Amiga. You have to remember that, contrary to the PCs of today, which use so-called ‘chunky’ pixels, the Amiga uses a planar graphics display, which puts each bit of the pixel data into a separate plane. That allows it to save memory & bandwidth: if, for example, you only need two colors, you only require one bit per pixel instead of 8 (or even 16 or 32!).

For the conversion, I wrote a simple Python script which takes a GIF animation, splits it into frames, splits the frames into bitplanes and writes the result into a source file that can be used in the AsmOne assembler on the Amiga. The script is attached below.
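The core of such a conversion is small. The following Python sketch is not the attached script, just an illustration of the chunky-to-planar step under some assumptions: the frame is already available as a row-major list of color indexes (decoding the GIF is left out), the width is a multiple of 16 so that every bitplane row ends on a word boundary, and the helper names and the f0p0-style label are only examples.

```python
def chunky_to_planar(pixels, width, height, depth):
    """Split row-major color indexes into `depth` Amiga-style bitplanes.

    In each bitplane byte the most significant bit is the leftmost pixel.
    """
    if width % 16:
        raise ValueError("width must be a multiple of 16 pixels")
    row_bytes = width // 8
    planes = [bytearray(row_bytes * height) for _ in range(depth)]
    for y in range(height):
        for x in range(width):
            color = pixels[y * width + x]
            for plane in range(depth):
                if (color >> plane) & 1:
                    planes[plane][y * row_bytes + x // 8] |= 0x80 >> (x % 8)
    return [bytes(p) for p in planes]


def as_dc_b(label, data, per_line=8):
    """Format bitplane bytes as assembler source lines."""
    lines = [f"{label}:"]
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append(" dc.b " + ",".join(f"${b:02x}" for b in chunk))
    return "\n".join(lines)


# One 16-pixel row using colors 0..3 needs two bitplanes.
row = [0, 1, 2, 3] * 4
plane0, plane1 = chunky_to_planar(row, width=16, height=1, depth=2)
print(as_dc_b("f0p0", plane0))
print(as_dc_b("f0p1", plane1))
```

Bit 0 of every color index ends up in the first bitplane and bit 1 in the second, which is why the three-color BoingBall frames fit into two planes.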

I also made a minimig logo, which you can see below:

Minimig logo

Now comes the fun part – writing a program that will show this animation on the Amiga – or in this case, the Minimig board. We want this animation to work very early in the Minimig bootup, when no operating system is available; in fact, even the CPU is not available, as it is held in reset.

Luckily, hitting the hardware registers directly is nothing special on the Amiga, as everyone was (is!) doing it, from demo coders (check out the Amiga Demoscene Archive for some amazing demos) to game developers. The Amiga chipset is quite complex, so I won’t go into too much detail about how to set it up. If you want some great tutorials about writing hardware-hitting Amiga software, you might want to check out Photon’s great coding page and Youtube channel; it was certainly a great help for me to freshen up my Amiga coding 😉

So, we have no OS and no CPU; how can we play an animation? Well, the custom chips of the Amiga will help here. The Amiga has some very interesting hardware, but for this animation we are interested in two parts in particular: the Blitter and the Copper. The Blitter’s name comes from Blit, which is short for block image transfer. Simply put, the Blitter is a configurable DMA engine that can transfer images in memory (it is much more capable, but let’s leave it at that). The Copper, short for co-processor, is a very basic processor that only has three commands: MOVE, WAIT and SKIP, but it is also tied to the video beam. Since the Copper can write the Blitter’s registers, together they form a sort of Turing-complete system, certainly capable enough to play back this animation.

Since you don’t want to write this ‘blind’, you need a way to test that everything works while you’re working on it. I used ASMOne, which is a great assembler for the Amiga, and WinUAE, a Windows Amiga emulator. Our program uses the CPU to set up the custom chipset, and once set up, the chipset runs by itself. The CPU part will be replaced with custom code on the minimig, since on minimig the control CPU can write the custom registers while the main CPU is in reset. For the test program, we need a little more setup than is required for minimig, especially saving the enabled interrupts and DMAs, which are restored on exit. The CPU must also copy the minimig logo and the boingball animation data to the proper place in chipram; then it enters a loop waiting for a mouse button press, cleans up and exits. The required part is setting up the bitplane DMAs, copper, screen, blitter and the color values:

SysSetup:
 move.w #$0000,$dff1fc ; FMODE, slow fetch mode for AGA compatibility
 move.w #$0002,$dff02e ; COPCON, enable danger mode
 move.l #Copper1,$dff080 ; COP1LCH, copper 1 pointer
 move.l #Copper2,$dff084 ; COP2LCH, copper 2 pointer
 move.w #$0000,$dff088 ; COPJMP1, restart copper at location 1
 move.w #$2c81,$dff08e ; DIWSTRT, screen upper left corner
 move.w #$f4c1,$dff090 ; DIWSTOP, screen lower right corner
 move.w #$003c,$dff092 ; DDFSTRT, display data fetch start
 move.w #$00d4,$dff094 ; DDFSTOP, display data fetch stop
 ;move.w #$7fff,$dff096 ; DMACON, disable all DMAs
 move.w #$87c0,$dff096 ; DMACON, enable important bits
 move.w #$0000,$dff098 ; CLXCON, TODO
 move.w #$7fff,$dff09a ; INTENA, disable all interrupts
 move.w #$7fff,$dff09c ; INTREQ, clear all interrupt requests
 move.w #$0000,$dff09e ; ADKCON, TODO
 move.w #$a200,$dff100 ; BPLCON0, hires, two bitplanes & colorburst enabled
 move.w #$0000,$dff102 ; BPLCON1, bitplane control scroll value
 move.w #$0000,$dff104 ; BPLCON2, misc bitplane bits
 move.w #$0000,$dff106 ; BPLCON3, TODO
 move.w #$0000,$dff108 ; BPL1MOD, bitplane modulo for odd planes
 move.w #$0000,$dff10a ; BPL2MOD, bitplane modulo for even planes
 move.w #$09f0,$dff040 ; BLTCON0
 move.w #$0000,$dff042 ; BLTCON1
 move.w #$ffff,$dff044 ; BLTAFWM, blitter first word mask for srcA
 move.w #$ffff,$dff046 ; BLTALWM, blitter last word mask for srcA
 move.w #$0000,$dff064 ; BLTAMOD
 move.w #BLITS,$dff066 ; BLTDMOD
 move.w #$0000,$dff180 ; COLOR00
 move.w #$0aaa,$dff182 ; COLOR01
 move.w #$0a00,$dff184 ; COLOR02
 move.w #$0000,$dff186 ; COLOR03
 move.w #(bpl1>>16)&$ffff,$dff0e0 ; BPL1PTH
 move.w #bpl1&$ffff,$dff0e2 ; BPL1PTL
 move.w #(bpl2>>16)&$ffff,$dff0e4 ; BPL2PTH
 move.w #bpl2&$ffff,$dff0e6 ; BPL2PTL
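Some of the magic numbers above are easier to read when decoded. Here is a small Python sketch (the helper names are invented; the bit positions are the standard ones from the Amiga hardware register descriptions) that unpacks the BPLCON0, DMACON and BLTCON0 values used in the setup code.

```python
def decode_bplcon0(value):
    return {
        "hires": bool(value & 0x8000),       # bit 15
        "bitplanes": (value >> 12) & 7,      # bits 14-12, BPU
        "color": bool(value & 0x0200),       # bit 9, colorburst enable
    }


DMACON_BITS = {10: "BLTPRI", 9: "DMAEN", 8: "BPLEN", 7: "COPEN",
               6: "BLTEN", 5: "SPREN", 4: "DSKEN"}


def decode_dmacon(value):
    return {
        "set": bool(value & 0x8000),         # bit 15: 1 = set the listed bits
        "bits": [name for bit, name in DMACON_BITS.items() if value & (1 << bit)],
    }


def decode_bltcon0(value):
    channels = [name for bit, name in ((11, "A"), (10, "B"), (9, "C"), (8, "D"))
                if value & (1 << bit)]
    return {"shift": value >> 12, "channels": channels, "minterm": value & 0xFF}


print(decode_bplcon0(0xA200))
print(decode_dmacon(0x87C0))
print(decode_bltcon0(0x09F0))    # channels A and D, minterm $f0: plain copy D = A
```

So the setup enables bitplane, copper and blitter DMA (but no sprites), selects a two-bitplane hires screen, and programs the blitter for a straight A-to-D copy.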

We set up the space for the bitplanes at $80000:

 ORG $80000
 EVEN
Screen:
bpl1:
 dcb.b BPLSIZE
bpl1E:
bpl2:
 dcb.b BPLSIZE
bpl2E:

Most of the work is done by the copper and blitter. Since the minimig logo is fixed in place, it only needs to be moved to the proper position in the bitplanes once. The rotating ball is also not moving around, so there is no need to clear the bitplanes; we just write the new data over the old. If we want to show the boingball animation at the correct speed, one frame of the animation must be shown for 5 minimig frames (minimig has a refresh rate of 50Hz for PAL). That means quite a long copper list, moving the copper pointer every frame and the blitter pointers every five frames, since we don’t have a CPU to do any of that. Below is the copper code for a single frame of animation, spanning 5 Amiga screen refreshes:

 EVEN
Copper2:
c2f00:
 dc.w $0050,(f0p0>>16)&$ffff
 dc.w $0052,(f0p0)&$ffff
 dc.w $0054,((bpl1+BALLOFF)>>16)&$ffff
 dc.w $0056,((bpl1+BALLOFF))&$ffff
 dc.w $0058,(BLITH<<6+BLITW)
 dc.w $0107,$7ffe
 dc.w $0050,(f0p1>>16)&$ffff
 dc.w $0052,(f0p1)&$ffff
 dc.w $0054,((bpl2+BALLOFF)>>16)&$ffff
 dc.w $0056,((bpl2+BALLOFF))&$ffff
 dc.w $0058,(BLITH<<6+BLITW)
 dc.w $0084,(c2f01>>16)&$ffff
 dc.w $0086,(c2f01)&$ffff
 dc.w $ffff,$fffe
c2f01:
 dc.w $0084,(c2f02>>16)&$ffff
 dc.w $0086,(c2f02)&$ffff
 dc.w $ffff,$fffe
c2f02:
 dc.w $0084,(c2f03>>16)&$ffff
 dc.w $0086,(c2f03)&$ffff
 dc.w $ffff,$fffe
c2f03:
 dc.w $0084,(c2f04>>16)&$ffff
 dc.w $0086,(c2f04)&$ffff
 dc.w $ffff,$fffe
c2f04:
 dc.w $0084,(c2f10>>16)&$ffff
 dc.w $0086,(c2f10)&$ffff
 dc.w $ffff,$fffe

This code is repeated eight times, once for each frame of the animation.
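Writing eight nearly identical copies by hand is error-prone, so a list like this is a natural candidate for generating. The Python sketch below reproduces the layout of the listing above for any frame number; it is an illustration rather than the tool actually used, and it assumes that the last sub-list of the last frame jumps back to c2f00 so that the animation loops.

```python
def ptr_moves(reg_hi, expr):
    """Two copper MOVEs loading a 32bit pointer into a high/low register pair."""
    return [f" dc.w ${reg_hi:04x},({expr}>>16)&$ffff",
            f" dc.w ${reg_hi + 2:04x},({expr})&$ffff"]


def copper_frame(frame, n_frames=8, refreshes=5):
    """Copper source lines for one animation frame shown for `refreshes` screens."""
    def label(f, r):
        return f"c2f{f % n_frames}{r}"

    lines = []
    for r in range(refreshes):
        lines.append(label(frame, r) + ":")
        if r == 0:
            for plane in (0, 1):
                lines += ptr_moves(0x0050, f"f{frame}p{plane}")          # BLTAPT
                lines += ptr_moves(0x0054, f"(bpl{plane + 1}+BALLOFF)")  # BLTDPT
                lines.append(" dc.w $0058,(BLITH<<6+BLITW)")             # BLTSIZE
                if plane == 0:
                    lines.append(" dc.w $0107,$7ffe")    # wait for the blitter
        # point COP2LC at the list for the next refresh (assumed to wrap around)
        nxt = label(frame, r + 1) if r + 1 < refreshes else label(frame + 1, 0)
        lines += ptr_moves(0x0084, nxt)
        lines.append(" dc.w $ffff,$fffe")                # end of this list
    return lines


source = "\n".join(line for f in range(8) for line in copper_frame(f))
print(source.splitlines()[0], "...", len(source.splitlines()), "lines")
```

With eight frames shown for five refreshes each, one full loop of the ball takes 40 refreshes, or 0.8 seconds at 50Hz.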

So, after these two long posts, does it work at all? It sure does:

The whole AsmOne source code is here:

The BoingBall, Part 1

I’ve received some questions about how I made the new minimig logo with the rotating checkered ball, so I thought I’d write something about it for my first post.

The BoingBall is quite famous in the Amiga land, as it was featured in one of the first demos made for the Amiga computer to demonstrate its capabilities at the Consumer Electronics Show in January 1984. It was later used as an official logo of the Amiga. Its roots go even further back, but enough with history. You can watch this video describing how the BoingBall demo came about:


At the time I was replacing the boot code in minimig, and I thought ‘why not add something more dynamic to the boot screen?’ So the idea to use the iconic BoingBall animation was born.

When I was younger, I played around with 3D animation and modeling a lot, mostly with Lightwave on the Amiga and later with 3D Studio running in DOS on a PC. But for this task, a much simpler renderer was used, one that is probably best suited for this task, also one of my favourites – the freely-available POV-Ray. POV-Ray doesn’t have a modeler, you describe the scene in a text file, but for such a simple scene, you wouldn’t need a full-featured modeler anyway.

I won’t go into all the features of the POV-Ray language; I’ll just describe the basics needed for the BoingBall animation. You need three things in a basic scene like this:

  • an object to render
  • a camera, preferably looking at your object
  • lights

You place the camera with a location and a look_at statement:

camera
{
  location <0,0,-6>
  look_at <0,0,0>
}

Next, lights – I used both ambient and omni lights around the ball:

#include "colors.inc" // provides the White and Red color names used below
global_settings { ambient_light color White }
light_source { <+0, +0, -6> color White }
light_source { <+0, +0, +6> color White }
light_source { <+0, -6, +0> color White }
light_source { <+0, +6, +0> color White }
light_source { <-6, +0, +0> color White }
light_source { <+6, +0, +0> color White }

And then comes the object of interest – a sphere with a red-white checkered pattern:

sphere {
  // placement and size
  <0, 0, 0>, 2
  // texture
  texture {
    pigment {
      // red / white checker pattern
      checker color Red, color White
    }
  }
}

The sphere’s texture needed some fixing of the scale and a warp pattern modifier that wraps the checkered pattern around a sphere:

      // the x-y scaling is a bit off, fixing it together with size
      scale <1.5/pi, 1.0/pi, 1>*0.25
      warp { spherical orientation y dist_exp 1 }

All that is needed now is the animation:

  // rotate (rotate after texture!)
  // 1/8th of full rotation so the texture aligns for a nice animation
  rotate<0,360/8*(clock+0.00),0>
  // adding a slight tilt (after animation rotation!)
  rotate <0,0,-15>

The order of these steps is important in this case – first you apply the texture, then the rotation, and the slight tilt last; otherwise the result will not be what you wanted. The clock in the last code snippet is the animation parameter (for a repeating animation you need to calculate how much the ball must rotate in a desired number of frames).
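That calculation can be sketched in Python. The sketch assumes an eight-frame animation and POV-Ray’s cyclic clock behaviour, where the clock runs from 0 towards 1 but the last frame stops one step short of 1 so that it does not duplicate the first frame; the function names are made up for this example.

```python
def frame_clock(frame, initial=1, final=8, cyclic=True):
    """Clock value for a frame, stopping one step short of 1.0 when cyclic."""
    steps = final - initial + (1 if cyclic else 0)
    return (frame - initial) / steps


def ball_rotation_deg(clock, pattern_repeats=8):
    """Rotation about the y axis, matching rotate <0, 360/8*clock, 0>."""
    return 360 / pattern_repeats * clock


angles = [ball_rotation_deg(frame_clock(f)) for f in range(1, 9)]
print(angles)
```

Each frame turns the ball by a further 5.625 degrees; a ninth frame would land on 45 degrees, where the checker pattern lines up with frame one again, so the eight rendered frames loop without a visible jump.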

To render this scene, you give POV-Ray some parameters, preferably in an .ini file:

[boingball]
Input_File_Name=boingball.pov
Width=800
Height=600
Antialias=On
Antialias_Threshold=0.3
Initial_Frame=1
Final_Frame=8
Cyclic_Animation = on

And the result:

boingball1

Once you have all of the frames of the animation, you need to parse the image data, transfer it to the Amiga and write a copper list to animate it (yes, the boot logo doesn’t use the CPU, the animation runs using blitter and copper only). But that is a story for another post.

 

Continued in BoingBall, Part 2