CPU (from: https://www.flickr.com/photos/2top/10402551773/)

qSoC – The OR1200 CPU

The OR1200 is a RISC-type, Harvard architecture (separate instruction and data buses) synthesizable CPU core, written by the OpenCores community.

It can be configured with a number of optional components, such as cache, MMU, FPU, timer, programmable interrupt controller, debug unit, etc. For sake of simplicity, I decided to disable most of the optional components, except the hardware multiplier and divider, all other features will be added if/when needed.

The OR1200 has standard GNU tools available, like gcc, binutils, and a few standard libraries, like uClibc and newlib. There is also an official port of linux available.

One of the requirements was that the CPU would be able to run with a 50MHz frequency on a CycloneIII – class FPGA. The hardware divider implementation used in OR1200 is a 8-cycle divider, which was on a critical timing path, and needed to be reduced to a 16-cycle divider, which, while slow, is still heaps faster than a software division implementation. The changes are in this commit: 01a18ffbbb86b074. There’s a testbench for the updated division code in this commit: 8232f830c60a1166.

There are other settings (defines) available in the or1200_defines.v file, you can tune some of them to get a better utilization or speed from the code. One of the things that I changed to get some more speed was the type of the compare used in the ALU (the change is in this commit: 337217f9511888c2).

The OR1200 version used here is not exactly in sync with the official OR1200 repository, as this core was split from the original a long time ago and changed a lot and the changes didn’t propagate back to the original repo. One of the important changes, besides some bug fixes, is the different bus – QMEM, replacing the original Wishbone bus.

With most of the optional components removed, and with the QMEM bus, the OR1200, as used in this project, is a nice, small but fast little softcore CPU.

SoC

qSoC, or how to build an FPGA SoC from scratch

Introduction

I’d like to talk about how to build a fast, lean, clean SoC machine. What is an SoC? An SoC, or a System On a Chip, is, simply put, a microprocessor with some common peripherals attached, like ROM, RAM, SDRAM controller, UART, SPI, GPIO, and other I/O ports or protocols, all tied up into a system. An SoC is not unlike a microcontroller, which also has a microprocessor bundled with some sort of memory and I/O, the distinction is more or less cosmetical.

An SoC can be used standalone, running barebone or a form of an operating system and communicating with the world outside of the FPGA, or alternatively, it can be used as a part of a larger system, incorporating bigger logic blocks and acting as a sort of support for it. The latter will be the focus of this and following posts.

What I want from the SoC is small footprint (small consumption of FPGA resources), good processing speed, and a well-defined interface to other parts of the FPGA or the outside world. What I have in mind will need at least these components:

  • CPU
  • a well-defined bus interface
  • ROM for the bootloader or the whole firmware
  • RAM, either internal or external SDRAM
  • ‘registers’, which is a catch-all phrase for access to memory-mapped I/O and other functions

CPU

So, let’s start with the CPU. There are a bunch of open-source CPU architectures available, and I’ll definitely want to try and incorporate more than just one into this SoC, so there’s a choice of power vs. resource consumption. But for now, I’ll start with the OpenCores OR1200 (or a slightly modified version of it), since that is the CPU core I’m most familiar with. Later on, I’ll definitely look into adding at least an AVR, an ARM and a RISC-V variants to this SoC.

The OR1200 is a nice, small and relatively simple Harvard architecture RISC CPU, loosely based on the MIPS architecture and instruction set. It is pretty configurable, so you can for example disable the hardware multiplication and division support to save logic gates, and run a software implementation instead. It also has support for cache and MMU, and an optional FPU unit, but I’m going to strip it of all those (unnecessary) addons and just keep the CPU core.

Bus

Next on the list is the bus interface. In my opinion, a well-defined bus interface is a cornerstone of a good SoC design, and must not be overlooked or brushed over quickly. All of the components in an SoC will communicate through this bus, so it best that it is well-designed and thought-through at the beginning, so there are no strange errors popping up all of a sudden, if you add a component somewhere down the line. There are quite some bus architectures intended for an SoC interconnect to choose from, like Wishbone, APB, AHB, etc., but I chose the QMEM bus, especially for its simplicity and speed. As will be explained later, the QMEM bus is not much more than a standard synchronous memory bus with added flow control and optional tags attached to it.

Memory

There’s not much to say about memory, besides that it is needed. At the very least, a small ROM is needed for the bootloader or a sort of a monitor program, that can write and read to memory and registers, and load firmware from the serial port or SD card, but I’ll talk about that later. Besides the ROM, the CPU will need some form of RAM, either from FPGA’s internal memory blocks, or an external memory like SRAM or SDRAM. The SoC should preferably support all of these variants.

I/O

The SoC will need support for some standard external communication protocols. The most important one is an UART, or a serial port, which can be used for debugging, controlling the SoC system with the help of the monitor firmware, uploading of new firmwares, etc. Another very useful protocol is an SPI master, so the SoC can talk to an SD card and load files from it. A GPIO (general purpose I/O controller) can be added to the SoC, so a range of pins can be controlled for digital I/O. I plan to add support for many other I/O channels later on, like a VGA controller, audio output, including sigma-delta DAC and DAC IC support, and other such interfaces.

Other important components

The SoC needs a couple of other standard components, like clock management, reset synchronization module, an interrupt controller, and a timer or two.

Frequency considerations

Another thing to consider is at what frequencies the SoC should run. Usually, in an ASIC product, this would need to be a balance between power consumption and processing speed, limited by the particular process node limitations. Luckily, for an FPGA project, power consumption is not so important, especially taking into account the large static power consumption of FPGAs, so the frequency can be more easily selected based on actual needs.

Personally, I like the trio of 25/50/100MHz frequencies, and I’ll explain what I mean and why.

An SoC will very probably contain an SDRAM controller, and the 100MHz is the max frequency of many SDRAM ICs. Of course, many can run at higher frequencies, up to 166MHz, but the 100MHz is a safe bet to work with any SDRAM IC.

Next up, the 50MHz is in my opinion a nice operating speed for most of the logic in the FPGA, as it nicely balances the required flip-flops for the asynchronous logic to work at this frequency, without wasting a lot of LEs. Any CPU for the qSoC should be capable to work (at least) at this frequency.

The 25MHz is the maximum speed an SD card operates over the SPI bus, and the 25MHz clock can be generated from toggling flip-flops running on the 50MHz clock obviously.

The 50MHz and 25MHz are also frequencies that can be used as pixel clocks for two VGA video resolutions: a 640×480 VGA with 25MHz pixel clock and a 800×600 SVGA with 50MHz clock.

There’s another benefit to using the frequencies that are nice multiples – you can program the PLL in the FPGA to generate these frequencies that are synchronous to each other, which means that you don’t have to use clock-domain crossing logic, since all three clocks can be edge-aligned. This way, you can save gates, plus you don’t have to deal with the headaches that CDC will definitely bring.

Code repository & structure

The repository for this SoC experiment (which I named qSoC for Qmem SoC) is here:

https://github.com/rkrajnc/qsoc

Currently, there’s only OR1200 CPU RTL code there, together with some QMEM modules and a testbench for the OR1200 divider.

I like to keep all files in the repository nicely organized into these directories:

  • rtl – for all common synthesizable Verilog / VHDL code
  • fw – for all common CPU firmware code
  • tools – for all tools / scripts needed for building / converting etc. files needed for the qSoC
  • bench – for Verilog top benchmark files
  • ver – for any scripts needed for verification / benchmarking
  • fpga – for any FPGA board specific files, like Quartus or ISE project files, sdc files etc.

The directory structure might change in the future, as I’d like to keep the projects I build with this SoC in the same repo, so I’ll probably have to add a project directory, with any project-specific files, but I’ll cross that bridge when I get to it.

Boards

I plan to support at least these boards:

More could be added in the future.

Planned projects

The first project I plan to make is a simple WAV audio player, so I’ll be able to test the sigma-delta implementation used in the minimig. After that, probably something involving a VGA controller, a character generator, perhaps a whole system capable of running simple 2D SDL games. Another interest is definitely some sound processing / generation projects, like a MIDI synthesizer, or an FPGA guitar effect. We’ll see.

 

Coming next up, a few more words on the selected CPU and its modifications.

The BoingBall, Part 2

So, we have rendered frames of our BoingBall, as described in Part 1. What you want to do next is check if the animation is seamless – that is, if it loops without glitches. Best way is to convert the frames into a .gif or .avi animation (in my experience GIF works better) with your favorite photo / video editor. You will also need to reduce the number of colors of your frames, I used just two colors (white & red) + background. I needed a few tries to get this right:

BoingBall animation

Once the animation looks good, you need to first decide on what kind of Amiga resolution the animation will be shown – either hires or lores. The difference is that on hires the pixels are not square, but are twice as high as they are long, so you need to ‘squash’ the ball vertically:

BoingBall_128x64

Now you need to somehow convert the animation data to be used on the Amiga. You have to remember that contrary to PCs of today which use so called ‘chunky‘ pixels, Amiga uses a planar graphics display, which splits each bit of the pixel data into a separate plane – that allows it to save memory & bandwidth, if for example you only needed two colors you would only require one bit per pixel instead of 8 (or even 16 or 32!).

For the conversion, I wrote a simple Python script which takes a GIF animation, splits it into frames, splits the frames into bitplanes and writes the result into a source file that can be used in AsmOne assembler on the Amiga. The script is attached bellow.

I also made a minimig logo, which you can see below:

Minimig logo

Now comes the fun part – writing a program that will show this animation on the Amiga – or in this case, the Minimig board. We want this animation to work very early in the Minimig bootup, when no operating system is available, actually even the CPU is not available as it is in reset.

Luckily, hitting the hardware registers directly is nothing special on the Amiga as everyone was (is!) doing it, from demo coders (check out the Amiga Demoscene Archive for some amazing demos) to games. The Amiga chipset is quite complex, so I won’t go into too much details about how to set it up. If you want some great tutorials about writing hardware-hitting Amiga software, you might want to check out Photon’s great coding page and Youtube channel, it was certainly a great help for me to freshen up my Amiga coding 😉

So, we have no OS and no CPU, how can we play an animation? Well, the custom chips of the Amiga will help here. Amiga has some very interesting hardware, but for this animation we are interested in two particularly: the Blitter and the Copper. The Blitter’s name comes from Blit, which is short for block image transfer. Simply put, the Blitter is a configurable DMA engine that can transfer images in memory (it is much more capable, but let’s leave it at that). The Copper, short for co-processor, is a very basic processor that only has three commands: MOVE, WAIT and SKIP, but it is also tied to the video beam. Since the Copper can write Blitter’s registers, they together form a sort of a Turing-complete system, certainly capable enough to play back this animation.

Since you don’t want to write this ‘blind’, you need a way to test that everything works while you’re working on it. I used ASMOne, which is a great assembler for the Amiga, and WinUAE, a windows Amiga emulator. Our program is using the CPU to set up the custom chipset, and once set up, the chipset runs by itself. The CPU part will be replaced with custom code on the minimig, since on minimig the control CPU can write custom registers when the CPU is in reset. For the test program, we need a little more setup than is required for minimig, especially saving enabled interrupts and DMAs, which are restored on exit. The CPU must also copy the minimig logo and the boingball animation data to the proper place in chipram, then it enters a loop waiting for the mouse button press, cleans up and exits. The required part is setting up bitplane DMAs, copper, screen, blitter and the color values:

SysSetup:
 move.w #$0000,$dff1fc ; FMODE, slow fetch mode for AGA compatibility
 move.w #$0002,$dff02e ; COPCON, enable danger mode
 move.l #Copper1,$dff080 ; COP1LCH, copper 1 pointer
 move.l #Copper2,$dff084 ; CPO2LCH, copper 2 pointer
 move.w #$0000,$dff088 ; COPJMP1, restart copper at location 1
 move.w #$2c81,$dff08e ; DIWSTRT, screen upper left corner
 move.w #$f4c1,$dff090 ; DIWSTOP, screen lower right corner
 move.w #$003c,$dff092 ; DDFSTRT, display data fetch start
 move.w #$00d4,$dff094 ; DDFSTOP, display data fetch stop
 ;move.w #$7fff,$dff096 ; DMACON, disable all DMAs
 move.w #$87c0,$dff096 ; DMACON, enable important bits
 move.w #$0000,$dff098 ; CLXCON, TODO
 move.w #$7fff,$dff09a ; INTENA, disable all interrupts
 move.w #$7fff,$dff09c ; INTREQ, disable all interrupts
 move.w #$0000,$dff09e ; ADKCON, TODO
 move.w #$a200,$dff100 ; BPLCON0, two bitplanes & colorburst enabled
 move.w #$0000,$dff102 ; BPLCON1, bitplane control scroll value
 move.w #$0000,$dff104 ; BPLCON2, misc bitplane bits
 move.w #$0000,$dff106 ; BPLCON3, TODO
 move.w #$0000,$dff108 ; BPL1MOD, bitplane modulo for odd planes
 move.w #$0000,$dff10a ; BPL2MOD, bitplane modulo for even planes
 move.w #$09f0,$dff040 ; BLTCON0
 move.w #$0000,$dff042 ; BLTCON1
 move.w #$ffff,$dff044 ; BLTAFWM, blitter first word mask for srcA
 move.w #$ffff,$dff046 ; BLTALWM, blitter last word mask for srcA
 move.w #$0000,$dff064 ; BLTAMOD
 move.w #BLITS,$dff066 ; BLTDMOD
 move.w #$0000,$dff180 ; COLOR00
 move.w #$0aaa,$dff182 ; COLOR01
 move.w #$0a00,$dff184 ; COLOR02
 move.w #$0000,$dff186 ; COLOR03
 move.w #(bpl1>>16)&$ffff,$dff0e0 ; BPL1PTH
 move.w #bpl1&$ffff,$dff0e2 ; BPL1PTL
 move.w #(bpl2>>16)&$ffff,$dff0e4 ; BPL2PTH
 move.w #bpl2&$ffff,$dff0e6 ; BPL2PTL

We set up the space for the bitplanes at $80000:

 ORG $80000
 EVEN
Screen:
bpl1:
 dcb.b BPLSIZE
bpl1E:
bpl2:
 dcb.b BPLSIZE
bpl2E:

Most of the work is done with the copper and blitter. Since the minimig logo is fixed in place, it only needs moving to the proper position in the bitplanes. The rotating ball is also not moving around, so there is no need to clear the bitplanes, we just write new data over the old one. If we want to show the boingball animation with the correct speed, one frame of the animation must be shown for 5 minimig frames (minimig has a refresh rate of 50Hz for PAL). That means quite a long copper list, moving the copper pointer around each frame and the blitter pointer every five frames, since we don’t have a CPU to do any of that. Below is copper code for a single frame of animation, spanning 5 Amiga screen refreshes:

 EVEN
Copper2:
c2f00:
 dc.w $0050,(f0p0>>16)&$ffff
 dc.w $0052,(f0p0)&$ffff
 dc.w $0054,((bpl1+BALLOFF)>>16)&$ffff
 dc.w $0056,((bpl1+BALLOFF))&$ffff
 dc.w $0058,(BLITH<<6+BLITW)
 dc.w $0107,$7ffe
 dc.w $0050,(f0p1>>16)&$ffff
 dc.w $0052,(f0p1)&$ffff
 dc.w $0054,((bpl2+BALLOFF)>>16)&$ffff
 dc.w $0056,((bpl2+BALLOFF))&$ffff
 dc.w $0058,(BLITH<<6+BLITW)
 dc.w $0084,(c2f01>>16)&$ffff
 dc.w $0086,(c2f01)&$ffff
 dc.w $ffff,$fffe
c2f01:
 dc.w $0084,(c2f02>>16)&$ffff
 dc.w $0086,(c2f02)&$ffff
 dc.w $ffff,$fffe
c2f02:
 dc.w $0084,(c2f03>>16)&$ffff
 dc.w $0086,(c2f03)&$ffff
 dc.w $ffff,$fffe
c2f03:
 dc.w $0084,(c2f04>>16)&$ffff
 dc.w $0086,(c2f04)&$ffff
 dc.w $ffff,$fffe
c2f04:
 dc.w $0084,(c2f10>>16)&$ffff
 dc.w $0086,(c2f10)&$ffff
 dc.w $ffff,$fffe

This code is repeated eight times for each frame of the animation.

So, after these two long posts, does it work at all? It sure does:

Whole AsmOne source code is here:

The BoingBall, Part 1

I’ve received some questions about how I made the new minimig logo with the rotating checkered ball, so I thought I’d write something about it for my first post.

The BoingBall is quite famous in the Amiga land, as it was featured in one of the first demos made for the Amiga computer to demonstrate its capabilities at the Consumer Electronics Show in January 1984. It was later used as an official logo of the Amiga. Its roots go even further back, but enough with history. You can watch this video describing how the BoingBall demo came about:


At the time I was replacing the boot code in minimig, and I thought ‘why not add something more dynamic to the boot screen?’ So the idea to use the iconic BoingBall animation was born.

When I was younger, I played around with 3D animation and modeling a lot, mostly with Lightwave on the Amiga and later with 3D Studio running in DOS on a PC. But for this task, a much simpler renderer was used, one that is probably best suited for this task, also one of my favourites – the freely-available POV-Ray. POV-Ray doesn’t have a modeler, you describe the scene in a text file, but for such a simple scene, you wouldn’t need a full-featured modeler anyway.

I won’t go into all the features of POV-Ray language, I’ll just describe the basics needed for the BoingBall animaton. You need three things in a basic scene like this:

  • an object to render
  • a camera, preferably looking at your object
  • lights

You place the camera with a location and a direction statement :

camera
{
  location <0,0,-6>
  look_at <0,0,0>
}

Next, lights – I used both ambient and omni lights around the ball:

global_settings { ambient_light color White }
light_source { <+0, +0, -6> color White }
light_source { <+0, +0, +6> color White }
light_source { <+0, -6, +0> color White }
light_source { <+0, +6, +0> color White }
light_source { <-6, +0, +0> color White }
light_source { <+6, +0, +0> color White }

And then comes the object of interest – a sphere with a red-white checkered pattern:

sphere {
  // placement and size
  <0, 0, 0>, 2
  // texture
  texture {
    pigment {
      // red / white checker pattern
      checker color Red, color White
    }
  }
}

The sphere’s texture needed some fixing of the scale and a warp pattern modifier that wraps the checkered pattern around a sphere:

      // the x-y scaling is a bit off, fixing it together with size
      scale <1.5/pi, 1.0/pi, 1>*0.25
      warp { spherical orientation y dist_exp 1 }

All that is needed now is the animation:

  // rotate (rotate after texture!)
  // 1/8th of full rotation so the texture aligns for a nice animation
  rotate<0,360/8*(clock+0.00),0>
  // adding a slight tilt (after animation rotation!)
  rotate <0,0,-15>

The order of these steps is important in this case – first you apply the texture, than the rotation and the slight tilt the last, otherwise the result will not be what you desired. The clock in the last code snippet is the animation parameter (for a repeating animation you need to calculate how much the ball must rotate in a desired number of frames).

To render this scene, you give POV-Ray some parameters, preferably in an .ini file:

[boingball]
Input_File_Name=boingball.pov
Width=800
Height=600
Antialias=On
Antialias_Threshold=0.3
Initial_Frame=1
Final_Frame=8
Cyclic_Animation = on

And the result:

boingball1

Once you have all of the frames of the animation, you need to parse the image data, transfer it to the Amiga and write a copper list to animate it (yes, the boot logo doesn’t use the CPU, the animation runs using blitter and copper only). But that is a story for another post.

 

Continued in BoingBall, Part 2