qSoC – The QMEM bus

QMEM bus specification

This post describes the QMEM bus, the different cycles allowed, the bus elements and different bus configurations supported.

1. Introduction
2. Features
3. Signals Description
4. Cycles Description
5. Bus Elements
6. Bus Configurations

Introduction

QMEM (short for quick memory) is a flexible, portable, simple and fast system interconnect bus, specifically targeted at SoC systems for their on-chip communication needs.

QMEM is based on a synchronous memory bus with added flow control signals, which makes it very simple and fast. QMEM originates from the OR1200 open-source CPU implementation, where it was used as a tightly-coupled memory (TCM) bus inside the CPU.

Features

  • flexible: endian-independent, supports different data and address widths, flexible access speeds, flexible interconnect methods like point-to-point, shared bus, multi-layered interconnect
  • portable: fully vendor-, tool-, language- and technology independent
  • simple: based on a synchronous memory bus with added flow control, so it needs only minimal bus interconnect logic
  • fast: fully pipelined, single cycle reads and writes, with no setup or end cycles
  • extensible: allows any number of transfer tags added to support different features, like master identification, slave error reporting, etc

Signals description

  • cs – master output signal denoting a valid master cycle when cs='1', and an idle cycle when cs='0'
  • we – master read/write select signal, denotes a write cycle when we='1', and a read cycle when we='0'
  • sel – master byte select signal, one bit for each byte in the data word, generally only taken into account during write cycles and ignored during read cycles; sel='1111' denotes four bytes, sel='0011' denotes the lower two bytes, sel='0100' denotes a single byte
  • adr – master address signal
  • dat_w – master write data
  • dat_r – slave read data
  • ack – slave cycle acknowledge, asserted and valid when master cs='1'

All signals are active-high. Optionally, clock (clk) and reset (rst) signals can be considered part of the QMEM bus, especially if the bus uses a different clock domain than the rest of the master or slave logic. Other common optional signals are the slave error response (err) and the master id signal (mid).

The master can start a cycle at any time (synchronously to the clock) by asserting cs='1' and setting the other signals as appropriate. The cycle ends with the slave acknowledge (ack='1'). After the slave acknowledges the cycle, the master is free to start a new cycle immediately by keeping cs='1', or to go idle by deasserting cs (cs='0').

Cycles description

Reset condition

In the reset state (rst='1'), all QMEM bus signals should be ignored, and their state can be undefined. The first cycle out of reset should be an IDLE cycle.

IDLE cycle

An IDLE cycle is denoted by master cs='0' and slave ack='0'. There is no activity on the bus, other than possibly the slave read data (dat_r), if the previous cycle was a read cycle.

QMEM reset state & idle cycle

WRITE cycles

A WRITE cycle is denoted when the master asserts cs='1' and we='1', and puts the desired address on adr, the byte select on sel and the data to be written on dat_w. The master must not change any of its signals, or stop the cycle by deasserting cs (cs='0'), without receiving the slave acknowledge (ack='1') first. The slave can insert any number of delay cycles by holding ack='0' while the master asserts cs='1', until it is ready to service the master’s request. The master can start a new cycle immediately after synchronously detecting the slave acknowledge response (ack='1').

QMEM single write cycle with no delay

QMEM single write cycle with 1 cycle delay

QMEM multiple write cycles with no delay

QMEM multiple write cycles with 1 cycle delay

READ cycles

A READ cycle is denoted when the master asserts cs='1' and we='0', and puts the desired address on adr and the byte select on sel. The master must not change any of its signals, or stop the cycle by deasserting cs (cs='0'), without receiving the slave acknowledge (ack='1') first. The slave can insert any number of delay cycles by holding ack='0' while the master asserts cs='1', until it is ready to service the master’s request. The master can start a new cycle immediately after synchronously detecting the slave acknowledge response (ack='1').

QMEM single read cycle with no delay

QMEM single read cycle with 1 cycle delay

QMEM multiple read cycles with no delay

QMEM multiple read cycles with 1 cycle delay

MIXED cycles

A QMEM master is free to mix READ, WRITE and IDLE cycles any way it chooses. Since reads are pipelined, the slave must be ready to accept a master WRITE cycle in the same clock period in which it is still returning the data for the previous cycle’s read request.

QMEM mixed cycles with no delay

QMEM mixed cycles with 1 cycle delay

ERROR cycle

The QMEM bus has an optional err signal. The slave must keep this signal low (err='0') unless it wishes to communicate an error condition to the master. It does so by asserting err='1' at the same time as it acknowledges the cycle with ack='1'. Usually, the error signal is high if the slave is in reset. Another case where a slave might raise the error condition is when the master tries to access an address that lies beyond the size of the slave’s memory or register space.

QMEM error cycle

QMEM bus elements

QMEM bus has four major bus components: masters, slaves, arbiters and decoders.

QMEM master

A QMEM master is a master device on the bus. It starts cycles, sets the bus direction and the number of bytes affected, and either supplies the data to be written or reads data from slaves.

An example of a QMEM master (non-synthesizable) can be seen here: qmem_master.v

QMEM slave

A QMEM slave responds to master cycles, either writing the master data to its memory or registers, or reading its memory or registers and sending them to the master.

An example of a QMEM slave (non-synthesizable) can be seen here: qmem_slave.v

QMEM arbiter

An arbiter is a bus element that decides which master has access to a slave device in a given cycle. Each QMEM slave that is accessed by multiple masters must have an arbiter. An arbiter can grant masters access to the slave on a priority basis, it can use a round-robin scheme, or it can combine the two.

An example of a priority-based QMEM arbiter (synthesizable) can be seen here: qmem_arbiter.v

QMEM decoder

A decoder is a bus element that directs master requests to the appropriate slave, based on a slave decoding scheme, which is usually address-based. Each QMEM master that accesses multiple slaves must have a decoder. This only applies to immediate connections, so in a shared bus configuration where the masters connect to a single arbiter (= single slave), there is no need for a decoder on the master’s side (the slave side of the arbiter in this bus configuration could still have a slave decoder attached, if there are multiple slaves).

An example of a QMEM decoder (synthesizable) can be seen here: qmem_decoder.v

There are other elements that can be attached to a QMEM bus, like a bus register stage, which can register the master side, the slave side, or both sides of the bus, a bus monitor that validates bus signals according to the rules, and bus converters, which can convert between QMEM and other bus architectures, like APB, AHB and Wishbone. These still need to be written, so I’ll write a

QMEM bus configurations

A QMEM bus can be built in multiple different configurations, depending on the speed or logic utilization needs. Some of the common configurations are: shared bus, point-to-point and multilayer.

In the following graphs, mX represents masters, aX arbiters, dX decoders and sX slaves.

QMEM shared bus

QMEM point-to-point bus

QMEM multilayer bus

CPU (from: https://www.flickr.com/photos/2top/10402551773/)

qSoC – The OR1200 CPU

The OR1200 is a RISC-type, Harvard architecture (separate instruction and data buses) synthesizable CPU core, written by the OpenCores community.

It can be configured with a number of optional components, such as cache, MMU, FPU, timer, programmable interrupt controller, debug unit, etc. For the sake of simplicity, I decided to disable most of the optional components, except the hardware multiplier and divider. All other features will be added if/when needed.

The OR1200 has standard GNU tools available, like gcc and binutils, and a few standard libraries, like uClibc and newlib. There is also an official port of Linux available.

One of the requirements was that the CPU should be able to run at 50MHz on a Cyclone III class FPGA. The hardware divider implementation used in the OR1200 is an 8-cycle divider, which was on a critical timing path and had to be changed to a 16-cycle divider, which, while slower, is still heaps faster than a software division implementation. The changes are in this commit: 01a18ffbbb86b074. There’s a testbench for the updated division code in this commit: 8232f830c60a1166.

There are other settings (defines) available in the or1200_defines.v file; you can tune some of them to get better utilization or speed from the code. One of the things that I changed to get some more speed was the type of the compare used in the ALU (the change is in this commit: 337217f9511888c2).

The OR1200 version used here is not exactly in sync with the official OR1200 repository, as this core was split from the original a long time ago and changed a lot and the changes didn’t propagate back to the original repo. One of the important changes, besides some bug fixes, is the different bus – QMEM, replacing the original Wishbone bus.

With most of the optional components removed, and with the QMEM bus, the OR1200, as used in this project, is a nice, small but fast little softcore CPU.

SoC

qSoC, or how to build an FPGA SoC from scratch

Introduction

I’d like to talk about how to build a fast, lean, clean SoC machine. What is an SoC? An SoC, or a System On a Chip, is, simply put, a microprocessor with some common peripherals attached, like ROM, RAM, SDRAM controller, UART, SPI, GPIO, and other I/O ports or protocols, all tied up into a system. An SoC is not unlike a microcontroller, which also has a microprocessor bundled with some sort of memory and I/O; the distinction is more or less cosmetic.

An SoC can be used standalone, running bare-metal code or some form of operating system and communicating with the world outside of the FPGA, or alternatively, it can be used as a part of a larger system, incorporating bigger logic blocks and acting as a sort of support for them. The latter will be the focus of this and the following posts.

What I want from the SoC is small footprint (small consumption of FPGA resources), good processing speed, and a well-defined interface to other parts of the FPGA or the outside world. What I have in mind will need at least these components:

  • CPU
  • a well-defined bus interface
  • ROM for the bootloader or the whole firmware
  • RAM, either internal or external SDRAM
  • ‘registers’, which is a catch-all phrase for access to memory-mapped I/O and other functions

CPU

So, let’s start with the CPU. There are a bunch of open-source CPU architectures available, and I’ll definitely want to try and incorporate more than just one into this SoC, so there’s a choice of power vs. resource consumption. But for now, I’ll start with the OpenCores OR1200 (or a slightly modified version of it), since that is the CPU core I’m most familiar with. Later on, I’ll definitely look into adding at least an AVR, an ARM and a RISC-V variant to this SoC.

The OR1200 is a nice, small and relatively simple Harvard architecture RISC CPU, loosely based on the MIPS architecture and instruction set. It is pretty configurable, so you can for example disable the hardware multiplication and division support to save logic gates, and run a software implementation instead. It also has support for cache and MMU, and an optional FPU, but I’m going to strip it of all those (unnecessary) addons and just keep the CPU core.

Bus

Next on the list is the bus interface. In my opinion, a well-defined bus interface is a cornerstone of a good SoC design, and must not be overlooked or brushed over quickly. All of the components in an SoC will communicate through this bus, so it is best that it is well designed and thought through at the beginning, so that no strange errors pop up all of a sudden when you add a component somewhere down the line. There are quite a few bus architectures intended for SoC interconnect to choose from, like Wishbone, APB, AHB, etc., but I chose the QMEM bus, especially for its simplicity and speed. As will be explained later, the QMEM bus is not much more than a standard synchronous memory bus with added flow control and optional tags attached to it.

Memory

There’s not much to say about memory, besides that it is needed. At the very least, a small ROM is needed for the bootloader or a sort of a monitor program, that can write and read to memory and registers, and load firmware from the serial port or SD card, but I’ll talk about that later. Besides the ROM, the CPU will need some form of RAM, either from FPGA’s internal memory blocks, or an external memory like SRAM or SDRAM. The SoC should preferably support all of these variants.

I/O

The SoC will need support for some standard external communication protocols. The most important one is a UART, or a serial port, which can be used for debugging, controlling the SoC system with the help of the monitor firmware, uploading new firmware, etc. Another very useful protocol is an SPI master, so the SoC can talk to an SD card and load files from it. A GPIO (general purpose I/O controller) can be added to the SoC, so a range of pins can be controlled for digital I/O. I plan to add support for many other I/O channels later on, like a VGA controller, audio output, including sigma-delta DAC and DAC IC support, and other such interfaces.

Other important components

The SoC needs a couple of other standard components, like clock management, reset synchronization module, an interrupt controller, and a timer or two.

Frequency considerations

Another thing to consider is at what frequencies the SoC should run. Usually, in an ASIC product, this would need to be a balance between power consumption and processing speed, limited by the particular process node limitations. Luckily, for an FPGA project, power consumption is not so important, especially taking into account the large static power consumption of FPGAs, so the frequency can be more easily selected based on actual needs.

Personally, I like the trio of 25/50/100MHz frequencies, and I’ll explain what I mean and why.

An SoC will very probably contain an SDRAM controller, and the 100MHz is the max frequency of many SDRAM ICs. Of course, many can run at higher frequencies, up to 166MHz, but the 100MHz is a safe bet to work with any SDRAM IC.

Next up, 50MHz is in my opinion a nice operating speed for most of the logic in the FPGA, as it nicely balances the number of flip-flops required for the combinational logic between them to work at this frequency, without wasting a lot of LEs. Any CPU for the qSoC should be capable of working (at least) at this frequency.

25MHz is the maximum speed at which an SD card operates over the SPI bus, and the 25MHz clock can simply be generated by a toggling flip-flop running on the 50MHz clock.

The 50MHz and 25MHz are also frequencies that can be used as pixel clocks for two VGA video resolutions: a 640×480 VGA with a 25MHz pixel clock and an 800×600 SVGA with a 50MHz clock.

There’s another benefit to using the frequencies that are nice multiples – you can program the PLL in the FPGA to generate these frequencies that are synchronous to each other, which means that you don’t have to use clock-domain crossing logic, since all three clocks can be edge-aligned. This way, you can save gates, plus you don’t have to deal with the headaches that CDC will definitely bring.

Code repository & structure

The repository for this SoC experiment (which I named qSoC for Qmem SoC) is here:

https://github.com/rkrajnc/qsoc

Currently, there’s only OR1200 CPU RTL code there, together with some QMEM modules and a testbench for the OR1200 divider.

I like to keep all files in the repository nicely organized into these directories:

  • rtl – for all common synthesizable Verilog / VHDL code
  • fw – for all common CPU firmware code
  • tools – for all tools / scripts needed for building / converting etc. files needed for the qSoC
  • bench – for Verilog top-level testbench files
  • ver – for any scripts needed for verification / benchmarking
  • fpga – for any FPGA board specific files, like Quartus or ISE project files, sdc files etc.

The directory structure might change in the future, as I’d like to keep the projects I build with this SoC in the same repo, so I’ll probably have to add a project directory, with any project-specific files, but I’ll cross that bridge when I get to it.

Boards

I plan to support at least these boards:

More could be added in the future.

Planned projects

The first project I plan to make is a simple WAV audio player, so I’ll be able to test the sigma-delta implementation used in the minimig. After that, probably something involving a VGA controller, a character generator, perhaps a whole system capable of running simple 2D SDL games. Another interest is definitely some sound processing / generation projects, like a MIDI synthesizer, or an FPGA guitar effect. We’ll see.

Coming next up, a few more words on the selected CPU and its modifications.

Upgrading minimig with AGA capabilities, Part 2

So, to continue from Part 1, the rest of additions in Denise.

Denise bitplane updates

First, of course, the bitplane count needs to be upgraded from ECS’ 6 bitplanes to AGA’s 8 bitplanes, together with everything dependent on it, like collision detection for 8 bitplanes, two additional bitplane output buffer registers, playfield support, etc. Quite a lot of code changes in this step, but not hard to do – basically, I just followed the existing code and used common sense and the AGA.guide document.

The tricky part is the 64bit bitplane shifter and accompanying logic. This requires upgrading the existing 16bit output shifter to 64bits, together with new scroller implementation. I spent quite some time on this, and I know it doesn’t work correctly yet, as some games scroll very strangely – moving a few pixels in one direction and then ‘jumping’ back a lot of pixels and so on. The scroller will definitely need to be fixed sometime in the future. The scroller / shifter implementation seems to be dependent on selected resolution *and* the fetch mode, I just don’t see how it fits together. There’s also some undocumented ‘extra_delay’ involved here, I just don’t understand it yet nor see where it comes from.

Denise sprite updates

Sprites are somewhat easier, once you realize how the 64bit fetches fit in. The sprite output data gets extended from four bits to eight bits by adding four new OSPRM/ESPRM bits to MSB of the output. The trick is that the 64bit fetches are only for the sprite data, not for the sprite control word – I figured that out after reading some demo coder description of how the sprite control & data need to be laid out in memory for wide sprites to work. Still missing is the part of sprite scandoubling that can ‘double’ the sprite horizontally, I haven’t figured that out yet.

Agnus bitplane DMA updates

This is the part I spent the most time on. Agnus contains a bitplane DMA sequencer that runs different DMA sequences according to the resolution and the selected number of bitplanes. To support AGA’s eight bitplanes and two additional fetch modes, the sequencer had to be extended with the new sequences. I studied this a lot and used many pieces of paper drawing diagrams and sequences which could work. In the end, I decided on the simplest approach with the least changes required – as I did for every other AGA change – and it seemed to work. It works very simply now: the sequencer has five ‘programs’, which cover all resolutions (lores, hires & shres) and all three fetch modes. All odd planes come first, followed by even planes, with plane 0 being always the last, since fetching plane 0 starts the parallel-to-serial converters in Denise (and enables sprites!). Another trick is that two of the sequences have free cycles following bitplane DMA cycles. The sequence length is also increased from ECS’ eight cycles to AGA’s 32 cycle sequences.

I made a program visualizing the five different encodings of the sequencer:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    unsigned int ddfseq;

    /* ECS sequencer: 8-cycle sequence, built from the bit-reversed, inverted counter */
    printf("old sequencer:\n");
    printf("ddfseq shres hires lores\n");
    for (ddfseq = 0; ddfseq < 8; ddfseq++) {
        unsigned int ddfseq_neg = (~ddfseq);
        unsigned int shres = (((ddfseq_neg&1)?1:0)<<0);
        unsigned int hires = (((ddfseq_neg&1)?1:0)<<1) | (((ddfseq_neg&2)?1:0)<<0);
        unsigned int lores = (((ddfseq_neg&1)?1:0)<<2) | (((ddfseq_neg&2)?1:0)<<1) | (((ddfseq_neg&4)?1:0)<<0);
        printf("%01u %01u %01u %01u\n", ddfseq, shres, hires, lores);
    }
    printf("\n");

    /* AGA sequencer: 32-cycle sequence, five 'programs' */
    printf("new sequencer:\n");
    printf(" mode 1 = 2-fetch sequence (SHRES FMode = 0)\n");
    printf(" mode 2 = 4-fetch sequence (HRES FMode = 0, SHRES FMode = 1)\n");
    printf(" mode 3 = 8-fetch sequence (LRES FMode = 0, HRES FMode = 1, SHRES FMode = 3)\n");
    printf(" mode 4 = 8-fetch sequence followed by 8 free cycles (LRES FMode = 1, HRES FMode = 3)\n");
    printf(" mode 5 = 8-fetch sequence followed by 24 free cycles (LRES FMode = 3)\n");
    printf("ddfseq 01 02 03 04 05\n");
    for (ddfseq = 0; ddfseq < 32; ddfseq++) {
        unsigned int ddfseq_neg = (~ddfseq);
        unsigned int m1 = (((ddfseq_neg&1)?1:0)<<0);
        unsigned int m2 = (((ddfseq_neg&1)?1:0)<<1) | (((ddfseq_neg&2)?1:0)<<0);
        unsigned int m3 = (((ddfseq_neg&1)?1:0)<<2) | (((ddfseq_neg&2)?1:0)<<1) | (((ddfseq_neg&4)?1:0)<<0);
        /* modes 4 and 5 add the non-inverted upper counter bits on top of mode 3 */
        unsigned int m4 = (((ddfseq &8)?1:0)<<3) | (((ddfseq_neg&1)?1:0)<<2) | (((ddfseq_neg&2)?1:0)<<1) | (((ddfseq_neg&4)?1:0)<<0);
        unsigned int m5 = (((ddfseq &16)?1:0)<<4) | (((ddfseq &8)?1:0)<<3) | (((ddfseq_neg&1)?1:0)<<2) | (((ddfseq_neg&2)?1:0)<<1) | (((ddfseq_neg&4)?1:0)<<0);
        printf("%02u %02u %02u %02u %02u %02u\n", ddfseq, m1, m2, m3, m4, m5);
    }

    exit(EXIT_SUCCESS);
}

I made one mistake in the bitplane DMA implementation that took me a while to fix – I forgot to change how much the bitplane pointers advance according to the fetch mode. I made numerous tests in AsmOne trying the 64bit bitplane fetches, and they simply didn’t work, but after many reviews of the code I saw my mistake – the code needs to advance the bitplane pointers by 1 for 16bit fetches, 2 for 32bit fetches and 4 for 64bit fetches. Duh 😉 After fixing the bitplane modulos, the bitplane DMAs were ready.

Agnus sprite updates

After figuring out the bitplane DMA updates, the sprite DMA was pretty easy – just advance the sprite data pointers by an appropriate amount, and that is it!

Aaand …. done (almost)!

One of the last things I changed was the Denise ID register, updating it with the proper AGA ID value. Once you do that and start Workbench, SetPatch will detect an AGA Amiga and try to enable 64bit bitplane fetch mode. Let’s just say the result wasn’t very good at first, but after lots of small bits fixed here and there, I got a perfect-looking desktop with 4x fetch mode enabled! Even some games worked, although most of them were a little slow, since the CPU bandwidth to the chipRAM and kickstart is still as slow as on ECS Amigas (the CPU on AGA Amigas can access chipRAM through a 32bit bus, and kickstart uses two ROM ICs, making it accessible through a 32bit bus as well).

That concludes the current minimig AGA implementation. There are still some missing features, like bitplane & sprite scandoubling, plus speeding up CPU access to chipRAM & kickstart, as mentioned above. All in all, I spent around two weeks working on this: a week of afternoon work for upgrading the minimig design to a single 28MHz clock and a week when I was on sick leave for most of the AGA stuff. The bitplane DMAs and the bitplane output shifters were definitely the trickiest parts to get right.

And for any reader that managed to read through this very long two-post rambling – congratulations for your persistence! I’ll invite you to a beer if we ever happen to meet!

If you want to take a look at the minimig code, my repository is here: https://github.com/rkrajnc/minimig-mist