Mandelbrot set

Mandelbrot set on FPGA, Part 2 – Modelling & First FPGA implementation

In the previous post, I introduced my idea of implementing the Mandelbrot set on an FPGA and described the Mandelbrot set equation and the Escape-time coloring algorithm. This post will talk about the implementation.

 

Modeling the Mandelbrot set

I consider it a good idea to start any FPGA implementation work with some planning, and in the case of an algorithm, the first thing you’d want to do is write a simple implementation of it yourself, just so you can verify that your understanding of the algorithm is correct. My implementation is a simple C program using double types, which outputs an image in the PPM format.

The implementation generally does the following:

  • sets up a palette for the image
  • writes the appropriate headers for the image
  • walks over all pixels of the output image
  • maps the current pixel coordinate into the mathematical Mandelbrot space
  • calculates the Mandelbrot set equation for up to maximum iterations, or until the value limit is reached
  • uses the number of iterations as an index into the RGB palette
  • writes the RGB values to the output image

Running this program will produce the following image:

Mandelbrot set

Mandelbrot set

Seems to work just fine!

Since floating-point data types are not really appropriate for FPGAs, you need to consider how to convert the floating-point arithmetic to integer arithmetic, or more specifically, to fixed-point arithmetic. Generally, this can be quite a tiresome step, since you need to carefully consider how many integer / fractional bits to use in each step, when to cut / round bits, weighing in hardware costs, etc.

Luckily, the Mandelbrot set calculation is relatively simple and limited in range, so you can basically eyeball the correct number of bits to use. My implementation is using a fixed point format s4.49 (s represents the single sign bit, the number 4 before the decimal point is the number of integer bits and the number 49 after the decimal point is the number of fractional bits).

For this algorithm, the most critical parts are before the decimal point – we need signed numbers, and we can eyeball the maximum value of the calculations to always be smaller than 16.0, so 4 integer bits are needed. The fractional part seems arbitrary and the reasoning will be explained a little later, but this number represents the precision of our calculations, which are important when zooming into the Mandelbrot set.

Another thing to consider is the maximum number of iterations for a single point. For the moment, I decided to go with 256 maximum iterations, mostly because that number nicely fits in an 8bit variable. But this decision is pretty much arbitrary, as the most appropriate number of iterations is hard to pinpoint. Maybe I’ll revisit this and implement some sort of dynamic approach according to the current ‘scene’ – position and zoom. For now, it should do.

 

FPGA implementation preparations

This is my first time working on the Terasic DE10-nano FPGA board, so some introductory research was needed. Terasic has a lot of documentation on all of its boards available on its site which is worth at least a cursory glance. The most useful one is the board schematic and their example projects.

From the schematic, you can see that the HDMI TX IC used is the Analog Devices’ ADV7513, and that is excellent since the interface to the FPGA TX is pretty much the same as the one for the ADV7123, which is used on a lot of FPGA boards for VGA analog output. The HDMI TX also requires some configuration through the I2C bus, I decided to just reuse the code from the sample HDMI sample project for this part.

Terasic also provides software to create the Quartus project for you, which is great, because manually editing pin constraints is not a fun exercise.

DE10-nano Systembuilder

DE10-nano Systembuilder

Of course, you’d want to first test any new FPGA board with the vendor-provided software – I tried the Linux running on the DE10-nano:

Linux running on the DE10-nano

Linux running on the DE10-nano

Another important document worth a glance is the FPGA device user manual, or in this case, the

. While an FPGA device is generally pretty configurable, there are some fixed blocks inside, which we’ll use generously.

One of the important ‘fixed’ blocks are the memory blocks. From the handbook, you can see that the memory blocks can be configured in 1k x 10 or 1k x 8 sizes, which will be used extensively in this project, as will be described later.

Another important block is the so-called DSP block, which is basically a multiplier with some additional logic. The most important property of the DSP block is the supported input operand width. From the handbook, you can see that a single DSP block supports up to 27×27 multiplier operands, and DSP blocks can be chained together. For better precision in the Mandelbrot calculations, I decided to use two DSP blocks for a single multiplication – this closely matches the precision of a double floating-point datatype on a PC and will provide solid Mandelbrot zooming. If you wondered where that number 49 in the selected fixed-point format (s4.49) above comes from, this is it – since the whole multiplier input is 54 bits, and we need to take into account the sign and integer parts, we can get a maximum of 49 fractional bits from the FPGA hardware.

 

FPGA implementation

Besides the basic scaffolding for the project, you’d usually want to get the clocks, PLLs, and resets working – all three can easily be tested with the Hello World of FPGAs – a LED blinky.

Video

After that, some real work – the video sync generator, which is basically just a couple of counters and comparators – you need to count the video columns and rows and output an active signal together with horizontal and vertical sync signals. You can see my implementation here. If you output the counters from the video sync generator module, together with some helper signals, you can already generate a video output. See my version here, and the output below:

Sample HDMI output

Sample HDMI output

With the video sync signals working, it’s time to decide how to handle the actual pixels. Since the Mandelbrot set implementation has a limited number of iterations as its output, the logical choice is to use a color palette. That implies two sets of memories for the final pixel output – a video index RAM, which holds the palette index (and is basically just a number of iterations for that particular pixel), and a palette memory. The index RAM requires image_width x image_height entries that are 8bits wide and the palette memory then has 2^8 (256) entries and must be 24bits (8+8+8 for RGB) wide. You can see my implementation here.

I also wrote an implementation of a text console, since I wanted to put some textual information on the video output, like the current position & zoom, and the number of iterations and time required to render the whole screen. The text console is in a way similar to the normal pixel output, in that you have a character index RAM and a palette of sorts, in this case, a character ROM memory. What is different is how you access the pixels of the characters since you need to read the same character multiple times – font_width times horizontally, and font_height times vertically. You can see my implementation here, and the output of the console, and some color background below:

Video text console

Video text console

Mandelbrot set calculation

Mandelbrot set calculation can be thought of as consisting of two parts:

  • a location walker, which must go over all pixel coordinates and output the coordinates in mathematical space
  • an iteration calculator, which must calculate the number of iterations for each coordinate

These two parts strongly imply a sort of producer-consumer relation – since the coordinates can be calculated quickly and easily, once you figure out your initial position (upper left corner) and the step in horizontal and vertical directions (Mandelbrot set position and zoom), but the per-pixel iterations take a number of cycles on average. I split the Mandelbrot set calculations into a mandelbrot_coords module and a mandelbrot_calc module, with a FIFO in between.

The coordinate calculation is pretty easy, it basically boils down to a couple of adders:

always @ (posedge clk, posedge rst) begin
  if (rst) begin
    cnt_x   <= #1 VMINX;
    cnt_y   <= #1 VMINY;
    cnt_adr <= #1 'd0;
    cnt_en  <= #1 1'b1;
    man_x   <= #1 MAN_DEF_X0;
    man_y   <= #1 MAN_DEF_Y0;
  end else if (clk_en) begin
    if (init) begin
      cnt_x   <= #1 VMINX;
      cnt_y   <= #1 VMINY;
      cnt_adr <= #1 'd0;
      cnt_en  <= #1 1'b1;
      man_x   <= #1 MAN_DEF_X0;
      man_y   <= #1 MAN_DEF_Y0;
    end else if (out_rdy && cnt_en) begin
      if (cnt_x == VMAXX) begin
        if (cnt_y == VMAXY) begin
          cnt_en  <= #1 1'b0;
        end else begin
          cnt_y   <= #1 cnt_y +'d1;
          cnt_adr <= #1 cnt_adr +'d1;
          man_y   <= #1 man_y + inc_y;
          $display("Mandelbrot: processing line %d", man_y);
        end
        cnt_x <= #1 VMINX;
        man_x <= #1 MAN_DEF_X0;
      end else begin
        cnt_x   <= #1 cnt_x + 'd1;
        cnt_adr <= #1 cnt_adr + 'd1;
        man_x   <= #1 man_x + inc_x;
      end
    end
  end
end

I had a simulation/implementation mismatch here, which took a while to figure out – the problem is that Quartus can only do calculations with Verilog parameters with 32bit numbers, while the Icarus Verilog simulator seems to use larger datatypes (come on Intel, step up your game – Quartus seems really *really* dated compared to Vivado, this problem, really poor SystemVerilog support, … and this is coming from someone who always preferred Intel / Altera FPGAs …).

For the iteration calculation, the implementation is, again,  simple (deceptively so!):

assign xx_mul_comb  = x*x;
assign yy_mul_comb  = y*y;
assign xy_mul_comb  = x*y;
assign xx_comb      = xx_mul_comb[2*FPW-1-FP_S-FP_I:FPW-FP_S-FP_I];
assign yy_comb      = yy_mul_comb[2*FPW-1-FP_S-FP_I:FPW-FP_S-FP_I];
assign xy2_comb     = {xy_mul_comb[2*FPW-2-FP_S-FP_I:FPW-FP_S-FP_I], 1'b0};
assign limit        = {1'h0, 4'h4, {FP_F{1'h0}}}; // 4.0
assign niters_check = niters[IW+1-1:1] >= (MAXITERS-1);
assign limit_check  = (xx + yy) > limit;
assign check        = niters_check || limit_check;
assign x_comb       = xx - yy + x_man_r;
assign y_comb       = xy2 + y_man_r;

always @ (posedge clk) begin
  if (clk_en) begin
    if (in_vld && in_rdy) begin
      adr_o   <= #1 adr_i;
      x_man_r <= #1 x_man;
      y_man_r <= #1 y_man;
      x       <= #1 'd0;
      y       <= #1 'd0;
      xx      <= #1 'd0;
      yy      <= #1 'd0;
      xy2     <= #1 'd0;
      niters  <= #1 'd0;
    end else if(!check) begin
      x       <= #1 x_comb;
      y       <= #1 y_comb;
      xx      <= #1 xx_comb;
      yy      <= #1 yy_comb;
      xy2     <= #1 xy2_comb;
      niters  <= #1 niters + 'd1;
    end
  end
end

I was sure I’ll need to implement some sort of a state machine to handle initialization of the variables and the final check, but, by a lucky coincidence, this code works just fine – there is no need to delay the checking of x^2 + y^2 calculation, as the result starts from 0, so initially, it will always produce correct results.

 

Throwing all the code together will produce the following result:

Mandelbrot set on an FPGA

Mandelbrot set on an FPGA

So here it is, a Mandelbrot set calculated on an FPGA, in glorious 640×480 resolution with 27bit fixed-point precision! See all of my code in my mandelbrot_fpga project on Github, tagged with v0.4.

 

Next steps

With the initial version working, the project still needs some additions to make it better and more useful:

  • increasing the precision to 54bits – this was omitted since Quartus doesn’t like 64bit integers in parameters, and this needs to be handled slightly differently
  • moving the Mandelbrot calculations to a separate, faster clock – this was the point after all – speed! Currently, the ~25MHz video clock is used for all logic
  • multiplying the mandelbrot module to use up all available DSP elements of the (pretty generous) Cyclone V device on the DE10-nano board
  • better pipelining for the multipliers – a couple of registers need to be added, and a single calculation module updated to handle multiple coordinates at once
  • making the logic more configurable – currently, a lot of config is handled with params, which will need to be proper inputs
  • convert the project to be MiSTer-compatible
  • last but certainly not least – add the ability to actually zoom and move around the Mandelbrot set – I’ll probably just add a soft-core CPU to handle this part, with inputs from keyboard/joypad

 

Stay tuned for the next post!

Mandelbrot

Beautiful graphics from simple math – Mandelbrot set on FPGA, Part 1

Introduction to the Mandelbrot set

Like I mentioned in the previous post, I wanted to familiarize myself with the HDMI output on the DE10-nano FPGA board, and there is no harm in getting some pretty graphics along the way, so I decided to implement a Mandelbrot set viewer on FPGA.

If you are not familiar with the Mandelbrot set, it is a type of a fractal, which is simply put an infinitely – repeating recursive pattern. That means that you can zoom into the set and discover ever repeating, self-mirroring patterns. They are also a little weird ;). You can read more about them on the linked Wikipedia article.

The familiar look of the Mandelbrot set is depicted in this picture:

Mandelbrot set

Mandelbrot set

For the formal definition of the Mandelbrot set, I’ll just quote the Wikipedia article:

The Mandelbrot set is the set of values of c in the complex plane for which the orbit of critical point z= 0 under iteration of the quadratic map  remains bounded. Thus, a complex number c is a member of the Mandelbrot set if, when starting with z0 = 0 and applying the iteration repeatedly, the absolute value of zn remains bounded for all n > 0.

For example, for c = 1, the sequence is 0, 1, 2, 5, 26, …, which tends to infinity, so 1 is not an element of the Mandelbrot set. On the other hand, for c = −1, the sequence is 0, −1, 0, −1, 0, …, which is bounded, so −1 does belong to the set.

Which put in simple terms means that the Mandelbrot set can be calculated simply by iteratively calculating the equation for all “interesting” complex numbers , and starting the iteration with . The complex number is expressed as x + iy, where x and y can be mapped to the horizontal and vertical position on the screen. If the iteration doesn’t escape to infinity, then is part of the set.

That handles the “black&white” part. You can then decide how to color the image. A popular (and simple) algorithm you can use is the Escape time algorithm, which selects the color of each individual pixel based on whether the pixel is part of the set or not, and if not, how many iterations it took to “escape”.

The simplified Escape time algorithm can be represented in pseudocode like this:

x2 := 0
y2 := 0

while (x2 + y2 ≤ 4 and iteration < max_iteration) do
  y := 2 × x × y + y0
  x := x2 - y2 + x0
  x2 := x × x
  y2 := y × y
  iteration := iteration + 1

The Mandelbrot set colored with the Escape time algorithm has some properties that are very helpful if you want to calculate it on an FPGA:

  • the set is closed and contained in a closed disk of radius 2 – if the distance from the origin (starting point) is bigger than 2 for the current iteration, then that origin is not part of the set, and you can stop iterating
  • since the set has a limited radius, that means that the range of numbers we need to work with is also limited – important if you want to use integer only math
  • the pixel values are independent of one another – you can calculate all pixels in parallel
  • not a lot of iterations are needed for each pixel
  • the equation to calculate it is simple, probably the simplest of all fractals that I know of

 

So, to implement the algorithm on an FPGA, you can already see the requirements:

  • memory to hold the calculated pixel values – how much depends on the FPGA resources, and the amount of available memory will determine the maximum resolution
  • the maximum number of iterations will be also limited by the memory – using 8, 9 or 10 bit memory width will provide the best resource utilization
  • as can be seen from the pseudocode above, a lot of multipliers will be needed, ideally, you would use all the FPGA DSP elements available
  • the multipliers will need to be quite wide, at least 27 bits, since you’d want to zoom deep into the Mandelbrot set

 

This covers the introduction to the Mandelbrot set. I’ll go deeper into the implementation in the next post. Cheers!

 

qSoC – The QMEM bus

QMEM bus specification

This post describes the QMEM bus, the different cycles allowed, the bus elements and different bus configurations supported.

1. Introduction
2. Features
3. Signals Description
4. Cycles Description
5. Bus Elements
6. Bus Configurations

Introduction

QMEM (abbreviated from quick memory), is a flexible, portable, simple and fast system interconnect bus, specifically targeted at SoC systems for their inter-chip communication needs.

QMEM is based on synchronous memory bus with added flow control signals, which makes it very simple and fast. The origin of QMEM is the OR1200 open-source CPU implementation, where it was used as a tightly-coupled memory (TCM) bus inside the CPU.

Features

  • flexible: endian-independent, supports different data and address widths, flexible access speeds, flexible interconnect methods like point-to-point, shared bus, multi-layered interconnect
  • portable: fully vendor-, tool-, language- and technology independent
  • simple: based on synchronous memory bus with added flow control, it is the simplest bus with minimal bus interconnect logic
  • fast: fully pipelined, single cycle reads and writes, with no setup or end cycles
  • extensible: allows any number of transfer tags added to support different features, like master identification, slave error reporting, etc

Signals description

  • cs – master output signal denoting valid master cycle when cs=’1′, and idle cycle when cs=’0′
  • we – master read/write select signal, denotes write cycle when we=’1′, and read cycle when we=’0′
  • sel – master byte select signal, one bit for each byte in data words, generally only taken into account during write cycles and ignored during read cycles, sel=’1111′ denotes four bytes, sel=’0011′ denotes lower two bytes, sel=’0100′ denotes single byte
  • adr – master address signal
  • dat_w – master write data
  • dat_r – slave read data
  • ack – slave cycle acknowledge, asserted and valid when master cs=’1′

All signals are active-high. Optionally, clock (clk) and reset (rst) signals can be considered part of the QMEM bus, especially if the bus uses different clock domains that the rest of master or slave logic. Other common optional signals are slave error response (err), and the master id signal (mid).

The master can start the cycle at any time (synchronously to the clock), by asserting cs=’1′, and set any other signals as appropriate. The cycle ends with the slave acknowledge (ack=’1′). After the slave acknowledges the cycle, the master is free to start a new cycle immediately, by keeping cs=’1′, or going to an idle state by asserting cs=’0′.

Cycles description

Reset condition

In reset state (rst=’1′), all QMEM bus signals should be ignored, and their state can be undefined. The first cycle out of reset should be initialized as an IDLE cycle.

IDLE cycle

IDLE cycle is denoted when master cs=’0′ and slave ack=’0′. There is no activity on the bus, other than the possible slave read data (dat_r), if the previous cycle was a read cycle.

QMEM reset state & idle cycle

QMEM reset state & idle cycle

WRITE cycles

A WRITE cycle is denoted when master asserts cs=’1′, we=’1′, and puts the desired address on adr, the byte-select on sel and data to be written on dat_w. The master must not change any of its signals, or stop the cycle by asserting cs=’0′, without receiving the slave acknowledge (ack=’1′) first. The slave can insert any number of delay cycles by holding ack=’0′ while the master asserts cs=’1′, until it is ready to service masters’ request. The master can start a new cycle immediately after synchronously detecting slave acknowledge response (ack=’1′).

QMEM single write cycle with no delay

QMEM single write cycle with no delay

QMEM single write cycle with 1 cycle delay

QMEM single write cycle with 1 cycle delay

QMEM multiple write cycles with no delay

QMEM multiple write cycles with no delay

QMEM multiple write cycles with 1 cycle delay

QMEM multiple write cycles with 1 cycle delay

READ cycles

A READ cycle is denoted when master asserts cs=’1′, we=’0′, and puts the desired address on adr, and the byte-select on sel . The master must not change any of its signals, or stop the cycle by asserting cs=’0′, without receiving the slave acknowledge (ack=’1′) first. The slave can insert any number of delay cycles by holding ack=’0′ while the master asserts cs=’1′, until it is ready to service masters’ request. The master can start a new cycle immediately after synchronously detecting slave acknowledge response (ack=’1′).

QMEM single read cycle with no delay

QMEM single read cycle with no delay

QMEM single read cycle with 1 cycle delay

QMEM single read cycle with 1 cycle delay

QMEM multiple read cycles with no delay

QMEM multiple read cycles with no delay

QMEM multiple read cycles with 1 cycle delay

QMEM multiple read cycles with 1 cycle delay

MIXED cycles

A QMEM master is free to mix READ, WRITE and IDLE cycles any way it chooses. The slave must be ready to respond to a master WRITE cycle, even if it is in the same clock period as the previous cycle’s master read request, since reads are pipelined.

QMEM mixed cycles with no delay

QMEM mixed cycles with no delay

QMEM mixed cycles with 1 cycle delay

QMEM mixed cycles with 1 cycle delay

ERROR cycle

The QMEM bus has an optional err signal. The slave must keep this signal tied to ground (err=’0′), unless it wishes to communicate an error condition to the master. The slave can do that by asserting err=’1′ at the same time it is acknowledging the cycle with ack=’1′. Usually, the error signal is high if the slave is in reset. Another case where a slave might raise the error condition, is if the master is trying to address a memory or register that is bigger than the size of the slave memory.

QMEM error cycle

QMEM error cycle

QMEM bus elements

QMEM bus has four major bus components: masters, slaves, arbiters and decoders.

QMEM master

A QMEM master is a master device on the bus. It can start cycles, set bus direction and number of bytes affected, sets data to be written or reads data from slaves.

An example of a QMEM master (non-synthesizeable) can be seen here: qmem_master.v

QMEM slave

A QMEM slave responds to master cycles, either writing the master data to its memory or registers, or reading its memory or registers and sending them to the master.

An example of a QMEM slave (non-synthesizeable) can be seen here: qmem_slave.v

QMEM arbiter

An arbiter is a bus element that decides which master has access to a slave device in a given cycle. Each QMEM slave that is accessed by multiple masters must have an arbiter. An arbiter can grant masters access to the slave on priority-basis, it can use a round-robin scheme, or a combination of the two.

An example of a priority-based QMEM arbiter (synthesizeable) can be seen here: qmem_arbiter.v

QMEM decoder

An decoder is a bus element that directs master requests to an appropriate slave, based on a slave decoding scheme, which is usually address-based. Each QMEM master that accesses multiple slaves must have a decoder. This only applies to immediate connections, so in a shared bus configuration where the masters connect to a single arbiter (= single slave), there is no need for a decoder on the master’s side (the slave side of the arbiter in this bus configuration could still have a slave decoder attached, if there are multiple slaves).

An example of a QMEM decoder (synthesizeable) can be seen here: qmem_decoder.v

There are other elements that can be attached to a QMEM bus, like a bus register stage, which can register master, slave, or both side of the bus, a bus monitor that validates bus signals according to the rules, and bus converters, which can convert form and to QMEM bus from other bus architectures, like APB, AHB and Wishbone. These still need to be written, so I’ll write a

QMEM bus configurations

A QMEM bus can be built in multiple different configurations, depending on the speed or logic utilization needs. Some of the common configurations are: shared bus, point-to-point and multilayer.

In the following graphs, the mX represents masters, aX aribters, dX decoders and sX slaves.

QMEM shared bus

QMEM shared bus

QMEM point-to-point

QMEM point-to-point bus

QMEM multilayer bus

QMEM multilayer bus

CPU (from: https://www.flickr.com/photos/2top/10402551773/)

qSoC – The OR1200 CPU

The OR1200 is a RISC-type, Harvard architecture (separate instruction and data buses) synthesizable CPU core, written by the OpenCores community.

It can be configured with a number of optional components, such as cache, MMU, FPU, timer, programmable interrupt controller, debug unit, etc. For sake of simplicity, I decided to disable most of the optional components, except the hardware multiplier and divider, all other features will be added if/when needed.

The OR1200 has standard GNU tools available, like gcc, binutils, and a few standard libraries, like uClibc and newlib. There is also an official port of linux available.

One of the requirements was that the CPU would be able to run with a 50MHz frequency on a CycloneIII – class FPGA. The hardware divider implementation used in OR1200 is a 8-cycle divider, which was on a critical timing path, and needed to be reduced to a 16-cycle divider, which, while slow, is still heaps faster than a software division implementation. The changes are in this commit: 01a18ffbbb86b074. There’s a testbench for the updated division code in this commit: 8232f830c60a1166.

There are other settings (defines) available in the or1200_defines.v file, you can tune some of them to get a better utilization or speed from the code. One of the things that I changed to get some more speed was the type of the compare used in the ALU (the change is in this commit: 337217f9511888c2).

The OR1200 version used here is not exactly in sync with the official OR1200 repository, as this core was split from the original a long time ago and changed a lot and the changes didn’t propagate back to the original repo. One of the important changes, besides some bug fixes, is the different bus – QMEM, replacing the original Wishbone bus.

With most of the optional components removed, and with the QMEM bus, the OR1200, as used in this project, is a nice, small but fast little softcore CPU.