# 2D DCT-II algorithm hardware-software codesign for image compression in RISC-V satellite processor

Hugo Mårdbrink, 2024

## Introduction

This project aims to design and implement a 2D DCT-II algorithm in hardware and software for image compression in a RISC-V satellite processor.

RISC-V processors are currently on the rise, [notably in the space industry](https://gaisler.com/index.php/products/processors/noel-v#DOC).

The 2D DCT-II algorithm transforms blocks of pixels into blocks of frequency coefficients. This enables compression by concentrating the image energy into a few low-frequency coefficients, removing spatial redundancy.

Thus, the algorithm is used in spacecraft to decrease the amount of image data that needs to be transmitted back to Earth.
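
Concretely, for an N × N block x, the DCT-II computes coefficients X as follows (this is the formula the implementations below evaluate, with c corresponding to the `cu` and `cv` factors in the code):

```math
X_{u,v} = c(u)\,c(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x_{i,j}
\cos\!\left[\frac{(2i+1)\,u\,\pi}{2N}\right]
\cos\!\left[\frac{(2j+1)\,v\,\pi}{2N}\right],
\qquad
c(k) = \begin{cases} 1/\sqrt{N} & k = 0 \\ \sqrt{2}/\sqrt{N} & k > 0 \end{cases}
```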

## Goals

Since power and hardware budgets on a satellite are tightly constrained, the design needs to focus on energy efficiency and a small hardware area.

This shifts the focus of the codesign towards energy efficiency rather than raw throughput or execution time.

However, fast execution is still highly relevant, and a good balance between the two needs to be explored.

Notably, current RISC-V space processors have no vector processing units, making vectorisation an interesting aspect to evaluate.

## Method

### Development and evaluation

The software will be written in C and built with the GCC RISC-V cross-compiler. For vectorisation, RVV vector intrinsics will be used via a C header file.

For parallelisation, the [OpenMP library](https://www.openmp.org/) will be used.

To test and evaluate the software implementation, it will be run in the gem5 simulator. The hardware configuration is likewise done in gem5 configuration files.

The mock image data will be generated in C with arbitrary values. The actual values do not matter, since they do not affect the run time of the transform.

When measuring performance, the sequential time spent generating mock data and freeing memory will be deducted, so that the measurement reflects the transform itself.

For the parts where the problem size increases, performance will be measured in cycles per DCT block.

### Building

To build the software, the following base command is used:

```bash
riscv64-unknown-elf-gcc -march=rv64imadcv -mabi=lp64d main.c -o dct2d_riscv.out
```

The following flags will be added based on what functionality is needed:

- `-lm` to link the math library
- `-fopenmp` to enable OpenMP
- `-O[level]` for different optimisation levels
- `-march=rv64imadcv` for the RISC-V ISA extensions (including vector)
- `-mabi=lp64d` for the RISC-V ABI
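
For instance, a parallel build that links the math library at a mid optimisation level might combine these flags as follows (the choice of `-O2` here is illustrative, not a fixed project setting):

```shell
riscv64-unknown-elf-gcc -march=rv64imadcv -mabi=lp64d -O2 -fopenmp main.c -o dct2d_riscv.out -lm
```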

### Simulating

To simulate the software on different hardware configurations, gem5 is used.

gem5 allows different hardware configurations to be tested using a Python script.

The Python script is tailored specifically to this project and, for ease of use, exposes six custom parameters:

- `--l1i` for the L1 instruction cache size
- `--l1d` for the L1 data cache size
- `--l2` for the L2 cache size
- `--vlen` for the vector length
- `--elen` for the element length
- `--cores` for the number of cores

To run the simulation and output the result, the following command is used:

```bash
../gem5/build/RISCV/gem5.opt -d stats/ ./riscv_hw.py --l1i 16kB --l1d 64kB --l2 256kB --vlen 256 --elen 64 --cores 1
```

## Implementation

### Initial hardware configuration

For the initial, naive software implementation, the following hardware configuration is used:

- L1 instruction cache size: 16kB
- L1 data cache size: 64kB
- L2 cache size: 256kB
- Vector length: 256 bits
- Element length: 64 bits
- Number of cores & threads: 1
- L1 cache associativity: 2
- L2 cache associativity: 8

This is by no means an optimal configuration, but to find the optimum we first need to optimise the software implementation. That makes this configuration good enough as a baseline.

### Constants and definitions

Throughout the code, several constants and type definitions are used to make it easy to try different configurations:

- `DCT_SIZE` is the size of the DCT block.
- `TOTAL_DCT_BLOCKS` is the total number of DCT blocks and, thus, the problem size.
- `NUM_THREADS` is the number of threads to use for parallelisation.
- `element_t` is the data type of the elements in a DCT block.
- `real_t` is the data type of the real-valued variables of the algorithm.

### Mock data generation

To start testing the algorithm, we need a way to generate data that gives reliable performance results.

This is done by allocating the DCT blocks on the heap and filling them with data.

It is important to actually generate all the blocks rather than reuse the same matrices, in order to get realistic cache hits and misses.

The memory allocation is done in the following way:
```c
element_t ***mock_matrices = (element_t ***) malloc(TOTAL_DCT_BLOCKS * sizeof(element_t **));
for (int i = 0; i < TOTAL_DCT_BLOCKS; i++) {
    mock_matrices[i] = (element_t **) malloc(DCT_SIZE * sizeof(element_t *));
    for (int j = 0; j < DCT_SIZE; j++) {
        mock_matrices[i][j] = (element_t *) malloc(DCT_SIZE * sizeof(element_t));
    }
}
```

And the data is generated using the following code:

```c
for (long i = 0; i < TOTAL_DCT_BLOCKS; i++) {
    for (int j = 0; j < DCT_SIZE; j++) {
        for (int k = 0; k < DCT_SIZE; k++) {
            mock_matrices[i][j][k] = j + k;
        }
    }
}
```

### Naive

The first iteration of the code (see below) uses a naive implementation of DCT-II, with no optimisations.

```c
void dct_2d(element_t **matrix_in, element_t **matrix_out) {
    real_t cu, cv, sum;
    int u, v, i, j;

    for (u = 0; u < DCT_SIZE; u++) {
        for (v = 0; v < DCT_SIZE; v++) {
            cu = u == 0 ? 1 / sqrt(DCT_SIZE) : sqrt(2) / sqrt(DCT_SIZE);
            cv = v == 0 ? 1 / sqrt(DCT_SIZE) : sqrt(2) / sqrt(DCT_SIZE);

            sum = 0;
            for (i = 0; i < DCT_SIZE; i++) {
                for (j = 0; j < DCT_SIZE; j++) {
                    sum += matrix_in[i][j] * cos((2 * i + 1) * u * PI / (2 * DCT_SIZE))
                                           * cos((2 * j + 1) * v * PI / (2 * DCT_SIZE));
                }
            }
            matrix_out[u][v] = cu * cv * sum;
        }
    }
}
```

This version serves as a baseline for further optimisations; simulating it yielded a performance of 62977442 cycles.

### Software optimisations

#### Compile time constants

Looking at the naive implementation, we can see some low-hanging fruit that can easily be optimised by evaluating constants at compile time.

Firstly, we can precompute the values of 1/sqrt(DCT_SIZE) and sqrt(2)/sqrt(DCT_SIZE), avoiding costly runtime calls to sqrt().

After doing this we get the following constants (for DCT_SIZE = 8):

```c
#define INV_SQRTDCT_SIZE (real_t) 0.3535533906
#define SQRT2_INV_SQRTDCT (real_t) 0.5
```

Because PI / (2 * DCT_SIZE) is a constant, we can precompute cos() at all 32 multiples of it (indices 0 to 31), which covers a full period of the cosine.

These values can then be stored in an array to eliminate the runtime calculations. This is also done at compile time, in the following way:

```c
#define DCT_COS_TABLE_SIZE 32

#define DCT_COS_TABLE (double[DCT_COS_TABLE_SIZE]) { \
    1, 0.980785, 0.92388, 0.83147, 0.707107, 0.55557, 0.382683, \
    0.19509, 0, -0.19509, -0.382683, -0.55557, -0.707107, -0.83147, \
    -0.92388, -0.980785, -1, -0.980785, -0.92388, -0.83147, -0.707107, \
    -0.55557, -0.382683, -0.19509, 0, 0.19509, 0.382683, 0.55557, \
    0.707107, 0.83147, 0.92388, 0.980785 }
```

This changes the way the sum is calculated to the following:

```c
sum += matrix_in[i][j] * DCT_COS_TABLE[((2 * i + 1) * u) % DCT_COS_TABLE_SIZE] * DCT_COS_TABLE[((2 * j + 1) * v) % DCT_COS_TABLE_SIZE];
```

Having eliminated unnecessary calculations, we can hoist some computations into outer loops to reduce redundant work.

These are found in the inner loops of the algorithm: values that should only be calculated once per outer-loop iteration are instead recalculated in the inner loops, leading to redundant operations.

```c
for (u = 0; u < DCT_SIZE; u++) {
    for (v = 0; v < DCT_SIZE; v++) {
        cu = u == 0 ? 1 / sqrt(DCT_SIZE) : sqrt(2) / sqrt(DCT_SIZE);
        cv = v == 0 ? 1 / sqrt(DCT_SIZE) : sqrt(2) / sqrt(DCT_SIZE);

        sum = 0;
        for (i = 0; i < DCT_SIZE; i++) {
            for (j = 0; j < DCT_SIZE; j++) {
                sum += matrix_in[i][j] * DCT_COS_TABLE[((2 * i + 1) * u) % DCT_COS_TABLE_SIZE] * DCT_COS_TABLE[((2 * j + 1) * v) % DCT_COS_TABLE_SIZE];
            }
        }
        matrix_out[u][v] = cu * cv * sum;
    }
}
```

The first step is to move the cu assignment to the outer loop, which eliminates 7 redundant calculations of cu per outer iteration.

Secondly, the sum calculation can be refactored to look up the cosine depending on u once per i iteration, leaving only the v-dependent lookup in the innermost loop.

Applying these changes gives the following code:

```c
for (u = 0; u < DCT_SIZE; u++) {
    cu = u == 0 ? INV_SQRTDCT_SIZE : SQRT2_INV_SQRTDCT;
    for (v = 0; v < DCT_SIZE; v++) {
        cv = v == 0 ? INV_SQRTDCT_SIZE : SQRT2_INV_SQRTDCT;
        sum = 0;
        for (i = 0; i < DCT_SIZE; i++) {
            cos_u = DCT_COS_TABLE[((2 * i + 1) * u) % DCT_COS_TABLE_SIZE];
            for (j = 0; j < DCT_SIZE; j++) {
                cos_v = DCT_COS_TABLE[((2 * j + 1) * v) % DCT_COS_TABLE_SIZE];
                sum += matrix_in[i][j] * cos_u * cos_v;
            }
        }
        matrix_out[u][v] = cu * cv * sum;
    }
}
```

After running the changes in the simulation, the performance improved to 26965608 cycles.

#### Flattening arrays

Flattening arrays is the process of storing a multidimensional array in a single dimension.

This creates a contiguous, less jagged memory layout, leading to better cache performance and predictability.

It is also necessary for the upcoming vectorisation and for compiler optimisations.

The first step is to slightly change the data generation so that each block is a one-dimensional array.

The memory allocation, memory deallocation and data generation now look like this:
```c
element_t **generate_mock_matrices() {
    element_t **mock_matrices = (element_t **) malloc(TOTAL_DCT_BLOCKS * sizeof(element_t *));
    for (int i = 0; i < TOTAL_DCT_BLOCKS; i++) {
        mock_matrices[i] = (element_t *) malloc(DCT_SIZE * DCT_SIZE * sizeof(element_t));
    }

    populate_mock_matrices(mock_matrices);
    return mock_matrices;
}

void free_mock_matrices(element_t **mock_matrices) {
    for (int i = 0; i < TOTAL_DCT_BLOCKS; i++) {
        free(mock_matrices[i]);
    }
    free(mock_matrices);
}

void populate_mock_matrices(element_t **mock_matrices) {
    for (long i = 0; i < TOTAL_DCT_BLOCKS; i++) {
        for (int j = 0; j < DCT_SIZE; j++) {
            for (int k = 0; k < DCT_SIZE; k++) {
                mock_matrices[i][j * DCT_SIZE + k] = j + k;
            }
        }
    }
}
```

The next step is to change the signature of the kernel function and adjust the array indexing accordingly.

```c
void dct_2d(element_t *matrix_in, element_t *matrix_out) {
    real_t cu, cv, sum, cos_u, cos_v;
    int u, v, i, j;

    for (u = 0; u < DCT_SIZE; u++) {
        cu = u == 0 ? INV_SQRTDCT_SIZE : SQRT2_INV_SQRTDCT;
        for (v = 0; v < DCT_SIZE; v++) {
            cv = v == 0 ? INV_SQRTDCT_SIZE : SQRT2_INV_SQRTDCT;
            sum = 0;
            for (i = 0; i < DCT_SIZE; i++) {
                cos_u = DCT_COS_TABLE[((2 * i + 1) * u) % DCT_COS_TABLE_SIZE];
                for (j = 0; j < DCT_SIZE; j++) {
                    cos_v = DCT_COS_TABLE[((2 * j + 1) * v) % DCT_COS_TABLE_SIZE];
                    sum += matrix_in[i * DCT_SIZE + j] * cos_u * cos_v;
                }
            }
            matrix_out[u * DCT_SIZE + v] = cu * cv * sum;
        }
    }
}
```

Not only does this enable further optimisations, but the performance also improved to 23667310 cycles.

#### Vectorisation

#### Changing data types

#### Compiler optimisations