# 2D DCT-II algorithm hardware-software codesign for image compression in RISC-V satellite processor

Hugo Mårdbrink, 2024

## Introduction

This project aims to design and implement a 2D DCT-II algorithm in hardware and software for image compression in a RISC-V satellite processor.

RISC-V processors are currently on the rise, [notably in the space industry](https://gaisler.com/index.php/products/processors/noel-v#DOC).

The 2D DCT-II algorithm transforms blocks of pixels into blocks of frequency coefficients. This enables compression by concentrating the image energy into a few low-frequency coefficients, removing spatial redundancy.

Thus, the algorithm is used in spacecraft to decrease the amount of image data that needs to be transmitted back to Earth.
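
Concretely, for an N × N block x, the DCT-II computes coefficients X as follows (this is the formula the implementations below evaluate, with c corresponding to the `cu` and `cv` factors in the code):

```math
X_{u,v} = c(u)\,c(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x_{i,j}
\cos\!\left[\frac{(2i+1)\,u\,\pi}{2N}\right]
\cos\!\left[\frac{(2j+1)\,v\,\pi}{2N}\right],
\qquad
c(k) = \begin{cases} 1/\sqrt{N} & k = 0 \\ \sqrt{2}/\sqrt{N} & k > 0 \end{cases}
```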

## Goals

Since power and hardware budgets on a satellite are tightly constrained, the design needs to focus on energy efficiency and a small hardware area.

This shifts the focus of the codesign towards energy efficiency rather than raw throughput or execution time.

However, fast execution is still highly relevant, and a good balance between the two needs to be explored.

Notably, current RISC-V space processors have no vector processing units, making vectorisation an interesting aspect to evaluate.

## Method

### Development and evaluation

The software will be written in C and built with the GCC RISC-V cross-compiler. For vectorisation, RVV vector intrinsics will be used via a C header file.

For parallelisation, the [OpenMP library](https://www.openmp.org/) will be used.

To test and evaluate the software implementation, it will be run in the gem5 simulator. The hardware configuration is likewise done in gem5 configuration files.

The mock image data will be generated in C with arbitrary values. The actual values do not matter, since they do not affect the run time of the transform.

When measuring performance, the sequential time spent generating mock data and freeing memory will be deducted, so that the measurement reflects the transform itself.

For the parts where the problem size increases, performance will be measured in cycles per DCT block.

### Building

To build the software, the following base command is used:

```bash
riscv64-unknown-elf-gcc -march=rv64imadcv -mabi=lp64d main.c -o dct2d_riscv.out
```

The following flags will be added based on what functionality is needed:

- `-lm` to link the math library
- `-fopenmp` to enable OpenMP
- `-O[level]` for different optimisation levels
- `-march=rv64imadcv` for the RISC-V ISA extensions (including vector)
- `-mabi=lp64d` for the RISC-V ABI
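
For instance, a parallel build that links the math library at a mid optimisation level might combine these flags as follows (the choice of `-O2` here is illustrative, not a fixed project setting):

```shell
riscv64-unknown-elf-gcc -march=rv64imadcv -mabi=lp64d -O2 -fopenmp main.c -o dct2d_riscv.out -lm
```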

### Simulating

To simulate the software on different hardware configurations, gem5 is used.

gem5 allows different hardware configurations to be tested using a Python script.

The Python script is tailored specifically to this project and, for ease of use, exposes six custom parameters:

- `--l1i` for the L1 instruction cache size
- `--l1d` for the L1 data cache size
- `--l2` for the L2 cache size
- `--vlen` for the vector length
- `--elen` for the element length
- `--cores` for the number of cores

To run the simulation and output the result, the following command is used:

```bash
../gem5/build/RISCV/gem5.opt -d stats/ ./riscv_hw.py --l1i 16kB --l1d 64kB --l2 256kB --vlen 256 --elen 64 --cores 1
```

## Implementation

### Initial hardware configuration

For the initial, naive software implementation, the following hardware configuration is used:

- L1 instruction cache size: 16kB
- L1 data cache size: 64kB
- L2 cache size: 256kB
- Vector length: 256 bits
- Element length: 64 bits
- Number of cores & threads: 1
- L1 cache associativity: 2
- L2 cache associativity: 8

This is by no means an optimal configuration, but to find the optimum we first need to optimise the software implementation. That makes this configuration good enough as a baseline.

### Constants and definitions

Throughout the code, several constants and type definitions are used to make it easy to try different configurations:

- `DCT_SIZE` is the size of the DCT block.
- `TOTAL_DCT_BLOCKS` is the total number of DCT blocks and, thus, the problem size.
- `NUM_THREADS` is the number of threads to use for parallelisation.
- `element_t` is the data type of the elements in a DCT block.
- `real_t` is the data type of the real-valued variables of the algorithm.

### Mock data generation

To start testing the algorithm, we need a way to generate data that gives reliable performance results.

This is done by allocating the DCT blocks on the heap and filling them with data.

It is important to actually generate all the blocks rather than reuse the same matrices, in order to get realistic cache hits and misses.

The memory allocation is done in the following way:
```c
element_t ***mock_matrices = (element_t ***) malloc(TOTAL_DCT_BLOCKS * sizeof(element_t **));
for (int i = 0; i < TOTAL_DCT_BLOCKS; i++) {
    mock_matrices[i] = (element_t **) malloc(DCT_SIZE * sizeof(element_t *));
    for (int j = 0; j < DCT_SIZE; j++) {
        mock_matrices[i][j] = (element_t *) malloc(DCT_SIZE * sizeof(element_t));
    }
}
```

And the data is generated using the following code:

```c
for (long i = 0; i < TOTAL_DCT_BLOCKS; i++) {
    for (int j = 0; j < DCT_SIZE; j++) {
        for (int k = 0; k < DCT_SIZE; k++) {
            mock_matrices[i][j][k] = j + k;
        }
    }
}
```

### Naive

The first iteration of the code (see below) uses a naive implementation of DCT-II, with no optimisations.

```c
void dct_2d(element_t **matrix_in, element_t **matrix_out) {
    real_t cu, cv, sum;
    int u, v, i, j;

    for (u = 0; u < DCT_SIZE; u++) {
        for (v = 0; v < DCT_SIZE; v++) {
            cu = u == 0 ? 1 / sqrt(DCT_SIZE) : sqrt(2) / sqrt(DCT_SIZE);
            cv = v == 0 ? 1 / sqrt(DCT_SIZE) : sqrt(2) / sqrt(DCT_SIZE);

            sum = 0;
            for (i = 0; i < DCT_SIZE; i++) {
                for (j = 0; j < DCT_SIZE; j++) {
                    sum += matrix_in[i][j] * cos((2 * i + 1) * u * PI / (2 * DCT_SIZE))
                                           * cos((2 * j + 1) * v * PI / (2 * DCT_SIZE));
                }
            }
            matrix_out[u][v] = cu * cv * sum;
        }
    }
}
```

This version serves as a baseline for further optimisations; simulating it yielded a performance of 62977442 cycles.

### Software optimisations

#### Compile time constants

Looking at the naive implementation, we can see some low-hanging fruit that can easily be optimised by evaluating constants at compile time.

Firstly, we can precompute the values of 1/sqrt(DCT_SIZE) and sqrt(2)/sqrt(DCT_SIZE), avoiding costly runtime calls to sqrt().

After doing this we get the following constants (for DCT_SIZE = 8):

```c
#define INV_SQRTDCT_SIZE (real_t) 0.3535533906
#define SQRT2_INV_SQRTDCT (real_t) 0.5
```

Because PI / (2 * DCT_SIZE) is a constant, we can precompute cos() at all 32 multiples of it (indices 0 to 31), which covers a full period of the cosine.

These values can then be stored in an array to eliminate the runtime calculations. This is also done at compile time, in the following way:

```c
#define DCT_COS_TABLE_SIZE 32

#define DCT_COS_TABLE (double[DCT_COS_TABLE_SIZE]) { \
    1, 0.980785, 0.92388, 0.83147, 0.707107, 0.55557, 0.382683, \
    0.19509, 0, -0.19509, -0.382683, -0.55557, -0.707107, -0.83147, \
    -0.92388, -0.980785, -1, -0.980785, -0.92388, -0.83147, -0.707107, \
    -0.55557, -0.382683, -0.19509, 0, 0.19509, 0.382683, 0.55557, \
    0.707107, 0.83147, 0.92388, 0.980785 }
```

This changes the way the sum is calculated to the following:

```c
sum += matrix_in[i][j] * DCT_COS_TABLE[((2 * i + 1) * u) % DCT_COS_TABLE_SIZE] * DCT_COS_TABLE[((2 * j + 1) * v) % DCT_COS_TABLE_SIZE];
```

Having eliminated unnecessary calculations, we can hoist some computations into outer loops to reduce redundant work.

These are found in the inner loops of the algorithm: values that should only be calculated once per outer-loop iteration are instead recalculated in the inner loops, leading to redundant operations.

```c
for (u = 0; u < DCT_SIZE; u++) {
    for (v = 0; v < DCT_SIZE; v++) {
        cu = u == 0 ? 1 / sqrt(DCT_SIZE) : sqrt(2) / sqrt(DCT_SIZE);
        cv = v == 0 ? 1 / sqrt(DCT_SIZE) : sqrt(2) / sqrt(DCT_SIZE);

        sum = 0;
        for (i = 0; i < DCT_SIZE; i++) {
            for (j = 0; j < DCT_SIZE; j++) {
                sum += matrix_in[i][j] * DCT_COS_TABLE[((2 * i + 1) * u) % DCT_COS_TABLE_SIZE] * DCT_COS_TABLE[((2 * j + 1) * v) % DCT_COS_TABLE_SIZE];
            }
        }
        matrix_out[u][v] = cu * cv * sum;
    }
}
```

The first step is to move the cu assignment to the outer loop, which eliminates 7 redundant calculations of cu per outer iteration.

Secondly, the sum calculation can be refactored to look up the cosine depending on u once per i iteration, leaving only the v-dependent lookup in the innermost loop.

Applying these changes gives the following code:

```c
for (u = 0; u < DCT_SIZE; u++) {
    cu = u == 0 ? INV_SQRTDCT_SIZE : SQRT2_INV_SQRTDCT;
    for (v = 0; v < DCT_SIZE; v++) {
        cv = v == 0 ? INV_SQRTDCT_SIZE : SQRT2_INV_SQRTDCT;
        sum = 0;
        for (i = 0; i < DCT_SIZE; i++) {
            cos_u = DCT_COS_TABLE[((2 * i + 1) * u) % DCT_COS_TABLE_SIZE];
            for (j = 0; j < DCT_SIZE; j++) {
                cos_v = DCT_COS_TABLE[((2 * j + 1) * v) % DCT_COS_TABLE_SIZE];
                sum += matrix_in[i][j] * cos_u * cos_v;
            }
        }
        matrix_out[u][v] = cu * cv * sum;
    }
}
```

After running the changes in the simulation, the performance improved to 26965608 cycles.

#### Flattening arrays

Flattening arrays is the process of storing a multidimensional array in a single dimension.

This creates a contiguous, less jagged memory layout, leading to better cache performance and predictability.

It is also necessary for the upcoming vectorisation and for compiler optimisations.

The first step is to slightly change the data generation so that each block is a one-dimensional array.

The memory allocation, memory deallocation and data generation now look like this:
```c
element_t **generate_mock_matrices() {
    element_t **mock_matrices = (element_t **) malloc(TOTAL_DCT_BLOCKS * sizeof(element_t *));
    for (int i = 0; i < TOTAL_DCT_BLOCKS; i++) {
        mock_matrices[i] = (element_t *) malloc(DCT_SIZE * DCT_SIZE * sizeof(element_t));
    }

    populate_mock_matrices(mock_matrices);
    return mock_matrices;
}

void free_mock_matrices(element_t **mock_matrices) {
    for (int i = 0; i < TOTAL_DCT_BLOCKS; i++) {
        free(mock_matrices[i]);
    }
    free(mock_matrices);
}

void populate_mock_matrices(element_t **mock_matrices) {
    for (long i = 0; i < TOTAL_DCT_BLOCKS; i++) {
        for (int j = 0; j < DCT_SIZE; j++) {
            for (int k = 0; k < DCT_SIZE; k++) {
                mock_matrices[i][j * DCT_SIZE + k] = j + k;
            }
        }
    }
}
```

The next step is to change the signature of the kernel function and adjust the array indexing accordingly.

```c
void dct_2d(element_t *matrix_in, element_t *matrix_out) {
    real_t cu, cv, sum, cos_u, cos_v;
    int u, v, i, j;

    for (u = 0; u < DCT_SIZE; u++) {
        cu = u == 0 ? INV_SQRTDCT_SIZE : SQRT2_INV_SQRTDCT;
        for (v = 0; v < DCT_SIZE; v++) {
            cv = v == 0 ? INV_SQRTDCT_SIZE : SQRT2_INV_SQRTDCT;
            sum = 0;
            for (i = 0; i < DCT_SIZE; i++) {
                cos_u = DCT_COS_TABLE[((2 * i + 1) * u) % DCT_COS_TABLE_SIZE];
                for (j = 0; j < DCT_SIZE; j++) {
                    cos_v = DCT_COS_TABLE[((2 * j + 1) * v) % DCT_COS_TABLE_SIZE];
                    sum += matrix_in[i * DCT_SIZE + j] * cos_u * cos_v;
                }
            }
            matrix_out[u * DCT_SIZE + v] = cu * cv * sum;
        }
    }
}
```

Not only does this enable further optimisations, but the performance also improved to 23667310 cycles.

#### Vectorisation

#### Changing data types

#### Compiler optimisations