Lab: Introduction to Loop Transformations

This lab provides hands-on experience with loop transformations and a feel for the challenges that compiler writers face. You will apply a few loop transformations to a simple loop nest and experiment with various parameters associated with the transformations.

Setup

cd ~/EJCP-Lab/

You should find a file named syr2k.c. This is the starting point. The code performs a dense linear algebra computation called symmetric rank-2k update. Essentially, it multiplies and adds elements of two matrices (A, B) and stores the results into another matrix (C).

The code contains a handful of functions, but you should only touch main and the kernel during this lab.

Task Outline

  1. Apply tiling with constant tile size.
  2. Apply tiling with parametric tile size.
  3. Try many different tile sizes and problem sizes to find "good" tile sizes.
  4. Apply loop permutation to take advantage of hardware prefetching.
  5. Compare performance of permuted version with tiled version.
  6. Parallelize the two versions using OpenMP pragmas.
  7. Try many different tile sizes for the parallelized tiled version.
  8. Compare performance of parallelized versions.

0. Setup Makefile

Create a Makefile to compile syr2k.c, and make sure that the program can be compiled and executed.

The bare minimum would be:

syr2k: syr2k.c
	gcc -O3 -o $@ $<
but note that you will be adding more files and possibly different options later on.
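
Since more variants will be added in later steps, a pattern rule keeps the Makefile from growing one rule per file. A sketch (the target names anticipate later steps; extend the list as you go):

```makefile
CFLAGS = -O3

# Targets accumulate as the lab progresses; add each new variant here.
TARGETS = syr2k

all: $(TARGETS)

# One generic rule covers every syr2k variant: $@ is the target,
# $< the first prerequisite (the matching .c file).
%: %.c
	gcc $(CFLAGS) -o $@ $<

clean:
	rm -f $(TARGETS)
```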

Try running with a small input size:
./syr2k 100 100

It should dump the content of the C array to stderr, and the execution time to stdout.

You can save the output to file by the following:
./syr2k 100 100 2> out.orig.100x100

which will come in handy later on to verify that you didn't break the code.
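
Once you start producing transformed versions (in the later steps), diff makes this check mechanical. A sketch, assuming the reference file saved above; syr2k-tiled is the binary you will build in step 1, and the filenames are only examples:

```shell
# Dump the transformed version's C array and compare it to the reference.
./syr2k-tiled 100 100 2> out.tiled.100x100 || true
if diff -q out.orig.100x100 out.tiled.100x100 > /dev/null 2>&1; then
  echo "output matches the original"
else
  echo "output differs (or a file is missing)"
fi
```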

1. Apply tiling with constant tile size

  1. Copy syr2k.c to a new file named syr2k-tiled.c. This will be the file to be modified in this step.
  2. Update the Makefile to compile syr2k-tiled.
  3. Tile the 3D loop nest in the kernel function. Let the tile size be 8.
    1. You first want to create a 3D loop nest that will visit the original iteration space in a very sparse manner. This is called the tile loop, which visits the lower-left corner of each tile (called tile origin). For example, if the original loop was the following:
      for (i=0; i<N; i++)
         for (j=0; j<N; j++)
      		...
      
      The desired tile loop looks like this:
      for (ti=0; ti<N; ti+=8)
         for (tj=0; tj<N; tj+=8)
      		...
      
      Note that the iterator names are prefixed with t (ti, tj) to distinguish them from the originals.
    2. Next, you will create another 3D loop nest that will visit all points in a tile, given a tile origin. This is called the point loop, which visits a much smaller space than the original loop nest. With the same example, the desired point loop looks like this:
      for (i=ti; i<ti+8; i++)
         for (j=tj; j<tj+8; j++)
      		...
      
      The above visits an 8x8 region whose lower-left corner is the tile origin (ti,tj).
    3. Now you can put the two loops together to get a tiled loop nest.
      for (ti=0; ti<N; ti+=8)
        for (tj=0; tj<N; tj+=8)
          for (i=ti; i<ti+8; i++)
            for (j=tj; j<tj+8; j++)
              ...
      
      The above loop nest should visit the same set of iteration points as the original loop, but in a different order. Apply the same sequence of transformations to the 3D loop nest in syr2k-tiled.c.
    4. However, there is a bug in the point loops above. Try to identify and fix it.

      HINT: the bug does not manifest when N is a multiple of 8.
    5. (Optional) As an additional thought exercise, think about applying tiling to more complex loop nests. What if the j loop used values of i like:
      for (i=0; i<N; i++)
         for (j=i; j<N; j++)
      
  4. Compile and test against the output of the original code. (This step will not be explicitly stated from now on, but it should be done for any transformation.)
  5. Run with relatively large problem sizes 1000+ and compare performance.

2. Apply tiling with parametric tile size

  1. Copy syr2k.c to a new file named syr2k-ptiled.c. This will be the file to be modified in this step.
  2. Update the Makefile to compile syr2k-ptiled.
  3. Apply the same transformations to the loop nest as the previous step, except that you make the tile sizes parametric. For this particular loop nest, you can simply replace all occurrences of 8 with some symbol. Use SI, SJ, SK as the tile size for each dimension.
  4. Modify the function signature of the kernel to accept the tile size parameters, and update the main to accept 3 additional parameters (SI, SJ, SK) using argc/argv.
    (If you are not familiar enough with C to do this step, ask your neighbors or the instructor.)

3. Try many different tile sizes and problem sizes to find "good" tile sizes

With the parameterized version, it is much easier to explore the effect of tile sizes. First check that running the parametric version with 8x8x8 tiles gives performance similar to the constant version. Then try different tile sizes; make sure to change the "shape" (i.e., use different sizes for each dimension), and also play with problem sizes.

A recommended strategy for exploration is as follows:
  1. Select a problem size where N=M that takes more than 30 seconds with the original code (syr2k.c).
  2. Run the tiled version with the same problem size and various tile sizes. In this step, restrict yourself to the case where SI=SJ=SK. Try different sizes, from 2x2x2 up to some large tile size that performs much worse than smaller ones.
  3. Plot the result and find where the optimal cubic tile size is.
  4. Explore non-cubic tiles (where SI, SJ, SK are not all equal) to try and find a combination of sizes that is better.
Try the above for two other problem sizes, one each for cases when N>M and N<M.
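
A small script makes the cubic sweep of step 2 less tedious. A sketch, assuming syr2k-ptiled takes the arguments N M SI SJ SK as set up in the previous step; the problem size and the size list are placeholders to adjust for your machine:

```shell
# Sweep over cubic tile sizes; stderr (the array dump) is discarded,
# so only the timing printed to stdout remains.
N=2000; M=2000
for s in 2 4 8 16 32 64 128; do
  echo "tile ${s}x${s}x${s}:"
  ./syr2k-ptiled "$N" "$M" "$s" "$s" "$s" 2> /dev/null || true
done
```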

(Optional) The thought exercise here is to think about why certain tile sizes perform better than others. In our case, it is likely tied to the behavior of the Last Level Cache (or L1 cache) of the processor. Although this is a bit tricky to observe through virtual machines, can you find some trend or a "rule of thumb"?

4. Apply loop permutation to take advantage of hardware prefetching

  1. Copy syr2k.c to a new file named syr2k-perm.c. This will be the file to be modified in this step.
  2. Update the Makefile to compile syr2k-perm.
  3. Permute the 3D loop nest in the kernel function. Recall that the references are friendly to the hardware prefetcher if the innermost loop does not traverse different rows of a matrix.
  4. See if you get a performance improvement. On most machines today you should, so if you don't, you have likely chosen the wrong permutation.

5. Compare performance of permuted version with tiled version

How do they compare? The expectation is that the tiled version is slightly worse.

6. Parallelize the two versions using OpenMP pragmas

How the parallelization works and the details of OpenMP are outside the scope of this lab. The two versions (ptiled/perm) are extremely easy to parallelize, and do not even require tiling to be parallelized. However, different transformations still impact parallel performance. Apply the following changes to parallelize your code.

  1. Copy syr2k-ptiled.c to a new file named syr2k-ptiled-omp.c.
  2. Add the following line just above the outermost loop in the kernel function.
    #pragma omp parallel for private(ti,tj,tk,i,j,k)
    
    (You may need to adjust the loop iterator names to match your code.)
  3. Copy syr2k-perm.c to a new file named syr2k-perm-omp.c.
  4. Add the following line just above the outermost loop in the kernel function.
    #pragma omp parallel for private(i,j,k)
    
    (You may need to adjust the loop iterator names to match your code.)
  5. You also need to compile with two additional options:
    	gcc -O3 -o $@ $< -fopenmp -lgomp
    
  6. Now both of your programs should use 2 cores when you execute them. (The VM is configured to have only 2 cores.)

7. Try many different tile sizes for the parallelized tiled version

Repeat what you did in Step 3 with the parallel versions using 2 cores.
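
To make the core count explicit rather than relying on the VM's configuration, set the standard OpenMP environment variable before running. The problem and tile sizes below are only placeholders:

```shell
# OMP_NUM_THREADS fixes the number of OpenMP threads for this shell.
export OMP_NUM_THREADS=2
./syr2k-ptiled-omp 2000 2000 16 16 16 2> /dev/null || true
```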

8. Compare performance of parallelized versions

What did you observe?

Extra Exercises

If you are done and hungry for more exercises, there are two additional codes in the extra directory.