Lab: Introduction to Loop Transformations
This lab aims to provide some hands-on experience with loop transformations
and give a feel for the challenges that compiler writers encounter.
You will apply a few loop transformations to a simple loop nest and experiment
with various parameters associated with the transformations.
Setup
cd ~/EJCPLab/
You should find a file named syr2k.c. This is the starting point.
The code performs a dense linear algebra computation called symmetric rank 2k update.
Essentially, it multiplies and adds elements of two matrices (A, B)
and stores the results into another matrix (C).
This code contains the following functions:
 main: takes two inputs N M that define the size of the problem.
 kernel_syr2k: the main loop nest to optimize
 init_array: initializes the array contents
 print_array: dumps the contents of the computed array
You should only touch main and the kernel during this lab.
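For reference, the symmetric rank 2k update computes C += A*Bᵀ + B*Aᵀ. The kernel you will optimize is a 3D loop nest along these lines — a sketch only, assuming row-major contiguous arrays; your syr2k.c may differ in details such as scaling factors and array layout:

```c
/* Sketch of a syr2k-style kernel: C += A*B^T + B*A^T.
   C is N x N; A and B are N x M; all arrays are row-major,
   stored as one contiguous block. */
static void kernel_syr2k(int N, int M,
                         const double *A, const double *B, double *C)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < M; k++)
                C[i*N + j] += A[i*M + k] * B[j*M + k]
                            + B[i*M + k] * A[j*M + k];
}
```

The i-j-k nest above is the shape that all the transformations in this lab operate on.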
Task Outline
 Apply tiling with constant tile size.
 Apply tiling with parametric tile size.
 Try many different tile sizes and problem sizes to find "good" tile sizes.
 Apply loop permutation to take advantage of hardware prefetching.
 Compare performance of permuted version with tiled version.
 Parallelize the two versions using OpenMP pragmas.
 Try many different tile sizes for the parallelized tiled version.
 Compare performance of parallelized versions.
0. Setup Makefile
Create a Makefile to compile syr2k.c
and make sure that the result can be compiled and executed.
The bare minimum would be:
syr2k: syr2k.c
	gcc -O3 -o $@ $<
but note that you will be adding more files and possibly different options later on.
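A slightly more complete Makefile might look like the following sketch — the extra target names are the ones introduced in later steps, so adjust them to whatever file names you actually use:

```make
CC     = gcc
CFLAGS = -O3

all: syr2k

syr2k: syr2k.c
	$(CC) $(CFLAGS) -o $@ $<

clean:
	rm -f syr2k
```

Remember that recipe lines in a Makefile must be indented with a tab character, not spaces.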
Try running with a small input size:
./syr2k 100 100
It should dump the content of the C array to stderr,
and the execution time to stdout.
You can save the output to a file as follows:
./syr2k 100 100 2> out.orig.100x100
which will come in handy later on to verify that you didn't break the code.
1. Apply tiling with constant tile size
 Copy syr2k.c to a new file named syr2ktiled.c.
This will be the file to be modified in this step.
 Update the Makefile to compile syr2ktiled
 Tile the 3D loop nest in the kernel function. Let the tile size be 8.

You first want to create a 3D loop nest that visits the original iteration
space in a very sparse manner. This is called the tile loop; it
visits the lower-left corner of each tile (called the tile origin).
For example, if the original loop was the following:
for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    ...
The desired tile loop looks like this:
for (ti=0; ti<N; ti+=8)
  for (tj=0; tj<N; tj+=8)
    ...
Note that I have prefixed the iterator names with t (ti, tj) to distinguish them from the original names.
Next, you will create another 3D loop nest that will visit all points in a tile, given a tile origin.
This is called the point loop, which visits a much smaller space than the original loop nest.
With the same example, the desired point loop looks like this:
for (i=ti; i<ti+8; i++)
  for (j=tj; j<tj+8; j++)
    ...
The above visits an 8x8 region with (ti,tj) being the lower-left corner of the visited space.

Now you can put the two loops together to get a tiled loop nest.
for (ti=0; ti<N; ti+=8)
  for (tj=0; tj<N; tj+=8)
    for (i=ti; i<ti+8; i++)
      for (j=tj; j<tj+8; j++)
        ...
The above loop nest should visit the same set of iteration points as the
original loop, but in a different order. Apply the same sequence of
transformations to the 3D loop nest in syr2ktiled.c.

However, there is a bug in the point loops above. Try to identify and
fix it.
HINT: the bug does not manifest when N is a multiple of 8.
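To see the hint concretely, you can count the iteration points visited by the original and the tiled 2D loop from the text. This sketch deliberately uses the buggy point-loop bounds as written above — with N a multiple of 8 the counts agree, so try N = 10 and watch them disagree:

```c
/* Count the points visited by the original 2D loop. */
int count_original(int N)
{
    int count = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            count++;
    return count;
}

/* Count the points visited by the tiled loop exactly as shown in
   the text (point-loop bounds unfixed). */
int count_tiled(int N)
{
    int count = 0;
    for (int ti = 0; ti < N; ti += 8)
        for (int tj = 0; tj < N; tj += 8)
            for (int i = ti; i < ti + 8; i++)
                for (int j = tj; j < tj + 8; j++)
                    count++;
    return count;
}
```

When the counts differ, the tiled version is visiting points outside the original iteration space — that is the bug you need to fix.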
 (Optional) As an additional thought exercise, think about applying tiling to more complex loop nests. What if the j loop used values of i like:
for (i=0; i<N; i++)
  for (j=i; j<N; j++)
 Compile and test against the output of the original code.
(This step will not be explicitly stated from now on, but it should be done for any transformation.)
 Run with relatively large problem sizes (1000+) and compare performance.
2. Apply tiling with parametric tile size
 Copy syr2k.c to a new file named syr2kptiled.c.
This will be the file to be modified in this step.
 Update the Makefile to compile syr2kptiled
 Apply the same transformations to the loop nest as the previous step, except that you make the tile sizes parametric.
For this particular loop nest, you can simply replace all occurrences of 8 with some symbol. Use SI, SJ, SK as the tile size for each dimension.

Modify the function signature of the kernel to accept the tile size parameters,
and update main to accept 3 additional arguments (SI, SJ,
SK) via argc/argv.
(If you are not familiar enough with C
to do this step, ask your neighbors or the instructor.)
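Argument handling in main could follow this pattern — a sketch only; the function name, argument order, and defaults are assumptions, so keep whatever conventions syr2k.c already uses:

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse N M [SI SJ SK] from the command line, falling back to
   8x8x8 tiles when the tile sizes are omitted.
   Returns 0 on success, 1 on a usage error. */
static int parse_args(int argc, char **argv,
                      int *N, int *M, int *SI, int *SJ, int *SK)
{
    *SI = *SJ = *SK = 8;            /* default tile sizes */
    if (argc < 3) {
        fprintf(stderr, "usage: %s N M [SI SJ SK]\n", argv[0]);
        return 1;
    }
    *N = atoi(argv[1]);
    *M = atoi(argv[2]);
    if (argc >= 6) {
        *SI = atoi(argv[3]);
        *SJ = atoi(argv[4]);
        *SK = atoi(argv[5]);
    }
    return 0;
}
```

main would call this once and pass SI, SJ, SK through to the kernel, e.g. kernel_syr2k(N, M, SI, SJ, SK, ...).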
3. Try many different tile sizes and problem sizes to find "good" tile sizes
With the parameterized version, it is much easier to explore the effect of tile
sizes. First check that running the parametric version with 8x8x8 tiles gives
similar performance as the constant version. Then try different tiles, make
sure to change the "shape" (i.e., use different size for each dimension), and
also play with problem sizes.
A recommended strategy for exploration is as follows:
 Select a problem size where N=M that takes more than 30 seconds with the
original code (syr2k.c).
 Run the tiled version with the same problem size and various tile sizes.
In this step, restrict yourself to the case where SI=SJ=SK.
Try different sizes from 2x2x2 up to some large tile size that performs much
more poorly than smaller sizes.
 Plot the result and find where the optimal cubic tile size is.
 Explore non-cubic tiles (where SI, SJ, SK are not all equal) to try to find a
combination of sizes that performs better.
Try the above for two other problem sizes, one each for cases when N>M and N<M.
(Optional) The thought exercise here is to think about why certain tile sizes
perform better than others. In our case, it is likely tied to the behavior of
the Last Level Cache (or the L1 cache) of the processor. Although it is a bit
tricky to observe through virtual machines, can you find some trend or a "rule of thumb"?
4. Apply loop permutation to take advantage of hardware prefetching
 Copy syr2k.c to a new file named syr2kperm.c.
This will be the file to be modified in this step.
 Update the Makefile to compile syr2kperm
 Permute the 3D loop nest in the kernel function.
Recall that the references are friendly to the HW prefetcher
if the innermost loop does not traverse different rows of a matrix.
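The intuition can be checked on a toy example: for a row-major array, an innermost loop over the last index walks memory contiguously, while swapping the loops makes every access stride by a whole row. This sketch shows the two traversal orders — both compute the same sum, so only the memory access pattern (and, on a large n, the running time) differs:

```c
/* Row-major n x n matrix: element (i,j) lives at a[i*n + j]. */

/* Innermost loop walks j: consecutive addresses, prefetcher-friendly. */
double sum_rowwise(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += a[i*n + j];
    return s;
}

/* Innermost loop walks i: a stride of n doubles between accesses,
   which defeats the hardware prefetcher on large matrices. */
double sum_colwise(const double *a, int n)
{
    double s = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += a[i*n + j];
    return s;
}
```

Your goal in this step is the analogous permutation: arrange the syr2k loops so that each innermost-loop reference moves along a row, not down a column.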
 See if you get a performance improvement. On most machines today you should, so if you don't, the permutation is likely wrong.
5. Compare performance of permuted version with tiled version
How do they compare? The expectation is that the tiled version is slightly worse.
6. Parallelize the two versions using OpenMP pragmas
For this lab, how the parallelization works and the details of OpenMP are outside the scope.
The two versions (ptiled/perm) are extremely easy to parallelize, and do not even require tiling to parallelize.
However, different transformations still impact parallel performance. Apply the following changes to parallelize your code.
 Copy syr2kptiled.c to a new file named syr2kptiledomp.c.
 Add the following line just above the outermost loop in the kernel function.
#pragma omp parallel for private(ti,tj,tk,i,j,k)
(You may need to adjust the loop iterator names to match your code.)
 Copy syr2kperm.c to a new file named syr2kpermomp.c.
 Add the following line just above the outermost loop in the kernel function.
#pragma omp parallel for private(i,j,k)
(You may need to adjust the loop iterator names to match your code.)
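The pragma placement looks like this on a generic loop nest — a sketch, not your kernel. Note that without -fopenmp the pragma is simply ignored and the code runs sequentially, which is handy for checking correctness:

```c
#define N 64

/* Scale a matrix: the outermost i loop is split across threads,
   and each thread gets private copies of the loop iterators. */
void scale(double a[N][N], double factor)
{
    int i, j;
    #pragma omp parallel for private(i, j)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] *= factor;
}
```

This loop is safe to parallelize because every iteration of the outer loop writes a disjoint row; the same independence argument is what makes your ptiled and perm kernels easy to parallelize.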

You also need to compile with two additional options:
gcc -O3 -o $@ $< -fopenmp -lgomp

Now both of your programs should use 2 cores when you execute them. (The VM is configured to have only 2 cores.)
7. Try many different tile sizes for the parallelized tiled version
Repeat what you did in Step 3 with the parallel versions using 2 cores.
8. Compare performance of parallelized versions
What did you observe?
Extra Exercises
If you are done and are hungry for more exercises, there are two additional
codes in the extra directory.
 trmm: triangular matrix multiply. The tricky parts are the
triangular iteration domain and the dependences caused by reusing arrays for both inputs
and outputs.
 mvt: two matrix-vector products, where one is transposed. The tricky part is that you need to fuse the two loop nests to tile the entire computation.