EJCP 2017 Lab: Introduction to Loop Transformations

Lab: Introduction to Loop Transformations

This page summarizes the key messages that I wanted to covey through the lab. It is designed for someone who were unable to complete it during EJCP, or those who found this page later on and are working on it on your own.

Completed Codes

You can find the completed source files here.

Objectives of the Lab

Apply tiling with constant tile size.

This step is to make sure you can apply tiling to a simple example. There are some subtle points where you can make mistakes, and it is best to start with a simple case with constant tile size. I selected 8x8x8 tile size because it will probably give significant performance improvement, but will not be anywhere near optimal in most processors these days.
Apply tiling with parametric tile size.

Now that you are familiar with tiling this program, making the tile size parametric is an important precursor to the next step. Imagine performing all the explorations in the subsequent steps without parameterized code!
Try many different tile sizes and problem sizes to find "good" tile sizes.

In this step, you should find how sensitive the performance of tiled code is to tile sizes. The optimal tile size is different for each of your machine and for different problem sizes. Although the performance with respect to the tile size usually are continuous, there are often some spikes that are difficult to explain. In addition, many people in the optimization world find this part very "fun", and I would be very glad if I was able to share some of it.
As for the plotting, what I expect for many architecture is as follows:
- With very small tile sizes, the performance is really bad, close to the original.
- Once the tile sizes become large enough (the value depends on the platform), you will see drastic improvements in performance.
- As you keep increasing the tile size, the performance should gradually improve.
- However, at some larger tile size, the performance will drop all the way back to the level of original non-tiled code. This is because the tile size is too large, so that the memory footprint of a tile no longer fits in cache.
(the above is a simplified case when SI=SJ=SK).

Correctly predicting the right tile size is very difficult as it depends a lot on both the program and the architecture. See Wikipedia entry on CPU Cache, especially about cache misses, or search for "cache conflict miss", "cache interference miss" to find materials about them.

Apply loop permutation to take advantage of hardware prefetching.

This step is to go back to the permutation example that I showed in the beginning of the lecture. For most modern processors, it should give significant performance improvement. Because the HW prefetcher is able to hide almost all cache miss penalties for this code, the performance you get is very close to what you get by tiling and finding the optimal tile size.

Compare performance of permuted version with tiled version.

The expectation was that you are unable to beat the performance of the permuted code, since finding the optimal tile size is really difficult, and also because tiling adds some overhead.

Parallelize the two versions using OpenMP pragmas.
If I stopped the lab at the previous step, some of you may get the impression that tiling is useless. This misconception can easily be avoided by throwing in another complication: parallelism.
Try many different tile sizes for the parallelized tiled version.
Adding parallelism into the mix further complicates the process of tile size selection. There are certain levels of cache that are shared by the cores, adding more sources of interference. Now the tile sizes directly corresponds to the granularity of parallelization, influencing load balancing. Finding the good set of tile sizes for varying number of threads is really challenging.
Compare performance of parallelized versions.
The code with loop permutation should not get much better (or even get worse) after parallelization. This is because the permutation transformation is not really making the data locality better, and is only hiding the penalty of cache misses through prefetching. By adding more threads to performce the computation, the prefetcher won't be able to keep up with the amount of memory requests, and becomes incapable of hiding the latencies.

Tiling is actually improving the temporal locality, and reducing bandwidth requirement. Therefore, it scales very well (with large enough problem size), and will surpass the performance of permuted code in the parallel setting.

Although I had recommended to set OMP_NUM_THREADS to 2 during EJCP to keep things from becoming too complicated, you will find the difference between permutation and tiling much more clearly when using more threads and larger problem sizes.