Systematic Gaming

November 23, 2008

Patterns in Performance – Cache Manipulation


With modern processors greatly outpacing the speed of memory, we need to use the memory cache properly to achieve high performance.  Whenever we touch a memory address, the cache line containing it is pulled into the cache from main memory.  This is a slow operation, taking hundreds of cycles on modern CPUs.

To address this problem we’ll look at another pattern in performance: cache manipulation.  Instead of waiting for the memory controller to fetch memory at the time we access it, we tell the memory controller to fetch the memory in advance.  By the time we need the memory it has already been placed in the cache.

[EDIT: Updated to more correctly show load-wait timings.  Loads cause stalls when the loaded data is first accessed, not when the load instruction is invoked (which is what was previously implied).]

Let’s look at a simple loop to illustrate this effect.

Matrix a[n], b[n];
for (int i = 0; i < n; i++ )
{
   a[i] *= b[i];
}

If we take a closer look at the inside of the loop, we see how reading memory affects our performance.

[Figure caching_loop2: execution timeline of the loop, showing each iteration stalling while a[i] and b[i] are fetched into the cache]

These waits are total performance killers.  On modern CPUs you can spend dozens to hundreds of cycles waiting for memory that’s not already in the L1 cache.  Since we’re iterating through two arrays there isn’t much reuse of memory, so we don’t make great use of the cache.  We pretty much guarantee a couple of L1 cache misses every iteration or two, depending on how many elements fit in a cache line.

We’ll be looking at performance patterns that help us reduce or even eliminate memory stalls in our code.

Unfortunately most cache access instructions have various alignment restrictions and fetch different amounts of memory depending on the target architecture.  Some architectures also allow you to manipulate the cache at higher levels (L2 or higher), which can be even more efficient.  However, the general patterns of cache manipulation remain the same in most cases.  To keep the samples simple, I’m going to assume we have cache manipulation methods that work on a single cache line basis.
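
For concreteness, here’s a minimal sketch of what the prefetch helper used in the samples might look like on a GCC-style compiler – __builtin_prefetch is a real GCC/Clang builtin, but the line size here is just an assumed value that varies by CPU.

const int CACHE_LINE = 128;  // assumed cache line size in bytes - varies by CPU

inline void prefetch( const void *p )
{
   __builtin_prefetch( p );  // request the cache line containing p
}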

Pattern: Data Prefetch

Our first pattern is the most basic – we simply request memory before we need it, hiding the latency.

Matrix a[n], b[n];
for (int i = 0; i < n - 1; i++ )
{
   prefetch( &a[i + 1] );  // request the next iteration's data now
   prefetch( &b[i + 1] );
   a[i] *= b[i];           // work on this iteration while it loads
}
a[n - 1] *= b[n - 1];      // last element - nothing left to prefetch

Now we’re requesting memory for iteration i+1 during iteration i.  The idea is that by the time we’re done calculating a[i] *= b[i], the memory for a[i+1] and b[i+1] is in the cache and ready for use.  We may still stall when we first access a[0] and b[0], and more importantly we need to be careful about how we end the loop – which is why the last element is handled outside it.  Our execution should now look something like this:

[Figure caching_prefetch21: execution timeline with prefetching – loads for the next iteration overlap the current iteration’s calculation]

The wait may or may not be completely removed.  The prefetch instruction causes the memory to be requested earlier, but the request still may not have finished by the time the next loop iteration executes.  The total wait time should be reduced by at least the time taken by the calculation.  Sometimes the wait cannot be reduced much even with prefetching; this is especially true when the operation on the data is light – such as summing two arrays.  In that case we usually call the code memory bound, since performance is determined by how fast we can access memory.  In the opposite case we call the code instruction bound.
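
When a single iteration of work is too short to hide the latency, one common variation is to prefetch several iterations ahead instead of just one.  A minimal sketch, where DIST is a hypothetical tuning value that has to be measured per platform:

Matrix a[n], b[n];
const int DIST = 4;  // assumed prefetch distance - tune per platform
for (int i = 0; i < n; i++ )
{
   if (i + DIST < n)
   {
      prefetch( &a[i + DIST] );  // request data several iterations out
      prefetch( &b[i + DIST] );
   }
   a[i] *= b[i];
}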

Our array is valid for length n – what happens if we prefetch past that address?  It pretty much depends on the architecture: the prefetch may simply be ignored, or it may crash with an invalid address exception.  So to be on the safe side we should avoid prefetching addresses we’re unsure of.  We also need to be careful that our data is aligned and fits in a cache line – otherwise we may need to prefetch more than one cache line at a time.
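
For objects that span several lines, a hypothetical helper built on the prefetch above could cover the whole range:

// Prefetch every cache line covered by an object of 'size' bytes,
// since a single prefetch only pulls in one line.
inline void prefetch_range( const void *p, int size )
{
   const char *c = (const char *)p;
   for (int off = 0; off < size; off += CACHE_LINE)
      prefetch( c + off );
}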

Pattern: Write-only Data

Generating output from one or more inputs is one of the most common programming patterns – it’s pretty much the primary purpose of computers.  However, when the output buffer is never read from – a fairly common case – standard cache usage is quite inefficient.  Let’s look at a simple example:

Matrix a[n], b[n];
Matrix v;  // some fixed transform applied to every element
for (int i = 0; i < n; i++ )
{
   a[i] = b[i] * v;
}

Here’s how the cache behaves in this case:

[Figure caching_write: cache behavior of the write loop – each line of a[] is read from main memory just before being completely overwritten]

The data for a[] is read into the cache, just to be overwritten.  This is a big waste of memory bandwidth.  So why do we even read in a[] in the first place?  Basically, whenever we touch memory the cache is filled from main memory.  The way to prevent this read is to establish the cache line for a[]’s address without reading any data.  For example, the POWER architecture’s Data Cache Block Zero (dcbz) instruction sets the cache line for a given address to zero – putting the address in the cache and zeroing it directly, without fetching anything from main memory.
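
A minimal sketch of the clearcache helper used below, assuming a POWER target – dcbz itself is real, but the inline-assembly spelling and the fallback are assumptions:

// Hypothetical wrapper around dcbz; on other architectures it falls
// back to doing nothing (the write still works, just without the
// bandwidth saving).
inline void clearcache( void *p )
{
#if defined(__powerpc__) || defined(__ppc__)
   __asm__ volatile ( "dcbz 0,%0" : : "r"(p) : "memory" );
#else
   (void)p;
#endif
}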

Matrix a[n], b[n];
for (int i = 0; i < n; i++ )
{
   clearcache( &a[i] );  // establish a[i]'s cache line without reading it
   a[i] = b[i] * v;
}

This prevents the memory at a[] from being read, halving the amount of data read.  Our cache usage should now look something like this:

[Figure caching_write_zero: cache behavior with cache-line zeroing – a[]’s lines are established in the cache without any read from main memory]

Pretty similar, but instead of loading a[] from memory we put its address into the cache directly, eliminating the unnecessary memory read.  We’ve also eliminated any memory stall that may have been caused by waiting for a[] to be read into the cache.

This assumes that a != b, and that a and b don’t overlap.  The C99 restrict keyword (commonly available in C++ as the __restrict extension) is meant to inform the compiler of this requirement, but there is still no way to declare a write-only buffer.  A compiler should be able to avoid issuing load instructions for a[], but sometimes they still do, even when using the restrict keyword.  I’m unaware of any compiler smart enough to insert cache-clearing instructions – due to alignment and size restrictions it may not even be possible in C/C++.
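
As a sketch of the aliasing declaration (using the common __restrict extension, since standard C++ has no restrict keyword):

// Promise the compiler the buffers don't alias, which may allow it
// to drop the redundant loads of a[].
void Scale( Matrix * __restrict a, const Matrix * __restrict b,
            const Matrix &v, int n )
{
   for (int i = 0; i < n; i++ )
      a[i] = b[i] * v;
}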

Of course we can also combine this pattern with data prefetching – reading b[] ahead of time to remove stalls, while the cache clearing removes the unnecessary reads of a[].
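
Here’s a sketch of the combined loop, using the hypothetical prefetch and clearcache helpers from above:

Matrix a[n], b[n];
for (int i = 0; i < n; i++ )
{
   if (i + 1 < n)
      prefetch( &b[i + 1] );  // next iteration's input
   clearcache( &a[i] );       // output line: allocate without reading
   a[i] = b[i] * v;
}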

Summary

Caching patterns can usually be applied whenever future memory access is predictable.  Our samples use arrays for simplicity, but the data prefetch and write-only data patterns are applicable to any data structure (lists, trees, etc.) where future memory accesses can be predicted.

Additionally, there is a whole range of other cache optimizations; however, they are often dependent on the underlying hardware architecture.  The above patterns are usually possible in some form on most modern CPUs and console hardware.

Further Reading

GCC documentation describing prefetch mechanisms on various architectures.

A detailed article on memory for programmers, part of a series of articles on memory.

IBM documentation for the POWER series of CPUs describing how to zero cachelines.
