Systematic Gaming

December 23, 2008

Patterns in Performance: Cache Pressure

Filed under: game programming, optimization — Tags: , — systematicgaming @ 3:16 am

I’ve been mentioning memory and the cache a lot, because proper cache utilization really is a critical to good performance. This time we’ll look at cache pressure – a term that refers to overworking the memory cache with too many or wasteful memory accesses.  Basically putting pressure on the memory cache means you’re wasting time accessing more memory than you need to.  The main cause is inefficient data access, often by using data structures that are too large, or too sparse reducing any benefit from memory locality.

Let’s look at a common example – sorting.  Sorting is a computer science staple, and is very common in the real world.  A common usage in games is sorting draw calls.  There are a number of reasons to sort draws: front-to-back to optimize for Z rejection, back-to-front for alpha sorting and by shader or material to reduce state changes.

A naive implementation might look something like this:

class DrawCall
{
   int sortKey;
   bool operator< (const DrawCall &dc) { return sortKey < dc.sortKey; }

   int shaderIndex;
   Mesh *mesh;
   RenderState state;
   Color color;
   Matrix baseMatrix;
   // etc...
};

Basically we have a class that stores the information we need to draw a mesh and it’s sorting index or key.  We can simply sort our list of visible objects like so:

std::sort(drawCallList, drawCallList + numDrawCalls);

for (int i  = 0;  i < numDrawCalls; i++)
{
   drawCallList[i]->Draw();
}

Simple right? Yep.  Fast? Well, let’s look at some numbers.

Here’s the time in milliseconds it took to sort a similar 512 byte structure, for just up to just over a million items.

cache_pressure_5121

We don’t sort millions of items each frame in a normal game.  We do however often sort hundreds or a few thousand of draw calls in a single frame.  Our chart shows this will take 2-10ms to sort that many items.  That is a big chunk of time, and really is too slow to be useful.  We’re putting too much pressure on the cache. Or at least that’s what I claim.

Let’s take a more scientific approach and do an experiment or two.  My hypothesis is that our 512-byte object is too large and is putting too much pressure on the cache.  So by reducing the size of the object we’ll reduce the time required to sort.  Here are the results of sorting items of various size:

cache_pressure_sizes

Ok, that’s a pretty clear picture.  The larger the object we’re sorting the longer it takes, as predicted. But what’s really important is the impact the size has on performance – 4 byte objects (well, 4 bytes + a 4 byte sorting key) sort almost two magnitudes faster than 512 byte objects.

The reason for this is pretty simple.  At the center of any comparison based sort (like the std::sort above) we’ll have a line like this:

if (item[i] < item[j])
   swap(item[i], item[j]);

That means for each comparison that swaps we’re copying two large objects structures, touching over hundreds of bytes of memory.  Those copies are putting a lot of pressure on the cache.  Especially since we only really need the sort key to order our DrawCall objects, and it’s only 4 bytes.  We have a lot of wasted memory access.

An improved method sorts a second indirect list of smaller structures than can then be used to access or sort the original list of larger objects.

struct IndirectObject
{
   int sortKey;
   int objectIndex;

   bool operator< (const DrawCall &dc) { return sortKey < dc.sortKey; }
}

We can initialize a list of these indirect objects from the original list, sort them, and then use the sorted indirect objects to access our larger objects in sorted order.

for (int i  = 0;  i < numDrawCalls; i++)
{
   indirectList[i].sortKey = drawCallList[i].sortKey;
   indirectList[i].sortKey = i;
}

std::sort(indirectList, indirectList + numDrawCalls);

for (int i  = 0;  i < numDrawCalls; i++)
{
   int index = indirectList[i].index;
   drawCallList[index]->Draw();
}

So, just how much faster is this?  Let’s take a look:

cache_pressure_indirect

Again we’re looking a order of magnitude in speed difference.  The indirect line represents sorting the indirect array and then copying the draw calls into sorted order – so the end result is identical to the directly sorted array.  The no-copy line represents the time to sort the indirect array.  So if you only need to access your items in sorted order you’re probably best sorting an indirect array accessing your list indirectly.  This also has the advantage that you can keep sorted arrays around for different purposes – front to back, by material, etc., with out having to re-sort all the time or have multiple copies of the list.

Here’s the code if you want to run the tests yourself.

I think this is one of the clearest examples you’ll see on the impact of memory on performance.  This technique can be applied in a number of data structures to reduce the pressure on the cache. and as demonstrated can significantly affect performance.

Advertisements

1 Comment »

  1. […] says something about the difference in thought process required. Interestingly his original post (here) was about avoiding sorting large structures like this because it's a fairly stupid thing to do. I […]

    Pingback by C# vs C++ performance — May 31, 2011 @ 8:48 am


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: