Constant Cache vs. Read-Only Cache - Part 1

The Checkered Flag

During a recent webinar one of our attendees asked us the question:
“So which is faster, the constant cache or the read-only data cache?”

The constant cache and the read-only cache are intended for quite different purposes. Let’s briefly consider a Ferrari 458 Speciale and a Dodge Grand Caravan instead. If you need to post a fast lap time at Nürburgring, take the Ferrari. However, the Dodge is a better choice if you need to get 6 passengers and their luggage to a weekend getaway destination.

With that in mind, let’s take a closer look at the constant cache and the read-only data cache.

Constant Memory

Constant memory has been available to CUDA developers since the very first generation of CUDA capable GPUs (Compute Capability 1.0). Data in constant memory:

  • Resides in a 64KB partition of device memory
  • Is accessed through an 8KB cache on each SM(X)
  • Is intended to be broadcast to all threads in a warp

The last point is important. If all the threads in the warp request the same value, that value is delivered to all threads in a single cycle. If the threads in a warp request N different values, the request is serialized and the N values are delivered one at a time over N clock cycles. You want to ensure your indexes into constant memory arrays are not functions of threadIdx.x.
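
To make the distinction concrete, here is a minimal sketch (the coeffs array, kernel name, and parameters are hypothetical, not code from the webinar) contrasting a broadcast-friendly constant memory access with one that serializes:

__constant__ float coeffs[32];   // small, read-only table in constant memory

__global__ void BroadcastVsSerialized(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Broadcast-friendly: the index k is the same for every thread in the
    // warp, so each coefficient is delivered to the whole warp at once.
    float sum = 0.0f;
    for (int k = 0; k < 32; ++k)
        sum += in[i] * coeffs[k];

    // Serialized: the index depends on threadIdx.x, so the threads in a
    // warp request different addresses and the accesses are replayed.
    sum += coeffs[threadIdx.x % 32];

    out[i] = sum;
}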

Read-Only Data Cache

The read-only data cache was introduced with Compute Capability 3.5 architectures (e.g. Tesla K20c/K20X and GeForce GTX Titan/780 GPUs). Similar functionality has been available since Compute Capability 1.0 devices, although you needed to use the somewhat unconventional texture path to take advantage of it.

Each SMX has a 48KB read-only cache. The CUDA compiler automatically accesses data via the read-only cache when it can determine that data is read-only for the lifetime of your kernel. In practice, you need to qualify pointers as const and __restrict__ before the compiler can satisfy this condition. You can also specify a read-only data cache access with the __ldg() intrinsic function.
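
As a brief sketch (the kernel names and parameters below are illustrative only), the two ways to reach the read-only data cache look like this:

// The compiler can route loads through the read-only cache when the input
// pointer is qualified as both const and __restrict__.
__global__ void ScaleQualified(const float* __restrict__ in,
                               float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];   // eligible for the read-only cache

}

// Alternatively, __ldg() forces an individual load through the read-only
// cache (available when compiling for Compute Capability 3.5 and later).
__global__ void ScaleLdg(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * __ldg(&in[i]);
}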

And Now We Race!

I benchmarked 2 versions of the following 21-point 1D filter operation:

filtered[gIdx] = inputS[sIdx - 10] * FilterCoeffs[ 0] +
                 inputS[sIdx -  9] * FilterCoeffs[ 1] +
                 inputS[sIdx -  8] * FilterCoeffs[ 2] +
                 // ...
                 inputS[sIdx -  1] * FilterCoeffs[ 9] +
                 inputS[sIdx     ] * FilterCoeffs[10] +
                 inputS[sIdx +  1] * FilterCoeffs[11] +
                 // ...
                 inputS[sIdx + 10] * FilterCoeffs[20];

Prior to applying the filter, the input array is copied to the shared memory array inputS.
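
A minimal sketch of that staging step might look like the following (the block size, halo handling, and kernel name are assumptions rather than the exact benchmark code):

#define RADIUS     10    // 21-point filter: 10 halo samples on each side
#define BLOCK_SIZE 256   // assumed thread block size

__global__ void FilterKernel(const float *input, float *filtered)
{
    __shared__ float inputS[BLOCK_SIZE + 2 * RADIUS];

    int gIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int sIdx = threadIdx.x + RADIUS;

    // Stage this block's samples, plus a RADIUS-wide halo on each side,
    // into shared memory (boundary handling at the array edges omitted).
    inputS[sIdx] = input[gIdx];
    if (threadIdx.x < RADIUS) {
        inputS[sIdx - RADIUS]     = input[gIdx - RADIUS];
        inputS[sIdx + BLOCK_SIZE] = input[gIdx + BLOCK_SIZE];
    }
    __syncthreads();

    // ... the 21-point filter shown above is applied here, reading
    // inputS[sIdx - 10] through inputS[sIdx + 10] ...
}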

For the constant memory version, FilterCoeffs is declared as follows:

__constant__ float FilterCoeffs[21] = {0.005f, 0.01f, … 0.01f, 0.005f};
__global__ void ConstantFilter(…);

For the read-only cache version, the filter coefficients are allocated in global memory using cudaMalloc(). The pointer argument is qualified as const and __restrict__ as follows:
__global__ void ReadOnlyFilter(…, float const* __restrict__ FilterCoeffs)

And the Winner Is?

The kernels were compiled for Compute 3.5 architectures using the CUDA 5.5 compiler. I benchmarked the kernels using a 2^24-point input data set on a Tesla K20c GPU with 319.32 drivers. ECC was disabled for the test.

Constant Memory   2.33754 ms
Read-only Cache   3.96591 ms

Is that what you expected? Our accesses to FilterCoeffs are ideal for constant memory. We have a small, read-only dataset, and we are broadcasting the same coefficient value to each thread in our warp. Constant memory should be very efficient in this situation.

The significantly improved CUDA 5.5 Visual Profiler provides additional insight.

In the constant memory kernel, the profiler indicates that compute and memory utilization are roughly balanced. The profiler is showing us the utilization of the busiest compute function unit (the load/store unit) and the busiest memory resource (L1/shared memory).

[Figure: constant memory kernel utilization bottleneck]

The compute performance is bottlenecked by high utilization of the load/store unit, which in this kernel is used to access our input data from shared memory.

[Figure: constant memory kernel compute analysis]

However, we have yet to saturate shared memory bandwidth.

The arithmetic function units can access the coefficients directly from constant memory. In that sense, accessing our filter coefficients is ‘free’. Our performance is currently limited by the number of shared memory load instructions, even though we have not saturated the available shared memory bandwidth. This suggests we could further improve performance by using the wider 64-bit shared memory banks available on Compute 3.x GPUs to load 64-bit values from shared memory with the same number of load/store instructions.
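
As a rough sketch of that idea (the float2 packing, the kernel, and the 256-thread block size are assumptions, not the benchmark code), each thread could move two samples per shared memory instruction once the device is switched to 8-byte banks:

#include <cuda_runtime.h>

// Sketch: each thread stages two samples as one float2, then reads them
// back with a single 64-bit shared memory load instruction.
__global__ void Filter64BitLoads(const float2 *input, float *output)
{
    __shared__ float2 inputS2[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    inputS2[threadIdx.x] = input[i];        // one 64-bit shared memory store
    __syncthreads();

    float2 pair = inputS2[threadIdx.x];     // one 64-bit load, two samples
    output[2 * i]     = pair.x;
    output[2 * i + 1] = pair.y;
}

void ConfigureWideBanks()
{
    // Opt into the 8-byte shared memory bank mode on Compute 3.x devices so
    // that 64-bit shared memory accesses map one element per bank.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
}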

In contrast, the read-only kernel has much higher memory utilization than compute utilization.

[Figure: read-only kernel utilization bottleneck]

It is memory limited. The texture cache bandwidth is high, but does not approach the device limit.

Latency analysis suggests that texture stalls are limiting performance.

[Figure: read-only kernel latency analysis]

So, in this race the extremely low latency of a constant memory broadcast (the Ferrari) compared to the higher latency of the read-only data cache (the minivan) is the deciding factor. Can you think of a scenario where the read-only data cache would be a better choice than the constant memory cache?

Comments

It would be nice to also show a case where the read-only cache outperforms constant memory. For example, strided access to a constant array (small stride), where the accesses within a warp are never uniform.

Thanks for the comment, mharris! We'll keep that in mind for a future blog post.