Maximizing Shared Memory Bandwidth on NVIDIA Kepler GPUs

Shared Memory Configurations

On NVIDIA Kepler (Compute 3.x) GPUs, shared memory has 32 banks, with each bank having a bandwidth of 64 bits per clock cycle. On Fermi GPUs (Compute 2.x), shared memory also has 32 banks, but the bandwidth per bank is only 32 bits per clock cycle.

In Fermi GPUs, successive 32-bit words are assigned to successive banks. Kepler GPUs are configurable: either successive 32-bit words or successive 64-bit words are assigned to successive banks. You select the configuration by calling cudaDeviceSetSharedMemConfig() from the host prior to launching your kernel, passing one of cudaSharedMemBankSizeDefault, cudaSharedMemBankSizeFourByte, or cudaSharedMemBankSizeEightByte.
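For example, a minimal host-side sketch (only the configuration calls are shown; the kernel launches are omitted) looks like this:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // Request 64-bit-wide shared memory banks for subsequent kernel launches.
    cudaError_t err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    if (err != cudaSuccess)
        printf("cudaDeviceSetSharedMemConfig: %s\n", cudaGetErrorString(err));

    // ... launch kernels that rely on eight-byte banks here ...

    // Switch back to four-byte banks if later kernels expect that layout.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);
    return 0;
}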

In the eight byte configuration, bank conflicts occur when two or more threads in a warp request different 64-bit words from the same bank. In the four byte configuration, bank conflicts occur when two or more threads in a warp access 32-bit words from the same bank and those words fall in different aligned 64-word segments (Figure 1).

Figure 1 - Bank Conflicts in the Four Byte Configuration
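To make the bank mapping concrete, here is an illustrative calculation (my own formulation, not taken from the CUDA documentation) of which of the 32 banks a given shared memory byte offset falls into under each configuration:

#include <stddef.h>

// Which bank a shared memory byte offset maps to in each configuration.
unsigned int bankFourByteMode(size_t byteOffset)
{
    return (unsigned int)((byteOffset / 4) % 32);   // successive 32-bit words -> successive banks
}

unsigned int bankEightByteMode(size_t byteOffset)
{
    return (unsigned int)((byteOffset / 8) % 32);   // successive 64-bit words -> successive banks
}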

Performance Implications

For kernels operating on double precision values, the eight byte configuration on Kepler allows loads and stores of consecutive doubles without bank conflicts. This is an improvement over Fermi GPUs, where accessing double precision values always incurred bank conflicts. If your kernels operate only on single precision values, each 32-bit shared memory access uses just half of Kepler's per-bank bandwidth. Modifying your kernels to operate on 64-bit values in shared memory (perhaps using float2) may improve performance if shared memory throughput is a bottleneck for your kernel.
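As a simple illustration (a sketch of my own, not the filter kernel discussed below), a kernel that stages float2 values through shared memory issues one 64-bit store and one 64-bit load per thread, using the full bank width in the eight byte configuration:

__global__ void stageThroughShared(const float2 *in, float2 *out)
{
    __shared__ float2 tile[256];                 // assumes blockDim.x <= 256

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                   // one 64-bit shared memory store per thread
    out[i] = tile[threadIdx.x];                  // one 64-bit shared memory load per thread
}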

Shared Memory Example

I’ve previously discussed a 21-point 1D filter kernel. The kernel is written such that each thread computes a single output value. A thread block with N threads loads N+20 values into a shared memory array (inputS) so that all N threads can read their 21 input values without redundant reads from global memory. The filter calculation looks like the following:


filtered[gIdx] = inputS[sIdx - 10] * FilterCoeffs[ 0] +
                 inputS[sIdx -  9] * FilterCoeffs[ 1] +
                 inputS[sIdx -  8] * FilterCoeffs[ 2] +
                 /* ... terms 3 through 8 ... */
                 inputS[sIdx -  1] * FilterCoeffs[ 9] +
                 inputS[sIdx     ] * FilterCoeffs[10] +
                 inputS[sIdx +  1] * FilterCoeffs[11] +
                 /* ... terms 12 through 19 ... */
                 inputS[sIdx + 10] * FilterCoeffs[20];
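For context, here is a minimal sketch of the rest of the kernel. The names (inputS, sIdx, gIdx, FilterCoeffs) follow the snippet above, but the halo-loading details are my reconstruction rather than the original code, and boundary handling is omitted:

#define RADIUS 10   // filter half-width: 21 = 2*RADIUS + 1 taps

__constant__ float FilterCoeffs[2*RADIUS + 1];

__global__ void filter21(const float *input, float *filtered)
{
    extern __shared__ float inputS[];               // blockDim.x + 2*RADIUS floats

    int gIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int sIdx = threadIdx.x + RADIUS;

    // Each block stages its N input values plus a RADIUS-wide halo on each side.
    inputS[sIdx] = input[gIdx];
    if (threadIdx.x < RADIUS)
    {
        inputS[sIdx - RADIUS]     = input[gIdx - RADIUS];
        inputS[sIdx + blockDim.x] = input[gIdx + blockDim.x];
    }
    __syncthreads();

    // 21-point filter written as a loop; equivalent to the unrolled expression above.
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 2*RADIUS + 1; i++)
        sum += inputS[sIdx - RADIUS + i] * FilterCoeffs[i];

    filtered[gIdx] = sum;
}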

The profiler shows that compute and memory utilization is balanced (Figure 2).



Figure 2 - Compute, Bandwidth or Latency Bound?

The compute workload is dominated by shared memory load/stores (Figure 3), and shared memory bandwidth is a bottleneck (Figure 4).



Figure 3 - Compute Resources



Figure 4 - Memory Bandwidth

Optimizing the Kernel

If we could read/write 2 floats at a time from shared memory, we could halve the number of load/store operations. We can accomplish this by rewriting our kernel so that each thread computes two filtered output elements, separated by blockDim.x*gridDim.x elements. We need half as many thread blocks as before. Each thread block needs to store 2N + 40 input elements in shared memory. To facilitate conflict-free access, we want to pack elements separated by blockDim.x*gridDim.x into the same float2 value. The filter computation then becomes:


float2 sum = {0.0f, 0.0f};
#pragma unroll
for(int i = 0; i < 21; i++)
{
       float2 temp = inputS[sIdxShift - 10 + i];
       sum.x += temp.x*FilterCoeffs[i];
       sum.y += temp.y*FilterCoeffs[i];
}
filtered[gIdx] = sum.x;
filtered[gIdx + blockDim.x*gridDim.x] = sum.y;
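Putting the pieces together, a sketch of the rewritten kernel might look like the following. As with the earlier sketch, the staging logic and names are my reconstruction (boundary handling omitted); each float2 packs two inputs separated by the grid stride:

#define RADIUS 10

__constant__ float FilterCoeffs[2*RADIUS + 1];

__global__ void filter21_x2(const float *input, float *filtered)
{
    extern __shared__ float2 inputS[];            // blockDim.x + 2*RADIUS float2 values

    int stride    = blockDim.x * gridDim.x;       // distance between a thread's two outputs
    int gIdx      = blockIdx.x * blockDim.x + threadIdx.x;
    int sIdxShift = threadIdx.x + RADIUS;

    // Pack input[gIdx] and input[gIdx + stride] into one float2, plus halos.
    inputS[sIdxShift] = make_float2(input[gIdx], input[gIdx + stride]);
    if (threadIdx.x < RADIUS)
    {
        int lo = gIdx - RADIUS;
        int hi = gIdx + blockDim.x;
        inputS[sIdxShift - RADIUS]     = make_float2(input[lo], input[lo + stride]);
        inputS[sIdxShift + blockDim.x] = make_float2(input[hi], input[hi + stride]);
    }
    __syncthreads();

    // Filter computation as shown above: .x produces the first output,
    // .y the output stride elements away.
    float2 sum = make_float2(0.0f, 0.0f);
    #pragma unroll
    for (int i = 0; i < 2*RADIUS + 1; i++)
    {
        float2 temp = inputS[sIdxShift - RADIUS + i];
        sum.x += temp.x * FilterCoeffs[i];
        sum.y += temp.y * FilterCoeffs[i];
    }
    filtered[gIdx]          = sum.x;
    filtered[gIdx + stride] = sum.y;
}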

Performance Results

The kernels were compiled for the Compute 3.5 architecture using the CUDA 5.5 compiler. I benchmarked the kernels using a 2^24-point input data set on a Tesla K20c GPU with the 331.20 Linux driver. ECC was disabled for the test. The results are shown in Table 1.

Rewriting the kernel but keeping the shared memory configuration in four byte mode still shows some improvement over the original. There are some economies of scale in the new kernel: the index calculations are amortized over two output calculations, and each thread issues twice as many independent global memory loads and stores, which may hide latency better. However, most of the shared memory accesses still suffer from bank conflicts, for a subtle reason: the accesses are to consecutive float2 elements, but the 32-bit words touched by a warp do not all fall within a single aligned 64-word segment, so in the four byte configuration they cause bank conflicts. Switching to the eight byte configuration eliminates the bank conflicts and improves performance further!
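To make the conflict concrete, here is a worked example with numbers of my own choosing, assuming the float2 array inputS starts on a 256-byte boundary and sIdxShift = threadIdx.x + 10:

// At loop iteration i = 1, thread t reads float2 element t + 1, which in the
// four byte configuration occupies 32-bit words 2t + 2 and 2t + 3:
//
//   thread 15: words 32, 33 -> banks 0, 1 (aligned 64-word segment: words 0-63)
//   thread 31: words 64, 65 -> banks 0, 1 (aligned 64-word segment: words 64-127)
//
// Same banks, different aligned segments: a two-way bank conflict.  In the
// eight byte configuration each float2 element is a single 64-bit word in its
// own bank (element t + 1 -> bank (t + 1) mod 32), so the pattern is conflict-free.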

Kernel                                                   Execution Time (ms)
32-bit Shared Memory                                     2.1387
64-bit Shared Memory (cudaSharedMemBankSizeFourByte)     1.78614
64-bit Shared Memory (cudaSharedMemBankSizeEightByte)    1.33753

Table 1 - Performance results