Blogs

NVIDIA CUDA 6.0 Unified Memory Performance

One of the new features introduced by NVIDIA in CUDA 6.0 is Unified Memory, which simplifies GPU code while maximizing data access speed by transparently managing memory between the CPU and GPU. Based on the examples provided by NVIDIA in the CUDA C Programming Guide, it's easy to see how Unified Memory simplifies code; however, the actual performance is a bit of a mystery. Time to do some investigation!

We begin the performance test with the sample code provided in NVIDIA's CUDA C Programming Guide (Section J.1.1).

 

#include <cstdio>
#include <cstdlib>

__global__ void AplusB(int *ret, int a, int b) {
   ret[threadIdx.x] = a + b + threadIdx.x;
}

int main() {
   int *ret;
   cudaMalloc(&ret, 1000 * sizeof(int));    // device-only allocation

   AplusB<<< 1, 1000 >>>(ret, 10, 100);

   // Results must be copied back explicitly into a separate host allocation.
   int *host_ret = (int *)malloc(1000 * sizeof(int));
   cudaMemcpy(host_ret, ret, 1000 * sizeof(int), cudaMemcpyDefault);

   for (int i = 0; i < 1000; i++)
      printf("%d: A+B = %d\n", i, host_ret[i]);

   free(host_ret);
   cudaFree(ret);
   return 0;
}

Listing 1: Original code with explicit memory transfers

 

#include <cstdio>

__global__ void AplusB(int *ret, int a, int b) {
   ret[threadIdx.x] = a + b + threadIdx.x;
}

int main() {
   int *ret;
   cudaMallocManaged(&ret, 1000 * sizeof(int));    // managed allocation, accessible from CPU and GPU

   AplusB<<< 1, 1000 >>>(ret, 10, 100);
   cudaDeviceSynchronize();    // required before the host reads managed memory

   for (int i = 0; i < 1000; i++)
      printf("%d: A+B = %d\n", i, ret[i]);
   cudaFree(ret);
   return 0;
}

Listing 2: Unified Memory version
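
One straightforward way to compare the two versions is to time the kernel launch and the synchronization that follows it using CUDA events. The snippet below is a minimal sketch only, not the measurement code from this post; it reuses ret and AplusB from Listing 2.

// Minimal timing sketch (illustrative only): measure the kernel plus the
// synchronization that Unified Memory requires before the host reads ret.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
AplusB<<< 1, 1000 >>>(ret, 10, 100);
cudaDeviceSynchronize();
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel + synchronization: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);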

AMD FirePro W9100 and OpenCL Updates

AMD FirePro W9100

Last week AMD announced the FirePro W9100. This new professional workstation GPU compares favorably to NVIDIA's Tesla K40:

                                            AMD FirePro W9100   NVIDIA Tesla K40          NVIDIA Tesla K40
                                                                (745 MHz default clock)   (875 MHz clock with GPU Boost)
Memory (GB)                                 16                  12                        12
Peak Single Precision Throughput (TFLOPS)   5.24                4.29                      5.04
Peak Double Precision Throughput (TFLOPS)   2.62                1.43                      1.68

Webinar: Accelerating FWI via OpenCL on AMD GPUs

Join Chris Mason, Acceleware Product Manager, as he presents a case study of accelerating a seismic algorithm on a cluster of AMD GPU compute nodes for GeoTomo, a geophysical software and processing provider. The presentation begins with an outline of the full waveform inversion (FWI) algorithm, followed by an introduction to OpenCL that covers the OpenCL programming model and memory spaces. After a short programming example, Chris takes you step by step through the project phases of profiling, feasibility analysis and implementation. He shares the strategy for formulating the problem to take advantage of the massively parallel GPU architecture. Key optimization techniques are discussed, including coalescing and an iterative approach to handling the slices. Performance results for the GPU are compared to the CPU run times.

 

GPU Boost on NVIDIA’s Tesla K40 GPUs

What is GPU Boost?

GPU Boost is a new user-controllable feature for changing the processor clock speed on the Tesla K40 GPU. NVIDIA currently supports four selectable Stream Processor clock speeds and two selectable Memory Clock speeds on the K40. The base clock for the Stream Processors is 745 MHz, and the three other selectable options are 666 MHz, 810 MHz and 875 MHz (finally, a manufacturer not afraid of superstition!). The Memory Clock frequencies are 3004 MHz (default) and 324 MHz (idle). Only the effects of tuning the Stream Processor clock are discussed here, since adjusting the Memory Clock yields no application performance increase. This blog shows the impact of GPU Boost on a seismic imaging application (Reverse Time Migration) and an electromagnetic solver (Finite-Difference Time-Domain).

GPU Boost is useful because not all applications have the same power profile. The K40 has a maximum power capacity of 235 W. For example, an application that runs at an average power consumption of 180 W at the base frequency has 55 W of power headroom. By increasing the clock frequency, the application can theoretically take advantage of the full 235 W capacity.

Enabling GPU Boost

GPU Boost is controlled using NVIDIA’s System Management Interface utility (nvidia-smi) with the following commands:

Command                                       Explanation
nvidia-smi -q -d SUPPORTED_CLOCKS             Show the supported clock frequencies
nvidia-smi -ac <MEM clock,Graphics clock>     Set the memory and graphics clock frequencies
nvidia-smi -q -d CLOCK                        Show the current clock settings
nvidia-smi -rac                               Reset all clocks to their defaults
nvidia-smi -acp 0                             Allow non-root users to change application clocks
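
For example, to request the 875 MHz Stream Processor clock together with the default 3004 MHz memory clock (run as root, or after enabling non-root access with -acp 0), one would issue:

nvidia-smi -ac 3004,875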

Maximizing Shared Memory Bandwidth on NVIDIA Kepler GPUs

Shared Memory Configurations

On NVIDIA Kepler (Compute 3.x) GPUs, shared memory has 32 banks, with each bank having a bandwidth of 64-bits per clock cycle. On Fermi GPUs (Compute 2.x) shared memory also has 32 banks, but the bandwidth per bank is only 32-bits per clock cycle.

In Fermi GPUs, successive 32-bit words are assigned to successive banks. On Kepler GPUs the mapping is configurable: either successive 32-bit words or successive 64-bit words are assigned to successive banks. You can specify your desired configuration by calling cudaDeviceSetSharedMemConfig() from the host prior to launching your kernel, passing one of cudaSharedMemBankSizeDefault, cudaSharedMemBankSizeFourByte, or cudaSharedMemBankSizeEightByte.
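
As a minimal sketch (not from the original post), the eight byte configuration would be selected from host code like this; the error check is purely illustrative:

// Request 64-bit (eight byte) shared memory banks for subsequent kernel launches.
// Note: the setting is device-wide, not per-kernel.
cudaError_t err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
if (err != cudaSuccess)
    printf("cudaDeviceSetSharedMemConfig failed: %s\n", cudaGetErrorString(err));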

In the eight byte configuration, bank conflicts occur when two or more threads in a warp request different 64-bit words from the same bank. In the four byte configuration, bank conflicts occur if two or more threads in a warp access 32-bit words from the same bank where those words span multiple 64-word aligned segments (Figure 1).

Figure 1 - Bank Conflicts in the Four Byte Configuration

Performance Implications

For kernels operating on double precision values, the eight byte configuration on Kepler allows loads/stores to consecutive doubles without bank conflicts. This is an improvement over Fermi GPUs, where accessing double precision values always incurred bank conflicts. If your kernels are operating on single precision values, you are only utilizing half the available shared memory bandwidth. Modifying your kernels to operate on 64-bit values in shared memory (perhaps using float2) may improve performance if shared memory throughput is a performance bottleneck for your kernel.
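
As an illustration of the float2 approach, here is a minimal sketch (not code from the original post) of a kernel that stages float pairs in shared memory so that each thread moves a full 64-bit word per access; the kernel name, 256-thread block size and host calls are assumptions:

// Each thread loads two consecutive floats as a single float2, so with
// eight byte banks every shared memory access uses a bank's full 64-bit width.
__global__ void ScaleWithFloat2(const float2 *in, float2 *out, float factor)
{
    __shared__ float2 tile[256];              // assumes blockDim.x == 256

    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];              // 64-bit shared memory store
    __syncthreads();

    float2 v = tile[threadIdx.x];             // 64-bit shared memory load
    v.x *= factor;
    v.y *= factor;
    out[idx] = v;
}

// Host side (illustrative): prefer eight-byte banks, then launch.
// cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
// ScaleWithFloat2<<< numBlocks, 256 >>>(d_in, d_out, 2.0f);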

Webinar: An Introduction to OpenCL for Altera FPGAs

Join Chris Mason as he presents an informative 25 minute introduction on how to program Altera FPGAs with OpenCL. The webinar begins with an overview of the OpenCL programming model and data parallelism. Chris then discusses simple OpenCL syntax, kernels and memory spaces. Finally, Chris examines how OpenCL is mapped to the Altera FPGA architecture. He outlines how to compile an OpenCL kernel for Altera FPGAs and summarizes OpenCL optimization techniques.

Click here to find out more about OpenCL for Altera FPGAs.

Webinar: CUDA Tools for Optimal Performance and Productivity

Presented by Dan Cyca

This webinar provides an overview of profiling techniques and the tools available to help you optimize your code. We will examine NVIDIA's Visual Profiler and cuobjdump and highlight the various methods available for understanding the performance of your CUDA program. The second part of the session will focus on debugging techniques and the tools available to help you identify issues in your kernels. The debugging tools provided in CUDA 5.5, including Nsight and cuda-memcheck, will be discussed.

Click here to find out more about Acceleware's CUDA training.

Webinar: How to Improve Performance using the CUDA Memory Model and Features of the Kepler Architecture

Presented by Chris Mason

Join Chris Mason, Acceleware Product Manager, to explore the memory model of the GPU, the memory enhancements available in the new Kepler architecture, and how these will affect your performance optimization. The webinar will begin with an essential overview of GPU architecture and thread cooperation before focusing on the different memory types available on the GPU. We will define shared, constant and global memory and discuss the best locations to store your application data for optimized performance. The shuffle instruction, new shared memory configurations and Read-Only Data Cache of the Kepler architecture are introduced and optimization techniques discussed.

Click here to find out more about Acceleware's CUDA training.

Constant Cache vs. Read-Only Cache - Part 2

Match fixing

Dan’s previous blog highlights the major differences between constant memory and the read-only data cache available on CUDA devices built on the Compute Capability 3.5 architecture (Tesla K20c/K20x, GeForce GTX Titan/780 GPUs).

In that particular example (a 21-point 1D filter operation), the constant memory version [Ferrari] was roughly 1.7x faster than the read-only cache [mini-van]. That is not a surprising outcome, given that the same coefficient values are broadcast to every thread in the warp. (The race was fixed!) If all threads in the warp accessed different coefficient values, the constant memory accesses would be serialized, and we would expect the read-only cache to clearly win the day. Unfortunately, it's not always so straightforward to decide which memory space to use to achieve maximum performance.

Today we’ll examine a series of “races” between the read-only cache and constant memory. We’ll even throw in global memory [commuter bus] for good measure.

“Is this line secure?”

Encryption has received increased global interest lately and will serve as the basis for our series of races. For this experiment, we’ll implement a simple substitution encryption kernel using CUDA. The idea is fairly straightforward: Given an input P [plaintext], along with a substitution table S, produce the output C [ciphertext] = S(P). Here are three simple CUDA kernels to perform this substitution, where the substitution table is accessed either via the read-only cache, constant memory, or global memory. Each CUDA thread handles a single index in the input.

// Table lookup via the read-only data cache (forced with __ldg).
__global__ void SubstitutionEncrypt_ReadOnlyCache(const int * __restrict__ plaintext,
                                                  const int * __restrict__ substTable,
                                                  int * __restrict__ ciphertext)
{
    const int index = blockIdx.x * blockDim.x + threadIdx.x;
    ciphertext[index] = __ldg(&substTable[plaintext[index]]);
}

// Substitution table in constant memory (size assumed to be 256 entries for illustration).
__constant__ int substTable_c[256];

// Table lookup via constant memory.
__global__ void SubstitutionEncrypt_ConstantMemory(const int * __restrict__ plaintext,
                                                   int * __restrict__ ciphertext)
{
    const int index = blockIdx.x * blockDim.x + threadIdx.x;
    ciphertext[index] = substTable_c[plaintext[index]];
}

// Table lookup via plain global memory.
__global__ void SubstitutionEncrypt_GlobalMemory(int *plaintext,
                                                 int *substTable,
                                                 int *ciphertext)
{
    const int index = blockIdx.x * blockDim.x + threadIdx.x;
    ciphertext[index] = substTable[plaintext[index]];
}
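
For completeness, here is a minimal host-side sketch (not code from the original post) showing how the constant memory table could be populated before launching that kernel; the function name, 256-entry table size and launch configuration are assumptions for illustration:

void EncryptWithConstantTable(const int *d_plaintext, int *d_ciphertext,
                              const int *h_substTable, int numElements)
{
    // One-time copy of the substitution table into the constant memory symbol
    // declared above, before any launches that read it.
    cudaMemcpyToSymbol(substTable_c, h_substTable, 256 * sizeof(int));

    const int threadsPerBlock = 256;
    const int numBlocks = numElements / threadsPerBlock;   // assumes numElements is a multiple of 256
    SubstitutionEncrypt_ConstantMemory<<< numBlocks, threadsPerBlock >>>(d_plaintext, d_ciphertext);
}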

It is important to note that the memory access pattern [...]

Webinar: An Introduction to GPU Programming

Presented by Chris Mason

This is an informative introduction to GPU computing, recorded from our live webinar in the GTC Express series on September 25, 2013. The tutorial begins with a brief overview of CUDA and data parallelism before focusing on the GPU programming model. We explore the fundamentals of GPU kernels, host and device responsibilities, CUDA syntax and thread hierarchy. Enjoy!

Click here to find out more about Acceleware's CUDA training.

 
