CUDA Notes

State of GPU Virtualization for CUDA Applications 2014


Wide spread corporate adoption of virtualization technologies have led some users to rely on Virtual Machines (VMs). When these users or IT administrators wish to start using CUDA, often the first thought is to spin up a new VM. Success is not guaranteed as not all virtualization technologies support CUDA. A survey of GPU virtualization technologies for running CUDA applications is presented. To support CUDA, a VM must be able to present a supported CUDA device to the VM’s operating system and install the NVIDIA graphics driver.

GPU Virtualization Terms 

  • Device Pass-Through: This is the simplest virtualization model where the entire GPU is presented to the VM as if directly connected. The virtual GPU is usable by only one VM. The CPU equivalent is assigning a single core for exclusive use by a VM. VMware calls this mode virtual Direct Graphics Accelerator (vDGA).
  • Partitioning: A GPU is split into virtual GPUs that are used independently by a VM. 
  • Timesharing: Timesharing involves sharing the GPU or portion of between multiple VMs. Also known as oversubscription or multiplexing, the technology for timesharing CPUs is mature while GPU timesharing is being introduced. 
  • Live Migration: The ability to move a running VM from one VM host to another without downtime.

Virtualization Support for CUDA 

CUDA support from five virtualization technology vendors accounting for most of the virtualization market was examined. The five major vendors are VMWare, Microsoft, Oracle, Citrix and Red Hat. A summary is shown in the table below.

GPU Boost on NVIDIA’s Tesla K40 GPUs

What is GPU Boost?

GPU Boost is a new user controllable feature to change the processor clock speed on the Tesla K40 GPU. NVIDIA is currently supporting 4 selectable Stream Processor clock speeds and two selectable Memory Clock Speeds on the K40.  The base clock for the Stream Processors is 745MHz and the three selectable options are 666 MHz, 810 MHz and 875MHz (finally, a manufacturer not afraid of superstition!). The base Memory Clock frequencies are 3004MHz (default) and 324MHz (idle). Only the effects of tuning the Stream Processor clock are discussed as there is no application performance increase that results from adjusting the Memory Clock. This blog shows the impact of GPU Boost on a seismic imaging application (Reverse Time Migration) and an electromagnetic solver (Finite-difference time-domain).

GPU Boost is useful as not all applications have the same power profile. The K40 has a maximum 235W power capacity. For example, an application that runs at an average power consumption of 180W at the base frequency will have a 55W power headroom. By increasing the clock frequency, the application theoretically can take advantage of the full 235W capacity.

Enabling GPU Boost

GPU Boost is controlled using NVIDIA’s System Management Interface utility (nvidia-smi) with the following commands:

Command Explanation
nvidia-smi –q –d SUPPORTED_CLOCKS Show Supported Clock Frequencies
nvidia-smi –ac <MEM clock, Graphics clock> Set the Memory and Graphics Clock Frequency
nvidia-smi –q –d CLOCK Shows current mode
nvidia-smi –rac Resets all clocks
nvidia-smi –acp 0 Allows non-root to change clocks

Maximizing Shared Memory Bandwidth on NVIDIA Kepler GPUs

Shared Memory Configurations

On NVIDIA Kepler (Compute 3.x) GPUs, shared memory has 32 banks, with each bank having a bandwidth of 64-bits per clock cycle. On Fermi GPUs (Compute 2.x) shared memory also has 32 banks, but the bandwidth per bank is only 32-bits per clock cycle.

In Fermi GPUs, successive 32-bit words are assigned to successive banks. Kepler GPUs are configurable where either successive 32-bit words OR 64-bit words are assigned to successive banks. You can specify your desired configuration by calling cudaDeviceSharedMemConfig() from the host prior to launching your kernel, and specifying one of cudaSharedMemBankSizeDefault, cudaSharedMemBankSizeFourByte, or cudaSharedMemBankSizeEightByte.

In the eight byte configuration, bank conflicts occur when two or more threads in a warp request different 64-bit words from the same bank. In the four byte configuration, bank conflicts occur if two or more threads in a warp access 32-bit words from the same bank where those words span multiple 64-word aligned segments (Figure 1).

Figure 1 - Bank Conflicts in the Four Byte Configuration

Performance Implications

For kernels operating on double precision values, the eight byte configuration on Kepler allows loads/stores to consecutive doubles without bank conflicts. This is an improvement over Fermi GPUs, where accessing double precision values always incurred bank conflicts. If your kernels are operating on single precision values, you are only utilizing half the available shared memory bandwidth. Modifying your kernels to operate on 64-bit values in shared memory (perhaps using float2) may improve performance if shared memory throughput is a performance bottleneck for your kernel.

Constant Cache vs. Read-Only Cache - Part 1

The Checkered Flag

During a recent webinar one of our attendees asked us the question:
“So which is faster, the constant cache or the read-only data cache?”

The constant cache and the read-only cache are intended for quite different purposes. Let’s briefly consider a Ferrari 458 Speciale and Dodge Grand Caravan instead. If you need to post a fast lap time at Nürburgring, take the Ferrari. However, the Dodge is a better choice if you need to get 6 passengers and their luggage to a weekend getaway destination.

With that in mind, let’s take a closer look at the constant cache and the read-only data cache.

Constant Memory

Constant memory has been available to CUDA developers since the very first generation of CUDA capable GPUs (Compute Capability 1.0). Data in constant memory:

  • Resides in a 64KB partition of device memory
  • Is accessed through an 8KB cache on each SM(X)
  • Is intended to be broadcast to all threads in a warp

The last point is important. If all the threads in the warp request the same value, that value is delivered to all threads in a single cycle. If the threads in a warp request N different values, the request is serialized and the N values are delivered one at a time over N clock cycles. You want to ensure your indexes into constant memory arrays are not functions of threadIdx.x.

Read-Only Data Cache

The read-only data cache was introduced with Compute Capability 3.5 architectures (e.g. Tesla K20c/K20X and GeForce GTX Titan/780 GPUs). Similar functionality has been available since Compute Capability 1.0 devices, although you needed to use the somewhat unconventional texture path to take advantage of it.

Each SMX has a 48KB read-only cache. The CUDA compiler automatically accesses data via the read-only cache when it can determine that data is read-only for the lifetime of your kernel. In practice, you need to qualify pointers as const and __restrict__ before the compiler can satisfy this condition. You can also specify a read-only data cache access with the __ldg() intrinsic function.

NVIDIA Hardware Comparison – GeForce GTX 690 vs. Tesla K10

GeForce GTX 690 vs. Tesla K10

During a recent webinar one of our attendees asked us the question:

“Which hardware is faster – the GeForce GTX 690 or the Tesla K10?”

The short answer: GeForce GTX 690.

That isn’t the entire story though. The specifications for both cards are shown in the table below:

Card Number of GPUs Memory BW (GB/s) SP Performance (GFlops) Memory (GB) Power (W) Cooling
GTX 690 2 (GK104) 284.5 5622 4 300 Fan
K10 2 (GK104) 320 4577 8 250 Passive

While the raw throughput on the GeForce GTX 690 is higher than the Tesla K10, the memory on the GTX 690 is half of the size of the K10. Depending on your application, memory size may outweigh the benefit of the slightly higher throughput on the GeForce.

Additionally, NVIDIA’s Tesla product line has features not available for GeForce hardware:

  • ECC Memory Protection – improved memory reliability
  • Manufacturing Quality – hardware is manufactured and tested by NVIDIA to guarantee highest reliability
  • 1.5 to 2 Year Availability – useful for enterprise customers who need to replace or add hardware after initial deployment
  • Cluster and GPU Management Software – diagnostics useful for monitoring hardware in clusters
  • Advanced Driver Support – TCC drivers allows the use of Windows Remote Desktop and avoids the Windows watchdog timer when running large kernels
  • Form Factors – Tesla is supported in certified servers, blades and workstations

The full list of Tesla benefits can be found here:

In almost all cases, our customers deploy Tesla hardware in production environments. The memory size, reliability and long availability outweigh the modest performance benefits of the GeForce product line.

Finally, for double precision computations you will want to consider the K20X or GeForce GTX Titan since the double precision performance on the GTX 690 and K10 is approximately 1/24th the single precision throughput. But if you can get away with single precision calculations, the GeForce GTX 690 and Tesla K10 are great hardware choices!


Does your CUDA application need to target a specific GPU? If you are writing GPU enabled code, you would typically use a device query to select the desired GPUs. However, a quick and easy solution for testing is to use the environment variable CUDA_VISIBLE_DEVICES to restrict the devices that your CUDA application sees. This can be useful if you are attempting to share resources on a node or you want your GPU enabled executable to target a specific GPU.

Subscribe to RSS - CUDA Notes