GPU Boost on NVIDIA’s Tesla K40 GPUs

What is GPU Boost?

GPU Boost is a new user-controllable feature that changes the processor clock speed on the Tesla K40 GPU. NVIDIA currently supports four selectable Stream Processor clock speeds and two selectable Memory Clock speeds on the K40. The base clock for the Stream Processors is 745 MHz, and the three other selectable options are 666 MHz, 810 MHz and 875 MHz (finally, a manufacturer not afraid of superstition!). The Memory Clock frequencies are 3004 MHz (default) and 324 MHz (idle). Only the effects of tuning the Stream Processor clock are discussed here, because adjusting the Memory Clock yields no application performance increase. This blog shows the impact of GPU Boost on a seismic imaging application (Reverse Time Migration) and an electromagnetic solver (finite-difference time-domain).

GPU Boost is useful because not all applications have the same power profile. The K40 has a maximum power capacity of 235 W. An application that averages 180 W at the base frequency, for example, has 55 W of power headroom. By increasing the clock frequency, the application can theoretically take advantage of the full 235 W capacity.
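
The headroom arithmetic above can be written down as a tiny helper. This is only an illustrative sketch: the function name power_headroom_w is made up for this example, and the 180 W figure is the hypothetical average from the paragraph, not a measurement.

```python
K40_POWER_LIMIT_W = 235  # maximum board power quoted in the text


def power_headroom_w(avg_power_w, limit_w=K40_POWER_LIMIT_W):
    """Return the unused power budget in watts, floored at zero."""
    return max(limit_w - avg_power_w, 0)


# The example from the text: 180 W average at the base clock.
print(power_headroom_w(180))  # prints 55
```

The headroom only says that a higher clock may fit in the power budget; it does not predict which clock the application will actually sustain.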

Enabling GPU Boost

GPU Boost is controlled using NVIDIA’s System Management Interface utility (nvidia-smi) with the following commands:

Command                                        Explanation
nvidia-smi -q -d SUPPORTED_CLOCKS              Shows the supported clock frequencies
nvidia-smi -ac <MEM clock,Graphics clock>      Sets the memory and graphics clock frequencies
nvidia-smi -q -d CLOCK                         Shows the current clock settings
nvidia-smi -rac                                Resets all clocks
nvidia-smi -acp 0                              Allows non-root users to change clocks

On the K40, an nvidia-smi query to find the supported clock frequencies gives the following output:

[srahim@corsair3 ~]$ nvidia-smi -q -d SUPPORTED_CLOCKS
==============NVSMI LOG==============
Timestamp                           : Mon Mar 10 17:38:22 2014
Driver Version                      : 331.20


Attached GPUs                       : 2
GPU 0000:05:00.0
    Supported Clocks
        Memory                      : 3004 MHz
            Graphics                : 875 MHz
            Graphics                : 810 MHz
            Graphics                : 745 MHz
            Graphics                : 666 MHz
        Memory                      : 324 MHz
            Graphics                : 324 MHz


GPU 0000:42:00.0
    Supported Clocks
        Memory                      : 3004 MHz
            Graphics                : 875 MHz
            Graphics                : 810 MHz
            Graphics                : 745 MHz
            Graphics                : 666 MHz
        Memory                      : 324 MHz
            Graphics                : 324 MHz
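
If the supported clocks need to be checked from a script, the listing above is regular enough to parse. The following is a rough sketch, assuming the output keeps the "Memory : N MHz" / "Graphics : N MHz" layout shown here; the function name parse_supported_clocks and the SAMPLE string are made up for illustration, and the parser handles one GPU's section at a time.

```python
import re

# One GPU's section, copied from the nvidia-smi output above.
SAMPLE = """\
GPU 0000:05:00.0
    Supported Clocks
        Memory                      : 3004 MHz
            Graphics                : 875 MHz
            Graphics                : 810 MHz
            Graphics                : 745 MHz
            Graphics                : 666 MHz
        Memory                      : 324 MHz
            Graphics                : 324 MHz
"""

CLOCK_LINE = re.compile(r"^\s*(Memory|Graphics)\s*:\s*(\d+)\s*MHz\s*$")


def parse_supported_clocks(text):
    """Return {memory_mhz: [graphics_mhz, ...]} for one GPU's section."""
    clocks = {}
    current_mem = None
    for line in text.splitlines():
        match = CLOCK_LINE.match(line)
        if not match:
            continue  # skip headers such as "Supported Clocks"
        kind, mhz = match.group(1), int(match.group(2))
        if kind == "Memory":
            current_mem = mhz
            clocks.setdefault(mhz, [])
        elif current_mem is not None:
            clocks[current_mem].append(mhz)
    return clocks


clocks = parse_supported_clocks(SAMPLE)
print(clocks)
print(875 in clocks[3004], 888 in clocks[3004])
```

A check like the last line can catch an unsupported pair before it is passed to nvidia-smi -ac.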

To set the clock speed to 666MHz, run

[srahim@corsair3 ~]$ sudo nvidia-smi -ac 3004,666
Applications clocks set to "(MEM 3004, SM 666)" for GPU 0000:05:00.0
All done.

If you try to set an unsupported clock speed, nvidia-smi shows a helpful message.

[srahim@corsair3 ~]$ sudo nvidia-smi -ac 3004,888
Specified clock combination "(MEM 3004, SM 888)" is not supported for GPU 0000:05:00.0. Run 'nvidia-smi -q -d SUPPORTED_CLOCKS' to see list of supported clock combinations
Terminating early due to previous errors.

The current clock speed is checked with the following command:

[srahim@corsair3 ~]$ nvidia-smi -q -d CLOCK
==============NVSMI LOG==============


Timestamp                           : Tue Mar 11 10:40:16 2014
Driver Version                      : 331.20


Attached GPUs                       : 2
GPU 0000:05:00.0
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 324 MHz
    Applications Clocks
        Graphics                    : 666 MHz
        Memory                      : 3004 MHz
    Default Applications Clocks
        Graphics                    : 745 MHz
        Memory                      : 3004 MHz
    Max Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 3004 MHz

The GPU Boost settings do not persist across reboots or driver unloads, so they should be scripted if persistence is desired. Unless NVIDIA driver persistence mode is enabled with nvidia-smi -pm 1, the driver may unload when the GPU is idle.
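
One way to script this is a small wrapper that is run at boot. The sketch below is an assumption about how such a script might look, not a recommended deployment: the function name boost_commands and the --apply flag are invented for this example, and the 3004,875 pair is simply the maximum from the K40 listing. By default it only prints the commands; setting clocks normally requires root.

```python
import subprocess
import sys


def boost_commands(mem_mhz, sm_mhz):
    """Build the nvidia-smi calls: persistence mode first, then clocks."""
    return [
        ["nvidia-smi", "-pm", "1"],
        ["nvidia-smi", "-ac", f"{mem_mhz},{sm_mhz}"],
    ]


def main(argv):
    # Dry run unless --apply (a flag made up for this sketch) is given.
    apply = "--apply" in argv
    for cmd in boost_commands(3004, 875):
        print(" ".join(cmd))
        if apply:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Hooking the script into the system's startup mechanism is left out here, since that depends on the distribution.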

Unlike Intel’s Turbo Boost, GPU Boost is not on by default. This puts the onus on the end user or system administrator to take advantage of the feature. An application can change the Boost clock programmatically with NVML if it runs with appropriate permissions. Run nvidia-smi -acp 0 to grant non-root users permission to change clocks. Two caveats on the use of GPU Boost, from the document “NVIDIA GPU Boost for Tesla”, are:

  1. An important point to remember is that no matter which clocks the end user selects, if at any time the power monitoring algorithm detects that the application may exceed the 235 W, the GPU comes down to a lower clock level as a precaution. Once the power falls below 235 W the GPU will raise its core clock to the selected clock. This happens automatically and the Tesla K40 does have a few clock levels below the base clock to handle any power digressions.
  2. If the workload runs on multiple GPUs and is sensitive to all GPUs running at the same clock, the user may need to try out which particular clock works best for all GPUs.
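
As a sketch of the NVML route mentioned above, the function below picks the highest supported clock pair and applies it. It assumes the pynvml Python bindings for NVML are installed; the function name set_max_application_clocks is made up, and the NVML module is passed in as a parameter so the logic can be exercised without a GPU. The call fails unless the process has permission to change clocks.

```python
def set_max_application_clocks(nvml, gpu_index=0):
    """Set one GPU's application clocks to its highest supported pair.

    `nvml` is expected to be the pynvml module (or an object exposing
    the same functions). Returns the (memory_mhz, sm_mhz) pair applied.
    """
    nvml.nvmlInit()
    try:
        handle = nvml.nvmlDeviceGetHandleByIndex(gpu_index)
        # Highest memory clock, then the highest SM clock paired with it.
        mem_mhz = max(nvml.nvmlDeviceGetSupportedMemoryClocks(handle))
        sm_mhz = max(nvml.nvmlDeviceGetSupportedGraphicsClocks(handle, mem_mhz))
        nvml.nvmlDeviceSetApplicationsClocks(handle, mem_mhz, sm_mhz)
        return mem_mhz, sm_mhz
    finally:
        nvml.nvmlShutdown()
```

On a machine with a K40, usage would look like "import pynvml" followed by "set_max_application_clocks(pynvml)". For multi-GPU runs that need matching clocks, the same calls could be repeated per device with one agreed pair instead of each device's maximum.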

If you are not running on multiple GPUs or multiple nodes, you should crank up the GPU clock speed to the maximum frequency. Keep the GPU clock at the default value of 745MHz or lower only if there are power consumption concerns.

GPU Boost Benchmarks

The effect of GPU Boost is demonstrated with two software solvers, FDTD and RTM. Both algorithms are memory bandwidth intensive.

  1. Reverse Time Migration (RTM) is a depth migration algorithm used to image complex geologies. RTM is based on a two-way acoustic wave equation. Two wave propagation algorithms are used for benchmark purposes: Isotropic and TTI (Tilted Transverse Isotropy). Only the core algorithm was simulated for the benchmarks. See acceleware.com/rtm and acceleware.com/seismic-forward-modeling for more information.
  2. Finite-difference time-domain (FDTD) is a numerical technique used to approximate electromagnetic wave propagation in the time domain. Maxwell’s equations describing EM wave propagation are discretized using the central difference approximation to derive the FDTD time-stepping equations. Acceleware’s FDTD library provides customers with an interface to take advantage of the latest GPU acceleration. Please see acceleware.com/fdtd-solvers for more details. The benchmark was run on a simple resonator without complex materials.

GPU Boost Results

As shown in the graph above, the performance of the RTM and FDTD benchmarks is normalized to the GPU base clock rate of 745 MHz. A linear reference line on the chart shows how GPU performance would scale if the Stream Processor clock speed were the only factor determining performance. The theoretical linear speedup from GPU Boost at 875 MHz over the base clock rate of 745 MHz is 875/745 ≈ 1.174 times. At the maximum Boost clock of 875 MHz, FDTD runs 1.142 times faster, RTM TTI 1.185 times faster, and RTM Isotropic 1.114 times faster. The performance gains of FDTD and RTM stay very close to the linear reference line. A little surprisingly, at certain points the performance is higher than expected, as with RTM TTI at 875 MHz.
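
To make the comparison with the linear reference concrete, the short sketch below recomputes the ideal speedup and expresses each measured result as a share of the ideal gain. The "share of ideal gain" metric and the names linear_speedup and share_of_ideal_gain are introduced here for illustration only; the measured values are the 875 MHz figures quoted above.

```python
BASE_MHZ = 745  # K40 base Stream Processor clock


def linear_speedup(clock_mhz, base_mhz=BASE_MHZ):
    """Speedup expected if performance scaled purely with clock speed."""
    return clock_mhz / base_mhz


def share_of_ideal_gain(measured, ideal):
    """Measured gain over baseline as a fraction of the ideal gain."""
    return (measured - 1.0) / (ideal - 1.0)


# Measured speedups at 875 MHz, taken from the text.
measured = {"FDTD": 1.142, "RTM TTI": 1.185, "RTM Isotropic": 1.114}

ideal = linear_speedup(875)
print(f"ideal speedup at 875 MHz: {ideal:.4f}")
for name, speedup in measured.items():
    share = share_of_ideal_gain(speedup, ideal)
    print(f"{name}: {speedup:.3f}x measured, {share:.0%} of the ideal gain")
```

A share above 100% corresponds to a point that sits above the linear reference line.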

Conclusion

GPU Boost is a relatively simple way to increase performance: the benchmarks above gained between 11% and 18.5% at the maximum clock. It is certainly easier than optimizing a CUDA kernel. Unless you are doing some crazy multi-node MPI application, always boost your clock speed for higher performance!

References

NVIDIA GPU Boost for Tesla (pdf)

Deploying Clusters with NVIDIA Tesla GPUs, Dale Southard, NVIDIA (pdf)

Comments

Actually, as part of my talk at GTC (S4453) I will show and discuss the use of GPU Boost for memory-bandwidth-bound applications (Lattice QCD).

Your plot is astonishingly similar to the one I did, although I discussed the relative time needed to run at the different frequencies.

Just curious: Did you have ECC turned on or off?

See you at GTC 2014!

Mathias

Hi Mathias,

The results are with ECC off. This plot is a subset of our internal benchmarking. Runs with ECC turned on show very similar results. My opinion is that GPU Boost should be turned on by default, and users who require synchronous operation can deactivate it when needed. Your results only further the case. I suspect the power implications are why GPU Boost is deactivated by default. Running the few K40s Acceleware has at full power does not have the same impact as running a thousand-node cluster at full GPU power.

Saad