Acceleware Blog

EAGE 2015 - Madrid, Spain

Exhibiting at EAGE 2015 in Madrid



The annual European Association of Geoscientists & Engineers conference and exhibition took place in beautiful Madrid, Spain from June 1-4. In its 77th year, the show drew over 6,500 delegates, and Acceleware was there to take it all in.

This year, we held dynamic talks at the Acceleware booth on our latest developments, focusing on three of our products: AxFWI, AxWave, and AxRTM.


AxFWI - A Revolutionary Modular FWI Platform


AxFWI is a revolutionary modular FWI platform that enables users to accelerate their research by integrating their own algorithms and code with a highly optimized RTM engine. The easy-to-use interface gives the user the control and flexibility required to run many different scenarios while still benefiting from a platform engineered for maximum performance.


Opt-In L1 Caching of Global Loads on Some Kepler/Maxwell GPUs


CUDA developers generally strive for coalesced global memory accesses and/or explicit ‘caching’ of global data in shared memory.  However, sometimes algorithms have memory access patterns that cannot be coalesced, and that are not a good fit for shared memory.  Fermi GPUs have an automatic L1 cache on each streaming multiprocessor (SM) that can be beneficial for these problematic global memory access patterns.  First-generation Kepler GPUs have an automatic L1 cache on each SM, but it only caches local memory accesses.  In these GPUs, the lack of automatic L1 cache for global memory is partially offset by the introduction of a separate 48 KB read-only (née texture) cache per SM.

Opt-In L1 Caching on Kepler GPUs

NVIDIA quietly re-enabled L1 caching of global memory on GPUs based on the GK110B, GK20A, and GK210 chips.  The Tesla K40 (GK110B), Tesla K80 (GK210) and Tegra K1 (GK20A) all support this feature.  You can programmatically query whether a GPU supports caching global memory operations by calling cudaGetDeviceProperties and examining the globalL1CacheSupported property.  Examining the Compute Capability alone is not sufficient; the Tesla K20/K20X and the Tesla K40 all have Compute Capability 3.5, but only the K40 supports caching global memory in L1.
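As a rough illustration of that query, the host-side sketch below loops over the visible devices and reads globalL1CacheSupported from the structure filled in by cudaGetDeviceProperties. The helper names supportsGlobalL1 and reportGlobalL1Support are hypothetical and exist only for this example; treat it as a minimal sketch rather than a complete utility.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): true when the device reports
// that global memory loads can be cached in L1.
bool supportsGlobalL1(const cudaDeviceProp& prop) {
    return prop.globalL1CacheSupported != 0;
}

// Hypothetical helper: prints one line per visible device and returns the
// number of devices that report support, or -1 on a runtime API error.
int reportGlobalL1Support() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }

    int supported = 0;
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties(%d) failed: %s\n",
                         dev, cudaGetErrorString(err));
            return -1;
        }

        // The compute capability is printed for context only; the decision
        // is based on the property itself, not on major/minor.
        const bool ok = supportsGlobalL1(prop);
        std::printf("Device %d (%s, compute capability %d.%d): "
                    "global L1 caching %s\n",
                    dev, prop.name, prop.major, prop.minor,
                    ok ? "supported" : "not supported");
        if (ok) {
            ++supported;
        }
    }
    return supported;
}
```

Calling reportGlobalL1Support() from main and compiling with nvcc should be enough to try it. The check deliberately ignores prop.major and prop.minor, since two devices with the same Compute Capability can report different values for this property.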

Webinar: Essential CUDA Optimization Techniques

Join Chris Mason, Product Manager at Acceleware, and learn how to optimize your algorithms for NVIDIA GPUs. This informative webinar provides an overview of the improved performance analysis tools available in CUDA 6.0 and key optimization strategies for compute-, latency- and memory-bound problems. The webinar includes techniques for ensuring peak utilization of CUDA cores by choosing the optimal block size. For compute-bound algorithms, Chris discusses how to improve branching efficiency and how to use intrinsic functions and loop unrolling. For memory-bound algorithms, optimal access patterns for global and shared memory are presented, including a comparison between the Fermi and Kepler architectures.

Webinar Recording: An Introduction to OpenCL using AMD GPUs

Join Chris Mason, Product Manager at Acceleware, for an informative introduction to GPU Programming. The tutorial begins with a brief overview of OpenCL and data-parallelism before focusing on the GPU programming model. We also explore the fundamentals of GPU kernels, host and device responsibilities, OpenCL syntax and work-item hierarchy.

Webinar Recording: Asynchronous Operations & Dynamic Parallelism in CUDA

Join Chris Mason, Product Manager at Acceleware, as he leads attendees in a deep dive into asynchronous operations and how to maximize throughput on both the CPU and GPU with streams. Chris demonstrates how to build a CPU/GPU pipeline and how to design your algorithm to take advantage of asynchronous operations. The second part of the webinar focuses on dynamic parallelism.

Technical Paper: Modeling of Electromagnetic Assisted Oil Recovery

Presented at ICEAA - IEEE APWC 2014, this paper features an algorithm for the rigorous analysis of electromagnetic (EM) heating of heavy oil reservoirs. The algorithm combines an FDTD-based EM solver with a reservoir simulator. The paper addresses some of the challenges of integrating advanced electromagnetic codes with reservoir simulators and the numerical technology that needs to be developed, particularly as it relates to the multi-physics, the translation of meshes, and the petro-physical and EM material parameters. The challenges of calculating the electromagnetic dissipation in the vicinity of the antennas are also discussed. Example scenarios are presented and discussed.

Webinar Recording: GPU Architecture & the CUDA Memory Model

Join Chris Mason, Product Manager at Acceleware, and explore the memory model of the GPU! The webinar will begin with an essential overview of the GPU architecture and thread cooperation before focusing on the different memory types available on the GPU. Chris will define shared, constant and global memory and discuss the best locations to store your application data for optimized performance. Features available in the Kepler architecture such as shared memory configurations and Read-Only Data Cache are introduced and optimization techniques discussed.

State of GPU Virtualization for CUDA Applications 2014


Widespread corporate adoption of virtualization technologies has led some users to rely on Virtual Machines (VMs). When these users or IT administrators wish to start using CUDA, often the first thought is to spin up a new VM. Success is not guaranteed, as not all virtualization technologies support CUDA. A survey of GPU virtualization technologies for running CUDA applications is presented. To support CUDA, a VM must present a supported CUDA device to its operating system, and that operating system must be able to install the NVIDIA graphics driver.

GPU Virtualization Terms 

  • Device Pass-Through: This is the simplest virtualization model where the entire GPU is presented to the VM as if directly connected. The virtual GPU is usable by only one VM. The CPU equivalent is assigning a single core for exclusive use by a VM. VMware calls this mode virtual Direct Graphics Accelerator (vDGA).
  • Partitioning: A GPU is split into virtual GPUs, each of which is used independently by a VM.
  • Timesharing: Timesharing involves sharing the GPU, or a portion of it, among multiple VMs. It is also known as oversubscription or multiplexing. The technology for timesharing CPUs is mature, while GPU timesharing is being introduced.
  • Live Migration: The ability to move a running VM from one VM host to another without downtime.

Virtualization Support for CUDA 

We examined CUDA support from five virtualization technology vendors that together account for most of the virtualization market: VMware, Microsoft, Oracle, Citrix and Red Hat. A summary is shown in the table below.

New whitepaper: OpenCL on FPGAs for GPU Programmers

In 2012, Altera announced its commitment to developing an SDK that would enable developers to program Altera field-programmable gate arrays (FPGAs) with Open Computing Language (OpenCL). This whitepaper introduces developers who have previous experience with general-purpose computing on graphics processing units (GPUs) to parallel programming targeting Altera FPGAs via the OpenCL framework.

This paper provides a brief overview of OpenCL, highlights some of the underlying technology and benefits behind Altera FPGAs, then focuses on how OpenCL kernels execute on Altera FPGAs compared with GPUs. This paper also presents the key differences in optimization techniques for targeting FPGAs.

Click here to download the whitepaper.


