What Are GPUs for Machine Learning?

Selecting the right hardware to train and deploy a model is one of the most crucial aspects of machine learning (ML) optimization. With the right hardware, you can optimally position an ML pipeline, balancing cost and performance. 

In the ever-expanding landscape of accelerators for ML, GPUs are the technology that most consistently powers state-of-the-art ML research and many real-world ML applications. As such, they are a fundamental topic for ML practitioners to learn about.

In this post, we provide a high-level overview of what a GPU is, how it works, how it compares with other typical hardware for ML, and how to select the best GPU for your application.

What Is a GPU?

Graphics processing units, or GPUs, are specialized processing units originally designed to accelerate graphics rendering for gaming. Thanks to their ability to efficiently parallelize massive computational workloads, GPUs have been successfully applied well beyond their original remit. With the rise of big data and large deep learning models, it comes as no surprise that ML applications are among the most successful.

While GPU families are typically tuned to specific applications to maximize performance, the core physical elements of a GPU are the same across the board. Much like a motherboard, a GPU card is a printed circuit board that hosts a processor for computation and a BIOS for settings storage and diagnostics. In terms of memory, you can differentiate between integrated GPUs, which sit on the same die as the CPU and use system RAM, and dedicated GPUs, which are separate from the CPU and have their own VRAM. Machine learning workloads run best on dedicated GPUs.

How Do GPUs Parallelize?

To support parallelism, GPUs use a Single Instruction, Multiple Data (SIMD) architecture, which allows you to apply the same operation to many data elements at once. It is easy to see how this maps onto ML training and batch inference, both of which run on batched data.

To further extend distribution, you can also move from a single-GPU setup to a multi-GPU setup. In a multi-GPU setup for ML, there are generally two ways to distribute computation:

  • Data parallelism: The same model is replicated across multiple GPUs, and each replica processes a different batch of data.
  • Model parallelism: Different components of the same model are split across multiple GPUs, which process a single batch of data together.

There are different strategies for orchestrating and merging results. It’s most important to define whether the process should run synchronously (i.e., wait for each replica to complete each step) or asynchronously.
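The synchronous, data-parallel case can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the "replicas" are ordinary function calls rather than real GPUs, the model is a single weight, and the helper names (gradient, split, data_parallel_step) are hypothetical.

```python
# Toy illustration of synchronous data parallelism in pure Python.
# Assumption: "replicas" are plain function calls, not real GPUs.

def gradient(w, batch):
    """Gradient of the mean squared error for the one-weight model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def split(batch, n_replicas):
    """Split a batch into equally sized shards, one per replica.
    Assumes len(batch) is divisible by n_replicas."""
    size = len(batch) // n_replicas
    return [batch[i * size:(i + 1) * size] for i in range(n_replicas)]

def data_parallel_step(w, batch, n_replicas, lr=0.01):
    # Every replica holds the same weight and sees a different shard.
    shard_grads = [gradient(w, shard) for shard in split(batch, n_replicas)]
    # Synchronous merge: wait for all replicas, then average their gradients.
    avg_grad = sum(shard_grads) / n_replicas
    return w - lr * avg_grad

data = [(x, 3.0 * x) for x in range(1, 9)]  # eight samples of y = 3x
w_single = data_parallel_step(0.0, data, n_replicas=1)
w_multi = data_parallel_step(0.0, data, n_replicas=4)
print(w_single, w_multi)
```

With equally sized shards, averaging the per-replica gradients gives the same update as one pass over the full batch, which is why the two results agree. An asynchronous variant would apply each replica's gradient as soon as it arrives instead of waiting for all of them.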

Contrary to intuition, moving to a more powerful GPU or a multi-GPU setup does not necessarily mean better performance. Bigger is beneficial only when the requested computation has saturated the current GPU setup.

Monitoring GPU utilization, memory access and usage, power usage, and temperatures should be a top priority. It is also important to define the desired time to solution, so that you can find the right balance between cost and performance for your pipeline.
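As one possible starting point for such monitoring on NVIDIA hardware, the sketch below parses the CSV output of the nvidia-smi command-line tool. The function names and the sample values are hypothetical; the sample string stands in for real output so that the parsing logic can be tried on a machine without a GPU.

```python
import subprocess

# Query fields passed to `nvidia-smi --query-gpu`; values come back in this order.
FIELDS = ["utilization.gpu", "memory.used", "memory.total",
          "power.draw", "temperature.gpu"]

def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU.
    Assumes every field is numeric; some GPUs report "[N/A]" for power draw."""
    stats = []
    for line in csv_text.strip().splitlines():
        values = [float(v) for v in line.split(",")]
        stats.append(dict(zip(FIELDS, values)))
    return stats

def query_gpu_stats():
    """Run nvidia-smi (requires an NVIDIA driver) and parse its output."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    return parse_gpu_stats(subprocess.check_output(cmd, text=True))

# Hypothetical sample output for two GPUs (made-up numbers).
sample = "87, 10240, 16384, 210.5, 71\n12, 2048, 16384, 65.0, 45\n"
for gpu in parse_gpu_stats(sample):
    mem_pct = 100 * gpu["memory.used"] / gpu["memory.total"]
    print(f"util={gpu['utilization.gpu']:.0f}% "
          f"mem={mem_pct:.1f}% temp={gpu['temperature.gpu']:.0f}C")
```

On a machine with an NVIDIA driver installed, calling query_gpu_stats() periodically and logging the result is a simple way to see whether a GPU is actually saturated before scaling up.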

GPUs vs. Other Accelerators

Selecting the right accelerator for ML is fundamental if you want to: 

  • Reduce development time, since both compute resources and data scientists' time are tied up for shorter periods;
  • Achieve faster inference, which is necessary to meet SLAs and provide the desired customer experience.

GPUs vs. CPUs

GPUs are highly specialized hardware. As such, pairing them with CPUs is required for capabilities such as accessing peripherals or system RAM, and is highly recommended for single-threaded processes. You can use both hardware solutions jointly or independently for machine learning, with expected performance depending on data and model requirements.

GPUs are generally the best choice for deep learning training and big data, so much so that some of these tasks would be impractical to run without them. These tasks are characterized by large batch sizes, which suit GPUs because I/O operations are a bottleneck: moving data from system memory to a GPU takes longer than feeding it to a CPU, so large batches amortize that transfer cost. In addition, GPUs can scale to much higher memory bandwidth than CPUs can.
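The effect of batch size on GPU efficiency can be illustrated with a deliberately simplified cost model. Both constants below are made-up numbers chosen for illustration, not measurements, and the throughput function is a hypothetical helper.

```python
# Toy cost model: each batch pays a fixed host-to-GPU transfer/launch overhead
# plus a small per-sample compute cost. Both numbers are hypothetical.
OVERHEAD_MS = 5.0      # assumed fixed cost per batch, in milliseconds
PER_SAMPLE_MS = 0.01   # assumed compute cost per sample, in milliseconds

def throughput(batch_size):
    """Samples processed per second under the toy model."""
    batch_time_ms = OVERHEAD_MS + PER_SAMPLE_MS * batch_size
    return 1000.0 * batch_size / batch_time_ms

for bs in (1, 32, 1024):
    print(f"batch_size={bs:5d} -> {throughput(bs):10.0f} samples/s")
```

Under these assumptions the fixed overhead dominates at small batch sizes, so throughput grows quickly as the batch gets larger and then levels off once compute time dominates.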

A notable benefit of CPUs is that they are fully programmable with custom layers, architectures, and operations. On the other hand, GPUs are not as flexible as CPUs and require a high degree of knowledge to customize. 

As a rule of thumb, GPUs are expected to outperform CPUs on all parallelizable tasks, with the caveat of small batch sizes. However, they are also more expensive. When choosing between GPUs and CPUs beyond deep learning and big data, it is recommended to experiment with both to find the desired balance between cost and performance.

GPUs vs. ML-Specific Accelerators

In the past five years, various accelerators that are highly specialized for specific ML applications, such as Google’s TPUs, have appeared on the market. GPU designers have also been adapting their architectures to ML applications, with notable advancements in sparsity and numerical representation, as well as in how they map to the topology, which has unlocked the use of GPUs for online inference.

Considering that ML-specific accelerators are generally more expensive, too niche to cover the wide spectrum of ML applications, and do not provide large performance improvements, GPUs are still the most desirable choice.

How to Choose the Best GPU for Your Application

When selecting a GPU, both the hardware and the software that is built on top of it are important.

Starting from the hardware perspective, it is recommended to select the GPU with the highest memory bandwidth within your budget, while also maximizing for:

  • Number of cores: Determines how many operations the GPU can run in parallel;
  • RAM size: Determines how much data the GPU can hold at once;
  • Processing power: Determines the speed at which the GPU can compute on the data and perform tasks.
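To make the RAM-size criterion concrete, the sketch below gives a rough back-of-the-envelope estimate of the memory needed to train a model. The factor of four copies is an assumption (weights, gradients, and two Adam-style optimizer buffers), activations and framework overhead are ignored, and the function names are hypothetical.

```python
# Bytes needed to store one value in each numeric format.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2}

def training_memory_gib(n_params, dtype="fp32", copies=4):
    """Rough lower bound on training memory in GiB.
    copies=4 assumes weights + gradients + two optimizer buffers;
    activations and framework overhead are NOT included."""
    return n_params * BYTES_PER_VALUE[dtype] * copies / 1024**3

def fits(n_params, gpu_ram_gib, dtype="fp32"):
    """True if the rough estimate fits in the given GPU RAM."""
    return training_memory_gib(n_params, dtype) <= gpu_ram_gib

n = 1_000_000_000  # a hypothetical one-billion-parameter model
print(f"{training_memory_gib(n):.1f} GiB needed, fits in 16 GiB: {fits(n, 16)}")
```

Because activations are excluded, a model that only just fits by this estimate will probably not train in practice; the estimate is best read as a quick filter for ruling out GPUs that are clearly too small.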

Make sure to note what software is available for your top GPU picks, as well as the required utilization strategy (single GPU vs. multiple GPU). All main providers offer specialized families for AI and provide libraries and packages with relevant built-in ML functionalities. 

NVIDIA, with its CUDA and cuDNN libraries, offers what is widely considered the best interface to access and manipulate GPU resources. ML frameworks, such as TensorFlow and PyTorch, abstract most of its functionalities and complexities for both single-GPU and multi-GPU processing. If you need more customization, NVIDIA provides a high degree of flexibility and visibility, but you will still need expert domain knowledge.
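As a small illustration of how much the frameworks abstract away, the sketch below picks a device with PyTorch and falls back to the CPU when PyTorch or a CUDA-capable GPU is not available. The pick_device helper is hypothetical; torch.cuda.is_available() is the standard PyTorch check.

```python
def pick_device():
    """Return "cuda" when PyTorch can see an NVIDIA GPU, otherwise "cpu"."""
    try:
        import torch  # optional dependency; assumed installed for GPU use
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Running on: {device}")
# With PyTorch installed, a model and its input batches would then be
# moved to the chosen device with model.to(device) and batch.to(device).
```

Writing code against a device string like this keeps the same script usable on a laptop CPU and on a GPU server without changes.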

Lastly, you’ll need to choose where to host your GPUs. Rather than self-managing, it is highly recommended to go serverless with one of the big cloud providers or, to minimize overhead when getting started, try an ML-specialized provider like Iguazio. The Iguazio MLOps Platform includes GPU as a Service capabilities to help customers use their GPU investments efficiently, saving heavy compute costs, simplifying complex infrastructure, and improving performance.