Running Applications over GPUs

Overview

The platform supports accelerated code execution on NVIDIA graphics processing units (GPUs) by using the following frameworks:

Horovod

The platform has a default (pre-deployed), shared, single-instance, tenant-wide Kubeflow MPI Operator service (mpi-operator), which facilitates Uber's Horovod distributed deep-learning framework. Horovod, which is preinstalled as part of the platform's Jupyter Notebook service, is widely used to create machine-learning models that are trained simultaneously over multiple GPUs or CPUs.

You can use Horovod to convert a single-GPU TensorFlow, Keras, or PyTorch model-training program to a distributed multi-GPU program. The objective is to speed up your model training with minimal changes to your existing single-GPU code and without complicating the execution. Note that you can also run Horovod code over CPUs with only minor modifications. For an example of using Horovod on the platform, see the image-classification-with-distributed-training demo.

Note
  • To run Horovod code, ensure that the mpi-operator platform service is enabled. (This service is enabled by default.)
  • Horovod applications allocate GPUs dynamically from among the available GPUs in the system; they don't use the GPU resources of the parent Jupyter Notebook service. See also the Jupyter GPU resources note.
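The conversion steps that Horovod applies to a single-GPU training script follow a well-known pattern. The following is a minimal sketch for a Keras program (the model is an illustrative placeholder, and running the code requires an environment with Horovod, TensorFlow, and MPI, such as the platform's Jupyter Notebook service):

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf

# 1. Initialize Horovod.
hvd.init()

# 2. Pin each worker process to a single GPU (one process per GPU).
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Illustrative model -- replace with your own.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])

# 3. Scale the learning rate by the number of workers and wrap the
#    optimizer with Horovod's distributed optimizer.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

# 4. Broadcast the initial variable states from rank 0 so that all
#    workers start from the same weights; checkpoint only on rank 0.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint("checkpoint.h5"))

# Train as usual, passing the Horovod callbacks, for example:
# model.fit(x_train, y_train, callbacks=callbacks, epochs=5)
```

Equivalent wrappers exist for plain TensorFlow and PyTorch; the same four steps (initialize, pin a GPU per process, wrap the optimizer, broadcast initial state) apply in each case.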

RAPIDS

You can use NVIDIA's RAPIDS open-source libraries suite to execute end-to-end data science and analytics pipelines entirely on GPUs.

To use the cuDF and cuML RAPIDS libraries, you need to create a RAPIDS Conda environment. For example, you can run the following command from a Jupyter notebook or terminal to create a RAPIDS Conda environment named rapids:

conda create -n rapids -c rapidsai -c nvidia -c anaconda -c conda-forge -c defaults ipykernel rapids=0.17 python=3.7 cudatoolkit=11.0

For more information about using Conda to create Python virtual environments, see the platform's virtual-env.ipynb tutorial Jupyter notebook.
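After the environment is created, you typically activate it and register it as a Jupyter kernel so that it can be selected from the notebook UI. A sketch (the kernel name and display name are illustrative choices, not platform requirements):

```shell
conda activate rapids
python -m ipykernel install --user --name rapids --display-name "Python (RAPIDS)"
```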

For a comparison of performance benchmarks using the cuDF RAPIDS GPU DataFrame library and pandas DataFrames, see the gpu-cudf-vs-pd.ipynb tutorial notebook.
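Part of what makes such comparisons straightforward is that cuDF implements a pandas-like API, so the same DataFrame operations can run on either library. A minimal sketch, shown here with pandas (in a RAPIDS environment, building the equivalent `cudf.DataFrame` runs the same operations on the GPU):

```python
import pandas as pd  # with RAPIDS installed: import cudf and use cudf.DataFrame

# Build a small DataFrame and run a typical groupby aggregation.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})
totals = df.groupby("key")["val"].sum()
print(totals.to_dict())  # {'a': 4, 'b': 6}
```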

Note
  • RAPIDS supports GPUs with the NVIDIA Pascal architecture or better and compute capability 6.0+.

  • RAPIDS applications use the GPU resources of the parent Jupyter Notebook service. Therefore, you must configure at least one GPU resource for this service: from the dashboard Services page, select to edit your Jupyter Notebook service, select the Common Parameters tab, and set the Resources | GPU | Limit field to a value greater than zero. See also the Jupyter GPU resources note.

For more information about using RAPIDS to run applications over GPUs, see Ingesting and Preparing Data.

Jupyter GPU Resources Note

In environments with GPUs, you can use the common Resources | GPU | Limit parameter of the Jupyter Notebook service to guarantee the configured number of GPUs for each service replica. In addition, you can enable scale-to-zero for a Jupyter Notebook service, by checking the Enabled check box of the common Scale to zero parameter, to automatically free up resources, including GPUs, when the service becomes idle.

When configuring your Jupyter Notebook service, take the following into account:

  • While the Jupyter Notebook service is enabled and not scaled to zero, it monopolizes the configured number of GPUs even when the GPUs aren't in use.
  • RAPIDS applications use the GPUs that were allocated to the Jupyter Notebook service from which the code is executed, while Horovod applications allocate GPUs dynamically and don't use the GPUs of the parent Jupyter Notebook service. Therefore, on systems with limited GPU resources, you might need to reduce the number of GPUs that are allocated to the Jupyter Notebook service, or set it to zero, to successfully run Horovod code over GPUs.

See Also