GPU Overview
GPU resource requests are handled slightly differently from what is described in the Projects section. If you need to request GPUs, the first step is to open a ticket to the GPU Platform Consultancy functional element. The consultants will help you decide which of the services best suits your needs.
Services
OpenStack Project with GPU Flavors
This option is identical to the one described in the Projects section, except that GPU flavors will be assigned to your project. You can then launch instances with GPUs. The available flavors are:
Flavor Name | GPU | RAM | vCPUs | Disk | Ephemeral | Comments |
---|---|---|---|---|---|---|
g1.xlarge | V100 | 16 GB | 4 | 56 GB | 96 GB | - |
g1.4xlarge | V100 (4x) | 64 GB | 16 | 80 GB | 528 GB | - |
g2.xlarge | T4 | 16 GB | 4 | 64 GB | 192 GB | - |
g2.5xlarge | T4 | 168 GB | 28 | 160 GB | 1200 GB | - |
g3.xlarge | V100S | 16 GB | 4 | 64 GB | 192 GB | - |
g3.4xlarge | V100S (4x) | 64 GB | 16 | 128 GB | 896 GB | - |
g4.p1.40g | A100 (1x) | 120 GB | 16 | 600 GB | - | AMD CPUs |
g4.p2.40g | A100 (2x) | 240 GB | 32 | 1200 GB | - | AMD CPUs |
g4.p4.40g | A100 (4x) | 480 GB | 64 | 2400 GB | - | AMD CPUs |
vg1.xlarge | T4 (vGPU) | 16 GB | 4 | 64 GB | 192 GB | Specific configuration here |
Note: Adequate GPU drivers have to be installed (detailed here).
Note: Baremetal nodes with GPUs are also possible in certain cases, please open a ticket for these requests.
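Once a GPU flavor has been assigned to your project, launching an instance works like any other flavor. A minimal sketch with the OpenStack CLI (the image and key names below are placeholders, not actual values):

```shell
# Launch a VM using one of the assigned GPU flavors.
openstack server create \
    --flavor g2.xlarge \
    --image "CS9 - x86_64" \
    --key-name mykey \
    my-gpu-instance

# Confirm which flavor the instance was created with.
openstack server show my-gpu-instance -c flavor
```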
Policies
GPUs are a scarce and expensive resource, and the PCI-passthrough model limits IT's ability to monitor their (efficient) usage, which therefore needs to be done on the guests. GPU resources can be allocated for testing periods of up to 4 months, after which the resources are reclaimed and a usage report is expected. Longer loan times are possible but require a justification and management approval.
Container Service Clusters
After having GPU resources allocated to your OpenStack project, you can deploy clusters with GPUs by setting a label (explained here).
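As a sketch, cluster creation with a GPU flavor and a label could look like the following; the template and label names here are placeholders, so take the actual values from the linked container service documentation:

```shell
# Create a cluster whose nodes use a GPU flavor; the label name below
# is illustrative, not the actual one -- see the linked documentation.
openstack coe cluster create my-gpu-cluster \
    --cluster-template kubernetes \
    --flavor g2.xlarge \
    --labels nvidia_gpu_enabled=true
```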
Batch Service GPU jobs
The Batch service at CERN already allows the submission of GPU jobs (examples here). Batch supports not only jobs in the typical batch-system form, but also Docker, Singularity and interactive jobs (including running GUI applications).
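CERN Batch runs on HTCondor, where a GPU is requested in the submit file. A minimal sketch (the executable name is illustrative):

```
# Minimal HTCondor submit file requesting one GPU
executable   = train.sh
request_gpus = 1
request_cpus = 4
output       = job.out
error        = job.err
log          = job.log
queue
```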
GitLab (Continuous Integration)
A number of shared runners in CERN GitLab offer GPUs.
Check here for configuration information and examples.
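A CI job targeting a GPU-enabled shared runner is selected via runner tags. A sketch of a `.gitlab-ci.yml` job; the tag name is a placeholder, so take the actual tag from the linked configuration page:

```yaml
gpu-test:
  tags:
    - gpu          # placeholder tag; use the one from the runner docs
  script:
    - nvidia-smi   # confirms the job actually sees a GPU
```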
lxplus
The lxplus service offers lxplus-gpu.cern.ch for shared GPU instances - with limited isolation and performance.
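A quick way to see what you get on the shared nodes (your CERN username is assumed):

```shell
# Log in to a shared GPU node and inspect the GPU you share with other users.
ssh <username>@lxplus-gpu.cern.ch
nvidia-smi
```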
VM Configuration
When using GPUs directly in virtual machines you need to handle driver installation and configuration.
Driver Installation
Note: Virtual GPU driver installation is different (see here).
To install NVIDIA drivers, open the CUDA Toolkit Downloads page and select the options matching your system. As installer type, we recommend choosing the 'network' option. Once all options are selected, a box with succinct installation instructions will be shown.
As a rule of thumb, you can verify that the drivers have been correctly installed if you can successfully run 'nvidia-smi' in a terminal (Linux) or if you see the GPU model you have assigned in the device manager, under display adapters (Windows).
For more detailed instructions, such as pre- and post-installation actions, see the Installation Guide for Linux or the Installation Guide for Microsoft Windows.
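On Linux, the verification described above boils down to two commands:

```shell
# Should list the GPU model, driver version and supported CUDA version.
nvidia-smi

# The NVIDIA kernel modules should also be loaded.
lsmod | grep nvidia
```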
Troubleshooting
Drivers do not find the GPU
Depending on the GPU used, GPU drivers on CS8 and CS9 guests (and possibly others) sometimes fail to initialise the GPU due to PCI bus address space issues. CC7 guests usually work fine. The root cause of this is under investigation. There are two known workarounds:
- Boot the guest in BIOS mode instead of UEFI. See the documentation for how this can be changed.
- Add the kernel boot options `pci=realloc pci=nocrs` at boot time.
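On CS8/CS9, one way to make these boot options persistent is via `grubby` (a sketch; a reboot is required afterwards):

```shell
# Append the workaround options to all installed kernels.
sudo grubby --update-kernel=ALL --args="pci=realloc pci=nocrs"

# Verify the current kernel's arguments, then reboot.
sudo grubby --info=ALL | grep args
```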
Virtual GPUs
Note: Running Windows is not supported with the current license.
For the vGPUs to operate at full capacity, licensing is required. For CC7, CS8 and CS9 we offer an RPM package which installs the required software and takes care of obtaining a license lease. It should also work on Red Hat Enterprise Linux.
For puppet managed machines, simply include
include gpu
For other operating systems or non-centrally managed machines please get in touch with the cloud team by opening a support call.
Installing CUDA Toolkit in a vGPU VM
The first step in installing the CUDA Toolkit is to check in this table which is the latest CUDA version compatible with vGPU (currently deployed vGPU software release: 13.0). Then, from the downloads archive, you can find the corresponding CUDA Toolkit download link.
From the downloads page, pick the runfile installer after selecting your target OS. Not using the runfile can result in deploying an unsupported version of CUDA and overriding the vGPU driver.
During the interactive installation of the runfile, it is important to deselect the driver install option. Alternatively, you can run the installer non-interactively with the following flags:
$ sudo <CudaInstaller>.run --silent --toolkit --samples
Please check the detailed installation steps as there are relevant pre- and post-installation actions (such as installing g++ and altering the PATH environment variable).
To uninstall this runfile type of installation, simply run:
$ cuda-uninstaller
GPU accelerated Docker containers
The NVIDIA Container Toolkit is required to run GPU accelerated Docker containers (not required for Kubernetes). Installation instructions are available here.
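Once the toolkit is installed, a quick smoke test is to run `nvidia-smi` inside a CUDA container (the image tag below is illustrative; pick one compatible with your driver from the nvidia/cuda registry):

```shell
# Expose all host GPUs to the container and print their status.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```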
Additional resources:
- NVIDIA Driver Downloads (all products and OS)
- Tesla Driver release notes
- Tesla NVIDIA Driver Installation Quickstart Guide (Linux)
- CUDA Quick Start Guide (Linux and Windows)