GPU Overview
Note
GPU resource requests are handled slightly differently from what was described in the Projects section.
Services
Interactive GPU resources on lxplus
The lxplus service offers lxplus-gpu.cern.ch for shared GPU instances, with limited isolation and performance.
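A session on the shared GPU nodes can be opened with a standard SSH login, for example:

```shell
# Log in to the shared GPU nodes (replace <username> with your CERN account)
ssh <username>@lxplus-gpu.cern.ch
# Once logged in, nvidia-smi lists the GPUs available on the node
nvidia-smi
```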
Batch Service GPU jobs
The Batch service at CERN already supports the submission of GPU jobs (examples here). Batch allows you not only to submit jobs in the typical batch-system form, but also to use Docker, Singularity and interactive jobs (including running GUI applications).
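As a sketch, a minimal HTCondor submit file requesting a single GPU could look like the following; the executable and file names are placeholders, so adapt them to your job (see the linked examples for the authoritative versions):

```shell
# Hypothetical minimal submit file; executable and file names are placeholders
cat > gpu_job.sub <<'EOF'
executable   = my_gpu_program.sh
output       = job.out
error        = job.err
log          = job.log
request_gpus = 1
queue
EOF
condor_submit gpu_job.sub
```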
GitLab (Continuous Integration)
A number of shared runners in CERN GitLab offer GPUs.
Check here for configuration information and examples.
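As an illustration, a GPU-enabled CI job is selected via runner tags in `.gitlab-ci.yml`; the tag and image shown below are assumptions, so check the linked configuration page for the actual values:

```yaml
# Hypothetical .gitlab-ci.yml fragment; the runner tag and image tag are
# assumptions -- see the linked configuration page for the actual values.
gpu-test:
  tags:
    - gpu
  image: nvidia/cuda:12.4.1-base-ubuntu22.04
  script:
    - nvidia-smi
```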
Kubeflow (Machine learning)
The Kubeflow project also provides access to GPUs. It comes with a variety of additional features which are useful for machine learning use cases.
SWAN
For interactive work using notebooks, please take a look at the SWAN service. It provides access to a set of T4 GPUs.
Dedicated GPU resources
If and only if none of the above services matches your use case, you can ask for dedicated GPU resources. When making the request, please give a good justification for why none of these offerings matches your use case.
In case you need to request GPUs, the first step is to open a ticket to the GPU Platform Consultancy functional element. Based on your input, the consultants will help you decide which of the services best suits your needs.
OpenStack project with GPU flavors in pass-through mode
With this option the GPU flavors will be made available to your project. You can then launch instances with GPUs. The available flavors are:
Flavor Name | GPU | RAM | vCPUs | Disk | Ephemeral | Comments |
---|---|---|---|---|---|---|
g1.xlarge | V100 | 16 GB | 4 | 56 GB | 96 GB | [^1], deprecated |
g1.4xlarge | V100 (4x) | 64 GB | 16 | 80 GB | 528 GB | [^1] |
g2.xlarge | T4 | 16 GB | 4 | 64 GB | 192 GB | [^1], deprecated |
g2.5xlarge | T4 | 168 GB | 28 | 160 GB | 1200 GB | [^1] |
g3.xlarge | V100S | 16 GB | 4 | 64 GB | 192 GB | [^1] |
g3.4xlarge | V100S (4x) | 64 GB | 16 | 128 GB | 896 GB | [^1] |
g4.p1.40g | A100 (1x) | 120 GB | 16 | 600 GB | - | [^1], AMD CPUs |
g4.p2.40g | A100 (2x) | 240 GB | 32 | 1200 GB | - | [^1], AMD CPUs |
g4.p4.40g | A100 (4x) | 480 GB | 64 | 2400 GB | - | [^1], AMD CPUs |
Note: Baremetal nodes with GPUs are also possible in certain cases, please open a ticket for these requests.
[^1]: Adequate GPU drivers have to be installed (detailed here).
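Once a GPU flavor has been made available to your project, an instance can be launched with the standard OpenStack CLI; the image and key names below are placeholders, not values from this documentation:

```shell
# Launch a VM with one of the GPU flavors from the table above
# (image and key names are placeholders -- use your own)
openstack server create \
  --flavor g3.xlarge \
  --image "my-rhel9-image" \
  --key-name my-key \
  my-gpu-vm
```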
Policies
GPUs are a scarce and expensive resource. The PCI-passthrough deployment model limits IT's ability to monitor their (efficient) usage, so monitoring has to be done by the users directly on the guests. GPU resources can be allocated for testing periods of up to 4 months, after which the resources are reclaimed and a usage report is expected. Longer lease times are possible but require a justification and management approval.
Container Service Clusters
After having GPU resources allocated to your OpenStack project, you can deploy clusters with GPUs by setting a label (explained here).
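As a sketch, creating a cluster with GPU support might look like the following; the label key used here is an assumption, so check the linked documentation for the exact name:

```shell
# Hypothetical example: the label key below is an assumption,
# see the linked documentation for the actual label to set.
openstack coe cluster create my-gpu-cluster \
  --cluster-template kubernetes \
  --node-count 2 \
  --labels nvidia_gpu_enabled=true
```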
VM Configuration
When using GPUs directly in virtual machines you need to handle driver installation and configuration.
Driver Installation for GPU passthrough
To install NVIDIA drivers, open the CUDA Toolkit Downloads page and select the options matching your system. As the installer type, we recommend the 'network' option. Once all options are selected, a box with concise installation instructions will appear. CUDA RPM packages for RHEL and compatible operating systems are also available from here.
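On RHEL 9 (or compatible), the network installation typically boils down to adding NVIDIA's repository and installing the driver module; the exact commands shown on the download page take precedence over this sketch:

```shell
# Add the NVIDIA network repository for RHEL9 and install the driver module
# (prefer the exact commands shown on the CUDA download page)
sudo dnf config-manager --add-repo \
  https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf module install -y nvidia-driver:latest-dkms
```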
As a rule of thumb, you can verify that the drivers have been correctly installed if you can successfully run 'nvidia-smi' in a terminal (Linux) or if you see the GPU model you have assigned in the device manager, under display adapters (Windows).
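On Linux, the verification amounts to:

```shell
# Should list the attached GPU(s) if the driver is correctly installed
nvidia-smi
# A more targeted check of GPU model and driver version
nvidia-smi --query-gpu=name,driver_version --format=csv
```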
For more detailed instructions, such as pre- and post-installation actions, see the Installation Guide for Linux or the Installation Guide for Microsoft Windows.
GPU monitoring
As mentioned earlier, GPU monitoring has to be performed on the guests. In GPU passthrough mode, the device is passed on for exclusive use by the virtual machine, and IT loses access to it. Monitoring data is expected if the VM owner asks for a prolongation of the lease period. There are several options:
- The drivers ship with nvidia-smi, which, when run on the command line, gives statistics about GPU usage. The output can be parsed and stored in a convenient format.
- The collectd-cuda package does exactly this. It is to be used as a collectd plugin.
- A more complete solution is DCGM: this tool is provided and maintained by NVIDIA and has been open-sourced. You can find the latest version on GitHub. It comes with a number of backends, for example a plugin for collectd which can be used to send the data to the centralised monitoring infrastructure at CERN.
- For guests using the configuration management system, there are collectd puppet plugins for collectd-cuda as well as for DCGM.
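As an illustration of parsing nvidia-smi output, the query flags below are standard; the sample line is hardcoded here so the snippet also runs on a machine without a GPU:

```shell
# On a real host the sample line below would come from:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits
sample='35, 4096'
util=$(echo "$sample" | awk -F', ' '{print $1}')
mem=$(echo "$sample" | awk -F', ' '{print $2}')
echo "utilization=${util}% memory=${mem}MiB"
```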
Troubleshooting
OpenGL/Vulkan not working on a Windows guest in GPU passthrough mode
Running OpenGL or Vulkan graphics applications requires a license and specific, commercial drivers which you can get from IT. We have a few such licenses available. Please get in touch with IT.
Drivers do not find the GPU in PCI passthrough mode
Depending on the GPU used, GPU drivers on RHEL8 and RHEL9 guests (or equivalent) sometimes fail to initialise the GPU due to PCI bus address space issues. CC7 guests usually work fine. The root cause is under investigation. There are two known workarounds:
- Boot the guest in BIOS mode instead of UEFI. See the documentation on how this can be changed.
- Add the kernel boot options `pci=realloc pci=nocrs` at boot time.
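On RHEL-family guests, the kernel options can be added persistently with grubby:

```shell
# Persistently add the workaround options to all installed kernels
sudo grubby --update-kernel=ALL --args="pci=realloc pci=nocrs"
# Reboot for the change to take effect
sudo reboot
```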
GPU is not found on guests with a virtual GPU
Check that the GPU has been properly attached to the VM by running
If that is fine, make sure that the machine has been rebooted after the drivers were installed. If not, reboot.
The vGPU works but its performance makes it unusable
This typically means that the GPU is not licensed. You can check with
which should return something like this:
If nothing is returned, or if the license has expired, check that you have the license key installed in the right place (usually /etc/nvidia/ClientConfigToken) and that there are no other files in that folder. If this looks correct, please verify that /etc/nvidia/gridd.conf exists and has been configured correctly, specifically:
- ServerAddress=188.184.22.97
- BackupServerAddress=188.185.91.185
- ServerPort=7070
- BackupServerPort=7070
- FeatureType=4
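Assuming the standard nvidia-gridd service name, the license state and daemon health can be checked as follows:

```shell
# Query the current license status (reported by nvidia-smi)
nvidia-smi -q | grep -i 'license'
# Restart the licensing daemon after configuration changes
sudo systemctl restart nvidia-gridd
# Check that it is running and inspect its log for errors
systemctl status nvidia-gridd
journalctl -u nvidia-gridd --no-pager | tail
```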
If you have made any changes, restart gridd or reboot. Also check that gridd is actually running and did not report any errors.
GPU accelerated Docker containers
The NVIDIA Container Toolkit is required to run GPU accelerated Docker containers (not required for Kubernetes). Installation instructions are available here.
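With the toolkit installed, a quick end-to-end check runs nvidia-smi inside a container (the image tag here is just an example):

```shell
# Run nvidia-smi inside a CUDA base image, exposing all host GPUs
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```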
Additional resources:
- NVIDIA Driver Downloads (all products and OS)
- Tesla Driver release notes
- Tesla NVIDIA Driver Installation Quickstart Guide (Linux)
- CUDA Quick Start Guide (Linux and Windows)