Clusters in the CERN cloud container service have built-in support to detect and configure NVIDIA GPUs. Follow these instructions to request access and quota for GPU resources.
This is officially supported for Kubernetes clusters >= 1.18.x.
Clusters with GPU resources should have the appropriate GPU label set at creation time. That's it, the cluster deployment will handle the detection and configuration of the GPU nodes.
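As an illustration of the flow, the label is passed when the cluster is created, and GPU capacity then shows up on the nodes as a schedulable resource. The label key and template name below are placeholders, not necessarily the exact values used by the service:

```console
# Hypothetical label key and template name, shown only to illustrate the flow.
$ openstack coe cluster create gpu-cluster \
    --cluster-template kubernetes-1.18.x \
    --node-count 2 \
    --labels nvidia_gpu_enabled=true

# Once the nodes are up, the GPUs appear as a resource on each GPU node.
$ kubectl describe node <node-name> | grep nvidia.com/gpu
  nvidia.com/gpu: 1
```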
This is workload dependent, but in general the container image should have the required NVIDIA drivers available.
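For images that ship everything they need, requesting the GPU resource is enough. A minimal sketch, using the public nvidia/cuda image (the tag is illustrative); running nvidia-smi inside it is a quick way to confirm the device is visible:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base    # illustrative tag
    command: ["nvidia-smi"]         # prints the visible GPUs and exits
    resources:
      limits:
        nvidia.com/gpu: 1           # request a single GPU
```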
In case the image expects the drivers to be available outside the container (an example of this is the default tensorflow image), you can also bind mount the drivers that are installed on the cluster nodes under /opt/nvidia-driver. Example using a Pod with the tensorflow image:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu
spec:
  containers:
  - name: tf
    image: tensorflow/tensorflow:latest-gpu
    command: ["sleep", "inf"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: PATH
      value: "/bin:/usr/bin:/usr/local/bin:/opt/nvidia-driver/bin"
    - name: LD_LIBRARY_PATH
      value: "/opt/nvidia-driver/lib64"
    securityContext:
      privileged: true
    volumeMounts:
    - name: nvidia-driver
      mountPath: /opt/nvidia-driver
  volumes:
  - name: nvidia-driver
    hostPath:
      path: /opt/nvidia-driver
```
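Save the manifest and create the Pod (the file name here is arbitrary); the output should look roughly like:

```console
$ kubectl apply -f tf-gpu.yaml
pod/tf-gpu created
$ kubectl get pod tf-gpu
NAME     READY   STATUS    RESTARTS   AGE
tf-gpu   1/1     Running   0          30s
```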
Check that it all works from a Python shell:
```console
$ kubectl exec -it tf-gpu bash
root@tf-gpu:/# python
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)
...
True
```
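Note that tf.test.is_gpu_available is deprecated in recent TensorFlow releases; the equivalent check there is:

```python
import tensorflow as tf

# Lists the GPU devices TensorFlow can see; an empty list means no GPU.
print(tf.config.list_physical_devices('GPU'))
```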