Using CDI (Container Device Interface) for NVIDIA GPUs with Podman
What is CDI (Container Device Interface)?
The Container Device Interface (CDI) is an open specification for exposing host devices to containers in a standardized way. For NVIDIA GPUs, it is the recommended and most reliable method of interfacing between Podman and the driver stack, ensuring that containers get proper access to the GPU hardware, device nodes, and driver libraries.
Note for AI/ML practitioners: If you're running large language models (LLMs) or other GPU-intensive AI workloads in containers, CDI is the supported way to give those Podman containers reliable, stable access to the GPU.
When using NVIDIA GPUs with containerized applications—especially through Podman—CDI creates a consistent interface between the host system’s GPU drivers and the container environment, allowing containers to fully leverage GPU acceleration for CUDA applications, machine learning frameworks, and more.
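For context, the generated /etc/cdi/nvidia.yaml describes the device names Podman can request and the edits (device nodes, driver library mounts, hooks) applied to the container. The excerpt below is a heavily abridged, illustrative sketch of that structure, not a literal copy of any particular host's file; a real generated spec contains many more entries and host-specific paths.
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
- name: "0"          # requested in Podman as --device nvidia.com/gpu=0
  containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
- name: all          # requested in Podman as --device nvidia.com/gpu=all
  containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
containerEdits:
  deviceNodes:
  - path: /dev/nvidiactl
  - path: /dev/nvidia-uvm
  # ...plus mounts for the host's driver libraries (which is why the file
  # embeds the driver version) and hooks run by the NVIDIA Container Toolkit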
Why Should You Use CDI with Podman?
CDI provides an efficient, consistent, and flexible way to manage and expose NVIDIA GPUs to Podman containers. Here’s why you should use it:
1. Podman’s Preferred GPU Integration Method
- CDI is the officially supported mechanism for Podman to access NVIDIA GPUs.
- It offers a more stable and maintainable approach than legacy methods such as passing raw device nodes with the --device flag (see the comparison sketch after this list).
2. GPU Access for Containers
- CDI ensures that GPU resources are correctly passed through to containers, enabling GPU-accelerated workloads inside Podman containers.
- It simplifies container runtime configuration when using NVIDIA GPUs.
3. Supports Modern Container Workflows
- CDI works seamlessly with Podman’s rootless containers and pod-based deployments.
- Provides consistent device access across different container configurations and deployment patterns.
4. Compatible with NVIDIA Drivers
- The CDI specification works directly with NVIDIA drivers to allow containers to use GPU features like CUDA, machine learning, and high-performance computing tasks.
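To make the contrast with legacy device passing concrete, here is a hedged sketch reusing the CUDA test image from later in this guide; the exact device nodes under /dev vary by system:
# CDI: request the GPU by name; the spec injects device nodes,
# driver libraries, and hooks automatically
podman run --rm --device nvidia.com/gpu=all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

# Legacy: pass raw device nodes by hand; the driver's user-space libraries
# (and nvidia-smi itself) are NOT injected, so this alone is usually not
# enough for real workloads
podman run --rm \
  --device /dev/nvidia0 \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi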
When Should You Use CDI with Podman?
You need to use CDI if you’re:
- Running Podman as your container engine and need to access NVIDIA GPUs.
- Working with large language models or other AI/ML workloads in containers that require GPU acceleration.
- Running GPU-accelerated workloads (e.g., AI, deep learning, data science) within Podman containers.
- Using NVIDIA GPUs on your system and want them to be accessible inside your Podman containerized environment.
- Managing a production environment that requires NVIDIA GPU support within Podman containers.
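If you are unsure whether your host already has a usable CDI specification, the NVIDIA Container Toolkit can list the device names Podman will accept (this assumes nvidia-ctk is installed and a spec has already been generated):
# List the CDI device names known on this host;
# entries typically look like nvidia.com/gpu=0 and nvidia.com/gpu=all
nvidia-ctk cdi list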
Steps to Regenerate CDI Specification File
Why You Might Need to Regenerate the CDI File
If you’ve updated your NVIDIA driver, the CDI file may be referencing an older version, which could cause Podman container runtime errors. This is common when the NVIDIA driver version is updated but the CDI specification file remains outdated.
How to Regenerate the CDI File
To resolve this, you need to regenerate the CDI specification to match your updated NVIDIA driver.
Manual Steps
- Remove the old CDI specification file. If an old CDI file exists, remove it to avoid conflicts:
sudo rm -f /etc/cdi/nvidia.yaml
- Regenerate the CDI file. Using the NVIDIA Container Toolkit, regenerate the specification so it matches the installed driver:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
- Verify the driver version. Check that the CDI file references your current driver version (replace 570 with the driver branch installed on your host):
grep "570" /etc/cdi/nvidia.yaml
- Restart Podman. Restart the Podman service to apply the changes:
sudo systemctl restart podman
- Test GPU access in Podman. Verify that a container can see the GPU:
podman run --device nvidia.com/gpu=all --rm nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
Automating CDI File Regeneration
For production environments running Podman with NVIDIA GPUs, it’s best to automate this process so that the CDI specification file is always up-to-date after driver updates or container restarts. Below is a script that can be used for automation:
Automated Script Example
#!/bin/bash
# Define CDI file location
CDI_FILE="/etc/cdi/nvidia.yaml"

# Get the current NVIDIA driver version (first GPU is enough)
DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
echo "Current NVIDIA Driver Version: $DRIVER_VERSION"

# Check if the CDI file exists
if [ -f "$CDI_FILE" ]; then
    # Check whether the CDI file references the current driver version
    if ! grep -q "$DRIVER_VERSION" "$CDI_FILE"; then
        echo "Outdated CDI configuration detected. Regenerating..."
        sudo rm -f "$CDI_FILE"
        sudo nvidia-ctk cdi generate --output="$CDI_FILE"
        sudo systemctl restart podman
        echo "CDI file updated and Podman restarted."
    else
        echo "CDI file is already up-to-date."
    fi
else
    echo "CDI file missing. Generating..."
    sudo nvidia-ctk cdi generate --output="$CDI_FILE"
    sudo systemctl restart podman
    echo "CDI file created and Podman restarted."
fi
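To pair this script with the systemd units in the next section, install it at the path the service unit expects and make it executable (the local filename nvidia-cdi-update.sh is just an assumed name):
sudo install -m 0755 nvidia-cdi-update.sh /usr/local/bin/nvidia-cdi-update.sh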
Setting Up Automated CDI Regeneration with a Systemd Timer
Rather than keeping the regeneration script in a continuously running service, you can use a systemd timer to run it at boot and on a schedule. This way the script only executes when needed, while you still get systemd's logging and dependency management.
1. Create a systemd Service Unit (Oneshot Job)
Create the service unit file at /etc/systemd/system/nvidia-cdi-update.service:
[Unit]
Description=Regenerate NVIDIA CDI file if outdated
[Service]
Type=oneshot
ExecStart=/usr/local/bin/nvidia-cdi-update.sh
2. Create a systemd Timer Unit
Create the timer unit file at /etc/systemd/system/nvidia-cdi-update.timer:
[Unit]
Description=Run NVIDIA CDI update at boot and every hour
[Timer]
OnBootSec=5min
OnUnitActiveSec=1h
Persistent=true
[Install]
WantedBy=timers.target
3. Enable and Start the Timer
Reload systemd and enable the timer:
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-cdi-update.timer
This timer will run your CDI update script 5 minutes after boot and then every hour, ensuring your CDI file stays current without the need for a continuously running service.
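Standard systemd tooling is enough to confirm the timer is scheduled, trigger a run by hand, and review what the script did:
systemctl list-timers nvidia-cdi-update.timer
sudo systemctl start nvidia-cdi-update.service   # run the check immediately
journalctl -u nvidia-cdi-update.service          # inspect the script's output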
Running LLMs with Podman and CDI
For those working with large language models in containers, here’s a quick example of how to use Podman with CDI to run an LLM inference container:
# Run an LLM inference container with GPU access
podman run --device nvidia.com/gpu=all \
  -v /path/to/models:/models \
  -p 8080:8080 \
  --rm \
  llm-inference-image:latest
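Two practical variations, sketched with the same placeholder image: the generated CDI spec also defines per-GPU names (nvidia.com/gpu=0, nvidia.com/gpu=1, and so on) if you want to pin a container to one device, and on SELinux-enabled hosts the model volume usually needs a :Z (or :z) relabel:
# Pin the container to the first GPU and relabel the model volume for SELinux
podman run --rm \
  --device nvidia.com/gpu=0 \
  -v /path/to/models:/models:Z \
  -p 8080:8080 \
  llm-inference-image:latest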
Conclusion
Using CDI is the recommended and most reliable way to enable GPU-accelerated containers with NVIDIA GPUs in Podman. It ensures that your Podman containers can correctly access and utilize GPU resources for demanding workloads like large language models, AI training, and deep learning.
Regenerating the CDI specification file is necessary after updating your NVIDIA driver to avoid errors and ensure that Podman uses the correct configuration. Automating this process using a systemd timer in a production environment ensures that your system remains efficient and up-to-date with minimal manual intervention.
For teams working with large language models or other GPU-intensive applications, CDI provides the stability and performance needed to effectively leverage NVIDIA GPUs within Podman containers.