🚀 How to Enable NVIDIA MIG and Deploy AI Workloads on Kubernetes – Step-by-Step Guide (with Code Examples) 🧑‍💻


Introduction to NVIDIA MIG: GPU Efficiency for AI Workloads

Modern AI and deep learning workloads push compute infrastructure hard, and GPU utilization often becomes the bottleneck for scaling, cost-effectiveness, and resource sharing. NVIDIA's Multi-Instance GPU (MIG) technology is built to address exactly that. 🚀

What is NVIDIA MIG?

NVIDIA MIG (Multi-Instance GPU) is a technology introduced with the NVIDIA A100 (Ampere architecture) and supported on later data-center GPUs. It partitions a single physical GPU into multiple hardware-isolated GPU instances (up to seven on an A100), each with its own dedicated memory, cache, and compute resources.

In simple terms:

One big GPU → Multiple smaller GPUs, each behaving like a standalone device! ✨

Why Use MIG?

  • Resource Optimization: Run multiple jobs or containers on a single GPU instead of leaving most of it idle.

  • Workload Isolation: Each instance has its own memory and compute slices, so one job cannot starve another. No noisy neighbors!

  • Scalability: Ideal for shared AI clusters, multi-user environments, and mixed training/inference deployments.

  • Cost Savings: Maximize your GPU investment by supporting more users and workloads per server. 💰

In this hands-on guide, we'll learn how to:

  • Enable MIG on your NVIDIA GPU

  • Configure and manage MIG profiles

  • Expose MIG profiles to Kubernetes

  • Deploy an AI workload using MIG in K8s

  • 🟢 Test and verify setup with PyTorch

Let's roll up our sleeves and get started!

🧐 Why Use NVIDIA MIG?

With MIG, you can partition one massive GPU into multiple smaller, fully isolated GPUs. This is a game-changer for:

  • Multi-user AI platforms

  • ML training + inference in parallel

  • Improving GPU utilization πŸ’Έ

🔎 Prerequisites

  • NVIDIA A100 (or other MIG-capable GPU)

  • Ubuntu 22.04+ with nvidia-driver-570 or higher

  • Docker & Kubernetes cluster (with GPU nodes)

Check the NVIDIA Driver

Make sure the right driver is installed. Matching on nvidia-driver (rather than one exact version) also catches newer releases:

dpkg -l | grep nvidia-driver

Output should look like:

ii  nvidia-driver-570  570.133.07-0ubuntu0.22.04.1  amd64  NVIDIA driver metapackage

Check your GPU visibility:

sudo nvidia-smi

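Enable MIG Mode 🚦

Run this command to turn on MIG mode. Without -i it applies to every GPU on the node; add -i 0 to target only GPU 0. If the GPU is in use, the change can stay pending until the GPU is reset or the node is rebooted: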

sudo nvidia-smi -mig 1

Verify it's enabled:

sudo nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader,nounits

Expected output:

Enabled
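
If you want to script this check across several GPUs, a small parser helps. The sketch below is an illustration, not part of the original setup: it assumes you query index, mig.mode.current, and mig.mode.pending together (fields listed by nvidia-smi --help-query-gpu; confirm them on your driver), and the function names are hypothetical.

```python
import subprocess

# Assumed query: one CSV row per GPU, "index, current mode, pending mode".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,mig.mode.current,mig.mode.pending",
    "--format=csv,noheader",
]


def parse_mig_modes(csv_text):
    """Turn 'index, current, pending' rows into {gpu_index: (current, pending)}."""
    modes = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip anything that is not a well-formed row
        index, current, pending = fields
        modes[int(index)] = (current, pending)
    return modes


def gpus_pending_change(modes):
    """Return GPU indices whose pending mode differs from the current mode."""
    return [i for i, (current, pending) in sorted(modes.items()) if current != pending]


def read_mig_modes():
    """Run nvidia-smi on this host and parse its output (requires the driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_mig_modes(result.stdout)
```

On a GPU node, gpus_pending_change(read_mig_modes()) returning an empty list would suggest that no GPU is still waiting for a reset before its MIG mode takes effect.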

List Available MIG Profiles

Find out which profiles you can use:

sudo nvidia-smi mig -lgip
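
Profile names encode the size of the slice: in 1g.10gb, 1g is the number of compute slices and 10gb is the memory in GB. The sketch below shows one way to read such a name and derive the Kubernetes resource name that appears later in this guide (nvidia.com/mig-1g.10gb). The helper names are hypothetical, and it deliberately handles only plain names of the form Ng.Mgb, not variants with extra suffixes.

```python
import re

# Matches plain GPU-instance profile names such as "1g.10gb" or "3g.40gb".
PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb$")


def parse_profile(name):
    """Return (compute_slices, memory_gb) for a plain MIG profile name."""
    match = PROFILE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized MIG profile name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def mixed_strategy_resource(name):
    """Build the resource name used in this guide for a given profile."""
    parse_profile(name)  # validate the name before using it
    return f"nvidia.com/mig-{name}"


slices, memory_gb = parse_profile("1g.10gb")
print(slices, memory_gb)                   # prints: 1 10
print(mixed_strategy_resource("1g.10gb"))  # prints: nvidia.com/mig-1g.10gb
```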

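Create a GPU Instance from a MIG Profile 💾

For example, to create a GPU instance with the 1g.10gb profile on GPU 0 (1g.10gb is listed on 80 GB cards; use whichever profile name the previous command reports for your GPU):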

sudo nvidia-smi mig -cgi 1g.10gb -i 0

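Look up the ID of the GPU instance (GI) you just created:

sudo nvidia-smi mig -lgi

Then create a compute instance inside it, replacing XX with your actual GI ID (passing -C to the earlier -cgi command would have created a default compute instance in the same step):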

sudo nvidia-smi mig -cci -gi XX -i 0

List All MIG Devices and the Main GPU

sudo nvidia-smi -L

Example output:

GPU 0: NVIDIA A100 XXGB PCIe (UUID: ...)
  MIG 1g.10gb     Device  0: (UUID: ...)
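
To feed this listing into automation, you can parse it into a structure. The following is a minimal sketch that assumes the two line shapes shown in the example output above; the function name and the placeholder UUIDs in the sample are hypothetical, and other driver versions may format the listing differently.

```python
import re

# "GPU 0: <name> (UUID: <uuid>)"
GPU_RE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ([^)]+)\)\s*$")
# "  MIG <profile>   Device  <n>: (UUID: <uuid>)"
MIG_RE = re.compile(r"^\s+MIG (\S+)\s+Device\s+(\d+): \(UUID: ([^)]+)\)\s*$")


def parse_device_list(text):
    """Parse `nvidia-smi -L` output into a list of GPUs with their MIG devices."""
    gpus = []
    for line in text.splitlines():
        gpu = GPU_RE.match(line)
        if gpu:
            gpus.append({
                "index": int(gpu.group(1)),
                "name": gpu.group(2),
                "uuid": gpu.group(3),
                "mig": [],
            })
            continue
        mig = MIG_RE.match(line)
        if mig and gpus:
            # MIG lines belong to the most recently listed physical GPU.
            gpus[-1]["mig"].append({
                "profile": mig.group(1),
                "device": int(mig.group(2)),
                "uuid": mig.group(3),
            })
    return gpus


# Placeholder UUIDs, for illustration only.
sample = (
    "GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-aaaa)\n"
    "  MIG 1g.10gb     Device  0: (UUID: MIG-bbbb)\n"
)
for gpu in parse_device_list(sample):
    print(gpu["index"], gpu["name"], [m["profile"] for m in gpu["mig"]])
```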

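Enable MIG Support in NVIDIA Device Plugin for Kubernetes ☸️

Create nvidia-toolkit-mig.yml, a DaemonSet that runs the NVIDIA device plugin with the mixed MIG strategy so that each MIG profile is advertised as its own resource: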

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
        name: nvidia-device-plugin-ctr
        args:
        # mixed: expose each MIG profile as nvidia.com/mig-<profile>
        - --mig-strategy=mixed
        - --fail-on-init-error=false
        - --pass-device-specs=true
        env:
        - name: MIG_STRATEGY
          value: "mixed"
        - name: NVIDIA_MIG_MONITOR_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "all"
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        securityContext:
          privileged: true
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        # kubelet discovers device plugins through this directory
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
      nodeSelector:
        kubernetes.io/os: linux

Apply the DaemonSet:

sudo kubectl apply -f nvidia-toolkit-mig.yml

Confirm MIG Resources on K8s Node 🚦

Run:

sudo kubectl describe node <your-node> | grep -A 10 "Allocatable:" | grep nvidia

Example output:

nvidia.com/gpu:          0
nvidia.com/mig-1g.10gb:  1
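
The same check can be done programmatically from the node's JSON representation (for example, the output of kubectl get node <your-node> -o json). This is a minimal sketch, assuming the usual status.allocatable layout with quantities stored as strings; the function name is hypothetical.

```python
import json


def nvidia_allocatable(node):
    """Return the nvidia.com/* entries from a node's allocatable resources."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return {
        name: int(quantity)
        for name, quantity in allocatable.items()
        if name.startswith("nvidia.com/")
    }


# Trimmed-down node object mirroring the example output above.
node_json = """
{
  "status": {
    "allocatable": {
      "cpu": "32",
      "nvidia.com/gpu": "0",
      "nvidia.com/mig-1g.10gb": "1"
    }
  }
}
"""
print(nvidia_allocatable(json.loads(node_json)))
```

A nonzero count for nvidia.com/mig-1g.10gb is what the pod in the next section relies on.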

🎉 Your cluster now sees the MIG device as an allocatable resource!

Deploy and Test an AI Workload 🚀

Here's a sample pod manifest that runs a PyTorch job on your MIG device. Save it as pod-test-training.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-mig-training
  labels:
    app: ai-training
spec:
  restartPolicy: Never
  containers:
  - name: pytorch-container
    image: nvcr.io/nvidia/pytorch:23.08-py3
    resources:
      requests:
        nvidia.com/mig-1g.10gb: 1
        cpu: "2"
        memory: "4Gi"
      limits:
        nvidia.com/mig-1g.10gb: 1
        cpu: "4"
        memory: "8Gi"
    command: ["/bin/bash", "-c"]
    args:
    - |
      echo "=== MIG Device Information ==="
      nvidia-smi
      echo ""
      echo "=== MIG Device List ==="
      nvidia-smi -L
      echo ""
      echo "=== Testing PyTorch with MIG ==="
      python3 -c "
      import torch
      # ... (full test code omitted from this listing)
      "
      echo "=== Training completed successfully! ==="
      sleep 300
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all

Apply it:

sudo kubectl apply -f pod-test-training.yaml

View logs:

sudo kubectl logs -f pytorch-mig-training

📊 Sample Output (What You Should See)

=== MIG Device Information ===
...
Device Name: NVIDIA A100 XXGB PCIe MIG 1g.10gb
Device Memory: 9.5 GB

=== Creating a Simple Neural Network ===
Model created and moved to MIG device

=== Training Loop (10 iterations) ===
Epoch 1/10, Loss: ...
...

=== Testing Large Matrix Operations ===
Matrix multiplication (2000x2000) completed in 0.06 seconds
Result tensor device: cuda:0

GPU Memory Usage:
Allocated: 0.06 GB
Cached: 0.07 GB

👍 Training completed successfully!
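
The test script itself is elided in the pod manifest. The sketch below is not the author's script; it is one possible reconstruction, inferred from the section titles in the sample output, of a smoke test that prints similar information. The layer sizes, batch size, learning rate, and function names are all assumptions.

```python
import time

try:
    import torch
    import torch.nn as nn
except ImportError:  # keeps the helper importable on machines without PyTorch
    torch = None


def to_gb(num_bytes):
    """Convert a byte count to GiB for display."""
    return num_bytes / 1024**3


def run_mig_smoke_test(steps=10, matrix_size=2000):
    """Run a tiny training loop and a matmul on the visible CUDA device."""
    if torch is None or not torch.cuda.is_available():
        print("CUDA device not available; skipping MIG smoke test")
        return False

    device = torch.device("cuda:0")  # inside the pod, the MIG slice is cuda:0
    props = torch.cuda.get_device_properties(device)
    print(f"Device Name: {props.name}")
    print(f"Device Memory: {to_gb(props.total_memory):.1f} GB")

    print("=== Creating a Simple Neural Network ===")
    model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    print(f"=== Training Loop ({steps} iterations) ===")
    for step in range(1, steps + 1):
        inputs = torch.randn(64, 1024, device=device)          # random batch
        targets = torch.randint(0, 10, (64,), device=device)   # random labels
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        print(f"Epoch {step}/{steps}, Loss: {loss.item():.4f}")

    print("=== Testing Large Matrix Operations ===")
    a = torch.randn(matrix_size, matrix_size, device=device)
    b = torch.randn(matrix_size, matrix_size, device=device)
    torch.cuda.synchronize()  # make sure timing covers only the matmul
    start = time.perf_counter()
    result = a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"Matrix multiplication ({matrix_size}x{matrix_size}) completed in {elapsed:.2f} seconds")
    print(f"Result tensor device: {result.device}")

    print("GPU Memory Usage:")
    print(f"Allocated: {to_gb(torch.cuda.memory_allocated(device)):.2f} GB")
    print(f"Cached: {to_gb(torch.cuda.memory_reserved(device)):.2f} GB")
    return True


if __name__ == "__main__":
    run_mig_smoke_test()
```

Because the sketch uses double-quoted strings, it would clash with the python3 -c "..." quoting in the manifest; saving it as a file in the image or mounting it into the pod avoids that.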

🏁 Wrapping Up & Troubleshooting

With MIG, you can run multiple isolated AI jobs on the same physical GPU, maximizing utilization, reducing costs, and making your infrastructure much more efficient! πŸ₯³

If you encounter issues:

  • Double-check driver versions and MIG status

  • Make sure K8s nodes are properly labeled for GPU scheduling

  • Validate your DaemonSet configuration

💡 Pro Tips

  • Resetting MIG: If something goes wrong, delete all MIG instances and start over.

    Delete All MIG Instances ❌

    Compute instances must be destroyed before the GPU instances that contain them:

      sudo nvidia-smi mig -dci -i 0
      sudo nvidia-smi mig -dgi -i 0
    
  • Monitoring: Use nvidia-smi and kubectl for real-time status.

  • Scaling: Mix and match MIG profiles for different workloads – that's the real magic! ✨