How to Enable NVIDIA MIG and Deploy AI Workloads on Kubernetes: Step-by-Step Guide (with Code Examples)
Introduction to NVIDIA MIG: GPU Efficiency for AI Workloads
Modern AI and deep learning workloads push the limits of compute infrastructure, and poor GPU utilization often becomes the bottleneck for scaling, cost-effectiveness, and resource sharing. This is where NVIDIA's Multi-Instance GPU (MIG) technology steps in.
What is NVIDIA MIG?
NVIDIA MIG (Multi-Instance GPU) is a technology introduced with the NVIDIA A100 (Ampere architecture) and available on later GPUs. It enables a single physical GPU to be partitioned into multiple, completely isolated GPU instances, each with its own dedicated memory, cache, and compute resources.
In simple terms:
One big GPU → multiple smaller GPUs, each behaving like a standalone device!
Why Use MIG?
Resource Optimization: Run multiple jobs or containers on a single GPU, eliminating resource wastage.
Workload Isolation: Each instance is fully isolated for security and stability, so there are no noisy neighbors!
Scalability: Ideal for shared AI clusters, multi-user environments, and mixed training/inference deployments.
Cost Savings: Maximize your GPU investment by supporting more users and workloads per server.
In this hands-on guide, we'll learn how to:
Enable MIG on your NVIDIA GPU
Configure and manage MIG profiles
Expose MIG profiles to Kubernetes
Deploy an AI workload using MIG in K8s
Test and verify the setup with PyTorch
Let's roll up our sleeves and get started!
Why Use NVIDIA MIG?
With MIG, you can partition one massive GPU into multiple smaller, fully isolated GPUs. This is a game-changer for:
Multi-user AI platforms
ML training + inference in parallel
Improving GPU utilization
Prerequisites
NVIDIA A100 (or other MIG-capable GPU)
Ubuntu 22.04+ with nvidia-driver-570 or higher
Docker & Kubernetes cluster (with GPU nodes)
Check the NVIDIA Driver
Make sure the right driver is installed:
sudo dpkg -l | grep nvidia-driver-570
Output should look like:
ii nvidia-driver-570 570.133.07-0ubuntu0.22.04.1 amd64 NVIDIA driver metapackage
Check your GPU visibility:
sudo nvidia-smi
Enable MIG Mode
Run this command to turn on MIG:
sudo nvidia-smi -mig 1
Verify it's enabled:
sudo nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader,nounits
Expected output:
Enabled
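The check above can also be scripted, which helps when a node has several GPUs. The following is a minimal Python sketch, assuming nvidia-smi is on the PATH. The helper names (parse_mig_modes, all_mig_enabled, query_mig_modes) are made up for illustration and are not part of any NVIDIA tooling.

```python
import shutil
import subprocess


def parse_mig_modes(csv_text):
    """Parse 'index, mig.mode.current' CSV rows into {gpu_index: mode}."""
    modes = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def all_mig_enabled(modes):
    """True only if at least one GPU was reported and every GPU says 'Enabled'."""
    return bool(modes) and all(mode == "Enabled" for mode in modes.values())


def query_mig_modes():
    """Run nvidia-smi and return the parsed modes, or {} if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return {}
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,mig.mode.current",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_mig_modes(result.stdout)


if __name__ == "__main__":
    modes = query_mig_modes()
    for index, mode in sorted(modes.items()):
        print(f"GPU {index}: MIG {mode}")
    print("All GPUs MIG-enabled:", all_mig_enabled(modes))
```

Keeping the parsing separate from the nvidia-smi call means the logic can be exercised on a machine without a GPU.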
List Available MIG Profiles
Find out which profiles you can use:
sudo nvidia-smi mig -lgip
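Create a MIG Profile
Profile names depend on the GPU model, so pick one that the previous command lists for your GPU. For example, to create a GPU instance with the 1g.10gb profile on GPU 0: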
sudo nvidia-smi mig -cgi 1g.10gb -i 0
Check the created instance ID:
sudo nvidia-smi mig -lgi
Create a compute instance inside it (replace XX with the GPU instance ID from the previous command):
sudo nvidia-smi mig -cci -gi XX -i 0
List All MIG Devices and the Main GPU
sudo nvidia-smi -L
Example output:
GPU 0: NVIDIA A100 XXGB PCIe (UUID: ...)
  MIG 1g.10gb Device 0: (UUID: ...)
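If you want to feed this listing into other tooling, it can be parsed with a couple of regular expressions. This is a rough Python sketch that assumes the line layout shown in the example output above. The function name parse_mig_devices and the placeholder UUIDs in the sample are illustrative only.

```python
import re

# Matches lines such as "  MIG 1g.10gb Device 0: (UUID: MIG-...)".
MIG_LINE = re.compile(
    r"^\s*MIG\s+(?P<profile>\S+)\s+Device\s+(?P<device>\d+):"
    r"\s+\(UUID:\s*(?P<uuid>[^)]+)\)"
)
# Matches the parent GPU line, e.g. "GPU 0: NVIDIA A100 ... (UUID: ...)".
GPU_LINE = re.compile(r"^GPU\s+(?P<index>\d+):")


def parse_mig_devices(listing):
    """Return one dict per MIG device found in `nvidia-smi -L` style output."""
    devices = []
    current_gpu = None
    for line in listing.splitlines():
        gpu = GPU_LINE.match(line)
        if gpu:
            current_gpu = int(gpu.group("index"))
            continue
        mig = MIG_LINE.match(line)
        if mig:
            devices.append({
                "gpu": current_gpu,
                "profile": mig.group("profile"),
                "device": int(mig.group("device")),
                "uuid": mig.group("uuid").strip(),
            })
    return devices


if __name__ == "__main__":
    # Placeholder listing; real UUIDs are much longer.
    sample = (
        "GPU 0: NVIDIA A100 XXGB PCIe (UUID: GPU-example)\n"
        "  MIG 1g.10gb Device 0: (UUID: MIG-example)\n"
    )
    for dev in parse_mig_devices(sample):
        print(dev)
```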
Enable MIG Support in NVIDIA Device Plugin for Kubernetes
Create nvidia-toolkit-mig.yml (a DaemonSet for the NVIDIA device plugin):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  # A DaemonSet requires a selector that matches the pod template labels.
  # The label value is arbitrary as long as both places agree.
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
          name: nvidia-device-plugin-ctr
          args:
            # "mixed" exposes each MIG profile as its own resource type.
            - --mig-strategy=mixed
            - --fail-on-init-error=false
            - --pass-device-specs=true
          env:
            - name: MIG_STRATEGY
              value: "mixed"
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: "all"
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: "all"
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
          securityContext:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: dev
              mountPath: /dev
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: dev
          hostPath:
            path: /dev
      nodeSelector:
        kubernetes.io/os: linux
Apply the DaemonSet:
sudo kubectl apply -f nvidia-toolkit-mig.yml
Confirm MIG Resources on K8s Node
Run:
sudo kubectl describe node <your-node> | grep -A 10 "Allocatable:" | grep nvidia
Example output:
nvidia.com/gpu: 0
nvidia.com/mig-1g.10gb: 1
Now your cluster sees the MIG device as an allocatable resource!
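The grep pipeline above is fine for a quick look. For automation, the same information can be read from the node's JSON representation, where status.allocatable maps resource names to quantities. Below is a small Python sketch. The function names and the sample values (such as the cpu count) are illustrative assumptions, and fetch_node_json simply shells out to kubectl get node -o json.

```python
import json
import shutil
import subprocess


def nvidia_allocatable(node_json):
    """Extract nvidia.com/* entries from a node's status.allocatable map."""
    node = json.loads(node_json)
    allocatable = node.get("status", {}).get("allocatable", {})
    return {
        name: int(count)
        for name, count in allocatable.items()
        if name.startswith("nvidia.com/")
    }


def fetch_node_json(node_name):
    """Return `kubectl get node <name> -o json` output, or None without kubectl."""
    if shutil.which("kubectl") is None:
        return None
    result = subprocess.run(
        ["kubectl", "get", "node", node_name, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Sample document mirroring the example output above (values are made up).
    sample = json.dumps({
        "status": {"allocatable": {
            "cpu": "16",
            "nvidia.com/gpu": "0",
            "nvidia.com/mig-1g.10gb": "1",
        }}
    })
    for name, count in sorted(nvidia_allocatable(sample).items()):
        print(f"{name}: {count}")
```

To run it against a real node, pass the output of fetch_node_json("<your-node>") to nvidia_allocatable instead of the sample.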
Deploy and Test an AI Workload
Here's a sample pod manifest that runs a PyTorch job on your MIG device. Save it as pod-test-training.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-mig-training
  labels:
    app: ai-training
spec:
  restartPolicy: Never
  containers:
    - name: pytorch-container
      image: nvcr.io/nvidia/pytorch:23.08-py3
      resources:
        requests:
          nvidia.com/mig-1g.10gb: 1
          cpu: "2"
          memory: "4Gi"
        limits:
          nvidia.com/mig-1g.10gb: 1
          cpu: "4"
          memory: "8Gi"
      command: ["/bin/bash", "-c"]
      args:
        - |
          echo "=== MIG Device Information ==="
          nvidia-smi
          echo ""
          echo "=== MIG Device List ==="
          nvidia-smi -L
          echo ""
          echo "=== Testing PyTorch with MIG ==="
          python3 -c "
          import torch
          # ... (full test code, see above for complete script)
          "
          echo "=== Training completed successfully! ==="
          sleep 300
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
Apply it:
sudo kubectl apply -f pod-test-training.yaml
View logs:
sudo kubectl logs -f pytorch-mig-training
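If your nodes expose more than one MIG profile, it can be convenient to generate the pod manifest instead of hand-editing it. The following Python sketch builds a trimmed-down version of the pod as a dictionary and prints it as JSON, which kubectl accepts as well as YAML. The function name mig_pod_manifest and the container name "main" are illustrative assumptions. CPU and memory settings are left out for brevity.

```python
import json


def mig_pod_manifest(name, profile, image, command):
    """Build a Pod dict that requests exactly one MIG device of `profile`."""
    resource = f"nvidia.com/mig-{profile}"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "main",
                "image": image,
                "command": command,
                # Extended resources cannot be overcommitted, so the
                # request and the limit must be the same value.
                "resources": {
                    "requests": {resource: 1},
                    "limits": {resource: 1},
                },
            }],
        },
    }


if __name__ == "__main__":
    pod = mig_pod_manifest(
        name="pytorch-mig-training",
        profile="1g.10gb",
        image="nvcr.io/nvidia/pytorch:23.08-py3",
        command=["nvidia-smi", "-L"],
    )
    print(json.dumps(pod, indent=2))
```

Saved under a name of your choice (say make_pod.py, a hypothetical file name), the output can be piped straight into kubectl apply -f -.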
Sample Output (What You Should See)
=== MIG Device Information ===
...
Device Name: NVIDIA A100 XXGB PCIe MIG 1g.10gb
Device Memory: 9.5 GB
=== Creating a Simple Neural Network ===
Model created and moved to MIG device
=== Training Loop (10 iterations) ===
Epoch 1/10, Loss: ...
...
=== Testing Large Matrix Operations ===
Matrix multiplication (2000x2000) completed in 0.06 seconds
Result tensor device: cuda:0
GPU Memory Usage:
Allocated: 0.06 GB
Cached: 0.07 GB
Training completed successfully!
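The full test script is not reproduced in this guide. As a rough idea of what could produce output of this shape, here is a hypothetical reconstruction in Python. The layer sizes, batch size, learning rate, and the to_gb helper are all assumptions, and the timings and memory figures you see will differ from the sample above.

```python
import time


def to_gb(num_bytes):
    """Convert a byte count to gigabytes (1 GB = 1024**3 bytes here)."""
    return num_bytes / 1024 ** 3


def main():
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; run this inside the PyTorch container.")
        return
    if not torch.cuda.is_available():
        print("CUDA is not available; no MIG device is visible.")
        return

    device = torch.device("cuda:0")
    props = torch.cuda.get_device_properties(0)
    print(f"Device Name: {props.name}")
    print(f"Device Memory: {to_gb(props.total_memory):.1f} GB")

    print("=== Creating a Simple Neural Network ===")
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 512),
        torch.nn.ReLU(),
        torch.nn.Linear(512, 10),
    ).to(device)
    print("Model created and moved to MIG device")

    print("=== Training Loop (10 iterations) ===")
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(1, 11):
        # Random data: this only exercises the device, it learns nothing.
        inputs = torch.randn(64, 1024, device=device)
        targets = torch.randint(0, 10, (64,), device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        print(f"Epoch {epoch}/10, Loss: {loss.item():.4f}")

    print("=== Testing Large Matrix Operations ===")
    a = torch.randn(2000, 2000, device=device)
    b = torch.randn(2000, 2000, device=device)
    torch.cuda.synchronize()
    start = time.time()
    result = a @ b
    torch.cuda.synchronize()  # wait for the kernel before stopping the clock
    elapsed = time.time() - start
    print(f"Matrix multiplication (2000x2000) completed in {elapsed:.2f} seconds")
    print(f"Result tensor device: {result.device}")

    print("GPU Memory Usage:")
    print(f"Allocated: {to_gb(torch.cuda.memory_allocated(0)):.2f} GB")
    print(f"Cached: {to_gb(torch.cuda.memory_reserved(0)):.2f} GB")


if __name__ == "__main__":
    main()
```

Because the sketch uses double-quoted strings, it is easier to save it as a file and run it with python3 inside the container than to paste it into a python3 -c "..." argument.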
Wrapping Up & Troubleshooting
With MIG, you can run multiple isolated AI jobs on the same physical GPU, maximizing utilization, reducing costs, and making your infrastructure much more efficient!
If you encounter issues:
Double-check driver versions and MIG status
Make sure K8s nodes are properly labeled for GPU scheduling
Validate your DaemonSet configuration
Pro Tips
Resetting MIG: If something goes wrong, delete all MIG instances and start over!
Delete All MIG Instances
To reset MIG, destroy the compute instances first, then the GPU instances that contain them:
sudo nvidia-smi mig -dci -i 0
sudo nvidia-smi mig -dgi -i 0
Monitoring: Use nvidia-smi and kubectl for real-time status.
Scaling: Mix and match MIG profiles for different workloads. That's the real magic!