Introduction
Running AI models in production requires scalable, reliable infrastructure. Kubernetes with GPU support provides auto-scaling, health monitoring, and resource management for GPU workloads. This guide walks you through setting up a production-ready Kubernetes GPU cluster.
Prerequisites
- Kubernetes 1.28+ cluster (EKS, GKE, or bare metal)
- NVIDIA GPUs (A100, H100, or B200)
- kubectl and Helm installed
- Basic Kubernetes knowledge
Step 1: Install NVIDIA GPU Operator
The GPU Operator automates the management of all NVIDIA software components:
```shell
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true
```

Verify GPU detection:
```shell
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
```

Step 2: Configure GPU Resource Scheduling
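Before restricting GPU scheduling, it is worth confirming that a pod can actually be scheduled onto a GPU end to end. A throwaway pod like the following works for that check (the CUDA image tag is an example; any image containing `nvidia-smi` will do):

```yaml
# gpu-smoke-test.yaml -- illustrative one-off pod; delete it after checking the logs
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04   # example tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

`kubectl apply -f gpu-smoke-test.yaml` followed by `kubectl logs gpu-smoke-test` should print the GPU inventory; if the pod stays `Pending`, the GPU Operator is not exposing `nvidia.com/gpu` capacity yet.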
```yaml
# gpu-resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-serving
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"
```

Step 3: Deploy Triton Inference Server
```yaml
# triton-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          # Point Triton at the mounted model repository
          args: ["tritonserver", "--model-repository=/models"]
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
              memory: "32Gi"
              cpu: "8"
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
          volumeMounts:
            - name: model-store
              mountPath: /models
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-storage
```

Step 4: Auto-Scaling Configuration
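The HPA below scales on a custom `gpu_utilization` Pods metric. Kubernetes does not provide such a metric out of the box: it must be served through the custom metrics API, typically by installing prometheus-adapter on top of the Prometheus and DCGM exporter setup from Step 5. A rule along these lines maps DCGM utilization onto pods (illustrative sketch; the exact DCGM metric and label names depend on your exporter version):

```yaml
# prometheus-adapter values excerpt -- illustrative mapping, not a drop-in config
rules:
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
      resources:
        overrides:
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "gpu_utilization"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

Without an adapter serving this metric, the HPA will report it as unavailable and refuse to scale.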
```yaml
# gpu-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"
```

Step 5: Monitoring
Deploy Prometheus and Grafana for GPU monitoring:
```shell
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

# Install DCGM exporter for GPU metrics
helm install dcgm-exporter nvidia/dcgm-exporter \
  --namespace monitoring
```

Key metrics to monitor:
- GPU utilization per pod
- GPU memory usage
- Inference latency (P50, P95, P99)
- Request throughput
- Queue depth
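Once the DCGM exporter is scraped by Prometheus, the GPU-side metrics correspond to queries like the following (metric names are the standard DCGM exporter names as of recent versions; verify against your deployment):

```promql
# Average GPU utilization (percent) per pod
avg by (exported_pod) (DCGM_FI_DEV_GPU_UTIL)

# GPU framebuffer memory used, in MiB
DCGM_FI_DEV_FB_USED
```

Latency, throughput, and queue depth come from the application side; Triton exposes its own Prometheus metrics on port 8002 by default.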
Troubleshooting
- GPU not detected: Ensure GPU Operator pods are running, check node labels
- OOM kills: Increase memory limits, or reduce model batch size
- Slow inference: Check GPU utilization, ensure models are optimized with TensorRT
- Scaling issues: Verify HPA metrics are being collected, check DCGM exporter
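For the OOM case above, the fix is usually a resources bump on the serving container; the values below are examples to adjust based on observed usage, not recommendations:

```yaml
# excerpt from triton-deployment.yaml -- example values only
resources:
  requests:
    nvidia.com/gpu: 1
    memory: "48Gi"
  limits:
    nvidia.com/gpu: 1
    memory: "64Gi"
```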
Conclusion
A Kubernetes GPU cluster provides the foundation for scalable AI model serving. With proper auto-scaling, monitoring, and resource management, you can serve AI models reliably to thousands of concurrent users.
Key Takeaways
- GPU Operator simplifies NVIDIA driver and toolkit management
- Triton Inference Server supports multiple model frameworks
- Auto-scaling based on GPU utilization optimizes cost
- Monitoring is critical for production GPU workloads