
Setting Up a Kubernetes GPU Cluster for AI Model Serving

Deploy and manage GPU workloads on Kubernetes with NVIDIA GPU Operator and Triton Inference Server.

AI · Cloud · 2026-01-30 · 20 min read

Introduction

Running AI models in production requires scalable, reliable infrastructure. Kubernetes with GPU support provides auto-scaling, health monitoring, and resource management for GPU workloads. This guide walks you through setting up a production-ready Kubernetes GPU cluster.

Prerequisites

  • Kubernetes 1.28+ cluster (EKS, GKE, or bare metal)
  • NVIDIA GPUs (A100, H100, or B200)
  • kubectl and Helm installed
  • Basic Kubernetes knowledge

Step 1: Install NVIDIA GPU Operator

The GPU Operator automates the management of all NVIDIA software components:

bash
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true

Verify GPU detection:

bash
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
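
To confirm that pods can actually be scheduled onto a GPU, run a throwaway test pod along these lines (the CUDA image tag is an example — substitute any recent tag available to your cluster):

```yaml
# gpu-test-pod.yaml -- requests one GPU and prints nvidia-smi output
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-test
    image: nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

Apply it with kubectl apply -f gpu-test-pod.yaml; once the pod completes, kubectl logs gpu-test should show the familiar nvidia-smi device table.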

Step 2: Configure GPU Resource Scheduling

Cap how many GPUs workloads in the ai-serving namespace can request with a ResourceQuota:

yaml
# gpu-resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-serving
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"
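
Apply the quota (the ai-serving namespace must exist first):

```shell
# Create the namespace and apply the GPU quota
kubectl create namespace ai-serving
kubectl apply -f gpu-resource-quota.yaml

# Confirm the quota is registered and shows current usage
kubectl describe resourcequota gpu-quota -n ai-serving
```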

Step 3: Deploy Triton Inference Server

Triton serves models from a model repository mounted at /models. This Deployment runs two replicas of the 24.01 Triton image, each pinned to a single GPU:

yaml
# triton-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-py3
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8001
          name: grpc
        volumeMounts:
        - name: model-store
          mountPath: /models
      volumes:
      - name: model-store
        persistentVolumeClaim:
          claimName: model-storage
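
The Deployment above references a PersistentVolumeClaim named model-storage that you must create yourself, and the pods still need a Service in front of them. A minimal sketch of both follows — the storage class and size are assumptions to adapt to your cluster:

```yaml
# model-storage.yaml -- backing store for the Triton model repository
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: ai-serving
spec:
  accessModes:
  - ReadOnlyMany            # multiple replicas read the same models
  resources:
    requests:
      storage: 100Gi        # assumption: size to your model repository
  storageClassName: standard  # assumption: use a class your cluster provides
---
# triton-service.yaml -- stable endpoint in front of the replicas
apiVersion: v1
kind: Service
metadata:
  name: triton-inference
  namespace: ai-serving
spec:
  selector:
    app: triton
  ports:
  - name: http
    port: 8000
    targetPort: http
  - name: grpc
    port: 8001
    targetPort: grpc
```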

Step 4: Auto-Scaling Configuration

Note that gpu_utilization is a custom pod metric: the HPA can only act on it if an adapter (typically Prometheus Adapter, fed by the DCGM exporter installed in Step 5) exposes it through the custom metrics API. With that in place, the HPA scales on average GPU utilization:

yaml
# gpu-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "70"
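
One hedged sketch of how that metric could be surfaced: a Prometheus Adapter rule mapping the DCGM exporter's DCGM_FI_DEV_GPU_UTIL gauge to a gpu_utilization pod metric. Label names vary across DCGM exporter versions and scrape configurations, so treat this as a starting point rather than a drop-in config:

```yaml
# prometheus-adapter values excerpt (assumption: labels match your scrape config)
rules:
  custom:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      as: "gpu_utilization"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```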

Step 5: Monitoring

Deploy Prometheus and Grafana for GPU monitoring:

bash
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

# Install DCGM exporter for GPU metrics
helm install dcgm-exporter nvidia/dcgm-exporter \
  --namespace monitoring
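
To sanity-check that GPU metrics are actually flowing, port-forward the exporter and look for DCGM series (the service name and port here assume the chart defaults; adjust to your release):

```shell
# DCGM exporter serves Prometheus metrics on port 9400 by default
kubectl port-forward -n monitoring svc/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```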

Key metrics to monitor:

  • GPU utilization per pod
  • GPU memory usage
  • Inference latency (P50, P95, P99)
  • Request throughput
  • Queue depth

Troubleshooting

  • GPU not detected: Ensure GPU Operator pods are running, check node labels
  • OOM kills: Increase memory limits, or reduce model batch size
  • Slow inference: Check GPU utilization, ensure models are optimized with TensorRT
  • Scaling issues: Verify HPA metrics are being collected, check DCGM exporter
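
A few starting-point commands for the issues above (angle-bracket names are placeholders; the DCGM label selector is an assumption that depends on the chart's labeling conventions):

```shell
# GPU not detected: check GPU Operator component health and node labels
kubectl get pods -n gpu-operator
kubectl describe node <node-name> | grep -i nvidia

# OOM kills / scheduling failures: inspect recent events on the pod
kubectl describe pod <pod-name> -n ai-serving

# Scaling issues: confirm the exporter is emitting metrics
kubectl logs -n monitoring -l app.kubernetes.io/name=dcgm-exporter
```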

Conclusion

A Kubernetes GPU cluster provides the foundation for scalable AI model serving. With proper auto-scaling, monitoring, and resource management, you can serve AI models reliably to thousands of concurrent users.

Key Takeaways

  • GPU Operator simplifies NVIDIA driver and toolkit management
  • Triton Inference Server supports multiple model frameworks
  • Auto-scaling based on GPU utilization optimizes cost
  • Monitoring is critical for production GPU workloads
Kubernetes · GPU · Infrastructure · MLOps
