Introduction
Running AI models in production requires scalable, reliable infrastructure. Kubernetes with GPU support provides auto-scaling, health monitoring, and resource management for GPU workloads. This guide walks you through setting up a production-ready Kubernetes GPU cluster.
Prerequisites
- Kubernetes 1.28+ cluster (EKS, GKE, or bare metal)
- NVIDIA GPUs (A100, H100, or B200)
- kubectl and Helm installed
- Basic Kubernetes knowledge
Step 1: Install NVIDIA GPU Operator
The GPU Operator automates the management of all NVIDIA software components:
```shell
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true
```

Verify GPU detection:
```shell
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
```

Step 2: Configure GPU Resource Scheduling
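Before restricting GPU scheduling, it is worth confirming that a pod can actually be scheduled onto a GPU end to end. A throwaway pod like the following works for that check (the CUDA image tag is an example; any image containing `nvidia-smi` will do):

```yaml
# gpu-smoke-test.yaml -- illustrative one-off pod; delete it after checking the logs
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04   # example tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

`kubectl apply -f gpu-smoke-test.yaml` followed by `kubectl logs gpu-smoke-test` should print the GPU inventory; if the pod stays `Pending`, the GPU Operator is not exposing `nvidia.com/gpu` capacity yet.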
```yaml
# gpu-resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-serving
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"
```

Step 3: Deploy Triton Inference Server
```yaml
# triton-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          # Point Triton at the mounted model repository
          args: ["tritonserver", "--model-repository=/models"]
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
              memory: "32Gi"
              cpu: "8"
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
          volumeMounts:
            - name: model-store
              mountPath: /models
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-storage
```

Step 4: Auto-Scaling Configuration
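The HPA below scales on a custom `gpu_utilization` Pods metric. Kubernetes does not provide such a metric out of the box: it must be served through the custom metrics API, typically by installing prometheus-adapter on top of the Prometheus and DCGM exporter setup from Step 5. A rule along these lines maps DCGM utilization onto pods (illustrative sketch; the exact DCGM metric and label names depend on your exporter version):

```yaml
# prometheus-adapter values excerpt -- illustrative mapping, not a drop-in config
rules:
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
      resources:
        overrides:
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "gpu_utilization"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

Without an adapter serving this metric, the HPA will report it as unavailable and refuse to scale.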
```yaml
# gpu-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"
```

Step 5: Monitoring
Deploy Prometheus and Grafana for GPU monitoring:
```shell
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

# Install DCGM exporter for GPU metrics
helm install dcgm-exporter nvidia/dcgm-exporter \
  --namespace monitoring
```

Key metrics to monitor:
- GPU utilization per pod
- GPU memory usage
- Inference latency (P50, P95, P99)
- Request throughput
- Queue depth
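Once the DCGM exporter is scraped by Prometheus, the GPU-side metrics correspond to queries like the following (metric names are the standard DCGM exporter names as of recent versions; verify against your deployment):

```promql
# Average GPU utilization (percent) per pod
avg by (exported_pod) (DCGM_FI_DEV_GPU_UTIL)

# GPU framebuffer memory used, in MiB
DCGM_FI_DEV_FB_USED
```

Latency, throughput, and queue depth come from the application side; Triton exposes its own Prometheus metrics on port 8002 by default.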
Troubleshooting
- GPU not detected: Ensure GPU Operator pods are running, check node labels
- OOM kills: Increase memory limits, or reduce model batch size
- Slow inference: Check GPU utilization, ensure models are optimized with TensorRT
- Scaling issues: Verify HPA metrics are being collected, check DCGM exporter
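For the OOM case above, the fix is usually a resources bump on the serving container; the values below are examples to adjust based on observed usage, not recommendations:

```yaml
# excerpt from triton-deployment.yaml -- example values only
resources:
  requests:
    nvidia.com/gpu: 1
    memory: "48Gi"
  limits:
    nvidia.com/gpu: 1
    memory: "64Gi"
```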
Conclusion
A Kubernetes GPU cluster provides the foundation for scalable AI model serving. With proper auto-scaling, monitoring, and resource management, you can serve AI models reliably to thousands of concurrent users.
Key Takeaways
- GPU Operator simplifies NVIDIA driver and toolkit management
- Triton Inference Server supports multiple model frameworks
- Auto-scaling based on GPU utilization optimizes cost
- Monitoring is critical for production GPU workloads