Auto-Scale GPU Workloads on GKE Clusters

Configure Google Kubernetes Engine GPU autoscaling for production LLM deployments. Set up dynamic scaling, optimize costs with spot VMs, and maintain performance through intelligent autoscaling policies.

TL;DR

  • Spot VMs reduce A100 costs from $12/hour to $3-5/hour (60-75% savings)
  • HPA scales pods on CPU, memory, and custom metrics like queue depth
  • Cluster Autoscaler provisions GPU nodes in 3-8 minutes when pods are pending
  • GPU time-sharing runs 4 small models per GPU for development workloads

Introduction

Google Kubernetes Engine (GKE) provides enterprise-grade Kubernetes with sophisticated GPU autoscaling for Large Language Model workloads. Unlike managed services that abstract infrastructure control, GKE enables fine-grained optimization while automating operational tasks like node provisioning, pod scheduling, and traffic distribution. This control matters for LLM deployments where GPU costs dominate infrastructure spending and workload patterns vary unpredictably.

GKE's multi-layer autoscaling architecture handles different scaling dimensions simultaneously. Horizontal Pod Autoscaler (HPA) adjusts replica counts based on CPU, memory, or custom metrics like queue depth. Cluster Autoscaler provisions or removes nodes when pods cannot schedule due to resource constraints. Vertical Pod Autoscaler (VPA) right-sizes resource requests based on actual usage patterns, preventing over-provisioning waste.

This guide covers GKE cluster setup with GPU node pools, HPA configuration for LLM inference pods, spot VM integration for cost optimization, GPU time-sharing for multi-tenant workloads, and monitoring strategies. You'll learn to balance responsiveness with cost efficiency, handle spot instance interruptions gracefully, and scale from zero to hundreds of GPUs based on real-time demand. These patterns enable production LLM serving at significantly lower cost than always-on infrastructure while keeping the spike-to-serving delay to a few minutes rather than requiring permanently provisioned peak capacity.

Cluster Setup and Node Pool Configuration

Create GKE clusters with GPU-enabled node pools configured for autoscaling and spot instance integration.

Create GPU-Enabled Cluster

# Set variables
PROJECT_ID="your-project-id"
REGION="us-central1"
CLUSTER_NAME="llm-cluster"

# Create cluster with autoscaling
gcloud container clusters create $CLUSTER_NAME \
    --region=$REGION \
    --machine-type=n1-standard-8 \
    --num-nodes=1 \
    --enable-autoscaling \
    --min-nodes=1 \
    --max-nodes=10 \
    --enable-autorepair \
    --enable-autoupgrade \
    --addons=GcePersistentDiskCsiDriver

# Add GPU node pool (A100 GPUs require the a2 machine family)
gcloud container node-pools create gpu-pool \
    --cluster=$CLUSTER_NAME \
    --region=$REGION \
    --machine-type=a2-highgpu-2g \
    --accelerator=type=nvidia-tesla-a100,count=2 \
    --num-nodes=0 \
    --enable-autoscaling \
    --min-nodes=0 \
    --max-nodes=10 \
    --node-taints=nvidia.com/gpu=present:NoSchedule \
    --node-labels=workload=ml,gpu=a100

# Install NVIDIA GPU drivers (GKE driver-installer DaemonSet; the device plugin itself is managed by GKE)
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
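
Before deploying workloads, verify that the driver install succeeded and the GPUs are schedulable. A quick check with standard kubectl commands (the gpu=a100 label comes from the node pool above; because the pool starts at zero nodes, run this once the autoscaler has provisioned one):

# List GPU nodes and confirm nvidia.com/gpu shows up in allocatable resources
kubectl get nodes -l gpu=a100
kubectl describe nodes -l gpu=a100 | grep -i "nvidia.com/gpu"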

Horizontal Pod Autoscaler

Scale pods based on metrics.

Deploy LLM with HPA

# llama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama-inference
  template:
    metadata:
      labels:
        app: llama-inference
    spec:
      nodeSelector:
        workload: ml
        gpu: a100
      tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: present
        effect: NoSchedule
      containers:
      - name: model-server
        image: gcr.io/your-project/llama-vllm:latest
        ports:
        - containerPort: 8000
        resources:
          requests:
            nvidia.com/gpu: 2
            cpu: "8"
            memory: "64Gi"
          limits:
            nvidia.com/gpu: 2
            cpu: "16"
            memory: "128Gi"
        env:
        - name: MODEL_PATH
          value: "/models/llama-70b"
        - name: TENSOR_PARALLEL_SIZE
          value: "2"
---
apiVersion: v1
kind: Service
metadata:
  name: llama-service
  namespace: production
spec:
  selector:
    app: llama-inference
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-inference
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: inference_queue_length
      target:
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 2
        periodSeconds: 30
      selectPolicy: Max
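
After applying the manifests, watch the HPA react to load. These are standard kubectl commands and the names match the resources defined above:

# Watch replica counts and the metrics driving scaling decisions
kubectl get hpa llama-hpa -n production --watch

# Inspect scaling events and current metric readings
kubectl describe hpa llama-hpa -n production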

Custom Metrics for Autoscaling

Export application-specific metrics such as queue depth to Cloud Monitoring using the Monitoring API, then let HPA scale on them; queue-based signals respond to load faster than CPU or memory alone. Note that the HPA can only read these metrics (including the inference_queue_length Pods metric above) once the Custom Metrics Stackdriver Adapter is running in the cluster.
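
A sketch of the adapter installation, based on the commands Google's documentation has used for this setup; verify the manifest URL against the current GKE docs before relying on it:

# Allow your account to create the adapter's RBAC rules
kubectl create clusterrolebinding cluster-admin-binding \
    --clusterrole=cluster-admin \
    --user="$(gcloud config get-value account)"

# Deploy the Custom Metrics Stackdriver Adapter (new resource model)
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml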

Cluster Autoscaler Configuration

Automatically add/remove nodes.

Configure Cluster Autoscaler

# Set the cluster-wide autoscaling profile and default pool limits
gcloud container clusters update $CLUSTER_NAME \
    --region=$REGION \
    --enable-autoscaling \
    --min-nodes=1 \
    --max-nodes=15 \
    --autoscaling-profile=optimize-utilization

# Set autoscaling limits for the GPU pool
gcloud container clusters update $CLUSTER_NAME \
    --region=$REGION \
    --node-pool=gpu-pool \
    --enable-autoscaling \
    --min-nodes=0 \
    --max-nodes=20
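
To confirm the autoscaler is reacting, look for pending GPU pods and the scale-up events it records against them (TriggeredScaleUp is the event reason the cluster autoscaler emits):

# Pods stuck Pending for lack of GPU capacity trigger a node scale-up
kubectl get pods -n production --field-selector=status.phase=Pending

# Scale-up decisions appear as events on those pending pods
kubectl get events -n production --field-selector=reason=TriggeredScaleUp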

Cost Optimization with Spot VMs

Reduce costs by 60-91%.

Create Spot GPU Node Pool

# Create spot instance pool
gcloud container node-pools create gpu-spot-pool \
    --cluster=$CLUSTER_NAME \
    --region=$REGION \
    --machine-type=a2-highgpu-2g \
    --accelerator=type=nvidia-tesla-a100,count=2 \
    --spot \
    --num-nodes=0 \
    --enable-autoscaling \
    --min-nodes=0 \
    --max-nodes=15 \
    --node-taints=nvidia.com/gpu=present:NoSchedule,cloud.google.com/gke-spot=true:NoSchedule \
    --node-labels=workload=ml,gpu=a100,spot=true

Deploy with Spot Tolerance

spec:
  template:
    spec:
      nodeSelector:
        spot: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: present
        effect: NoSchedule
      - key: cloud.google.com/gke-spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      # Handle spot termination gracefully
      terminationGracePeriodSeconds: 30

Savings calculation:

  • On-demand node with 2x A100: $12/hour
  • Spot equivalent: ~$3-5/hour (60-75% discount)
  • Monthly savings: ~$5,000-6,000 per node (roughly $8/hour saved x 730 hours)

GPU Time-Sharing

Run multiple workloads per GPU.

Enable GPU Sharing

Configure GPU time-sharing so several pods share a single physical GPU. On GKE this is set on the node pool: with a maximum of 4 shared clients per GPU, each physical GPU is advertised to the scheduler as 4 nvidia.com/gpu resources. (Clusters running the upstream NVIDIA device plugin can achieve the same with a time-slicing ConfigMap and a plugin restart.) Best for development, small models (7B-13B), and cost-sensitive batch inference.
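
A minimal sketch of a time-shared pool using GKE's GPU sharing options on the accelerator flag (gpu-sharing-strategy and max-shared-clients-per-gpu; confirm the exact names against current gcloud documentation). T4s are used here because time-sharing targets small development models:

# Time-shared dev pool: each physical T4 is advertised to 4 pods
gcloud container node-pools create gpu-timeshare-pool \
    --cluster=$CLUSTER_NAME \
    --region=$REGION \
    --machine-type=n1-standard-8 \
    --accelerator=type=nvidia-tesla-t4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=4 \
    --num-nodes=0 \
    --enable-autoscaling \
    --min-nodes=0 \
    --max-nodes=5 \
    --node-labels=workload=ml,gpu=t4-shared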

Monitoring GPU Utilization

Track performance and optimize.

DCGM Exporter

Deploy the NVIDIA DCGM Exporter as a DaemonSet to collect GPU metrics from every GPU node. Export the metrics to Prometheus and Cloud Monitoring to track utilization, memory usage, temperature, and power draw. Cloud Monitoring also exposes GKE's built-in GPU duty cycle metric, while kubectl top nodes covers only CPU and memory.
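
A minimal DaemonSet sketch for the exporter; the image tag and the monitoring namespace are assumptions to check against the dcgm-exporter releases, and the selector/toleration match the GPU pools created earlier:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        workload: ml                 # only schedule onto the GPU pools from this guide
      tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: present
        effect: NoSchedule
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04   # tag is an assumption
        ports:
        - name: metrics
          containerPort: 9400        # default dcgm-exporter scrape port
        securityContext:
          capabilities:
            add: ["SYS_ADMIN"]       # needed by DCGM to query GPU telemetry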

Frequently Asked Questions

What's the best autoscaling strategy for production?

Recommended configuration:

  • HPA: Scale pods based on CPU (70%) and custom metrics (queue length)
  • Cluster Autoscaler: Add nodes when pods are pending
  • Mix of on-demand (baseline) and spot (burst capacity)

Example setup:

  • 2 on-demand GPU nodes (baseline traffic)
  • 0-15 spot GPU nodes (burst traffic)
  • Min 2 pod replicas, max 20 replicas
  • 5-minute scale-down delay

This provides stability with cost optimization.

How do I handle spot instance interruptions?

Best practices:

  1. Pod Disruption Budgets:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llama-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: llama-inference
  2. Graceful termination:
import signal
import sys

# 'server' is your inference server instance (for example, the HTTP server
# started by your serving framework); GKE delivers SIGTERM roughly 30 seconds
# before a spot VM is reclaimed
def signal_handler(sig, frame):
    print("Spot termination detected, draining connections...")
    server.shutdown()  # stop accepting new requests, finish in-flight ones
    sys.exit(0)

signal.signal(signal.SIGTERM, signal_handler)
  3. Spread across node pools:
  • Deploy critical workloads on on-demand
  • Use spot for scalable, fault-tolerant workloads

Can I autoscale based on custom metrics?

Yes, using Cloud Monitoring (formerly Stackdriver) external metrics, read through the Custom Metrics Stackdriver Adapter:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-inference
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: custom.googleapis.com|inference|queue_length
        selector:
          matchLabels:
            resource.type: k8s_pod
      target:
        type: AverageValue
        averageValue: "10"

Supported custom metrics:

  • Queue length (Cloud Tasks, Pub/Sub)
  • Request latency (Cloud Trace)
  • Business metrics (custom exports)
  • Application-specific metrics

Conclusion

GKE GPU autoscaling enables production LLM deployments that balance cost efficiency with performance reliability. The multi-layer autoscaling architecture handles different scaling dimensions: HPA for pod replica management, Cluster Autoscaler for node provisioning, and VPA for resource right-sizing. Together, these systems respond to traffic changes within 3-8 minutes for GPU node provisioning plus 1-5 minutes for model loading.

Cost optimization through spot VMs delivers 60-75% savings versus on-demand instances. Combine 2-3 on-demand nodes for baseline capacity with 0-15 spot nodes for burst traffic. This hybrid approach maintains availability during spot interruptions while capturing significant cost savings. Pod Disruption Budgets ensure minimum replica counts during node replacements, preventing service degradation.

Start with HPA targeting 70% CPU utilization and custom metrics like queue depth for responsive scaling. Enable Cluster Autoscaler with optimize-utilization profile to balance bin-packing efficiency with provisioning speed. Monitor GPU utilization with DCGM Exporter and adjust resource requests based on actual usage patterns. Production GKE deployments achieve 40-60% lower costs than static over-provisioned infrastructure while maintaining sub-200ms P95 latency and 99.9% availability for LLM inference workloads.

Frequently Asked Questions

How quickly does GKE scale up for LLM traffic spikes?

GKE Cluster Autoscaler typically provisions new GPU nodes in 3-8 minutes; smaller, widely available GPUs (T4, L4) tend toward the lower end, while scarcer H100 nodes can take 6-12 minutes. Add 30-90 seconds for pod scheduling and 60-300 seconds for model loading (Llama 7B: ~60s, 70B: ~240s). Total: roughly 5-12 minutes from spike to serving. For faster response, use CronJobs to pre-warm capacity ahead of predictable traffic (see the sketch below) or keep 2-3 GPU nodes always on for immediate serving while the autoscaler provisions additional capacity.
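
A sketch of the pre-warm idea as a Kubernetes CronJob; the prewarm-scaler service account (with RBAC permission to scale deployments), the schedule, and the replica count are all illustrative assumptions:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: prewarm-llama
  namespace: production
spec:
  schedule: "45 8 * * 1-5"               # 08:45 on weekdays, ahead of a known 09:00 spike
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: prewarm-scaler   # hypothetical SA allowed to scale deployments
          restartPolicy: Never
          containers:
          - name: scale-up
            image: bitnami/kubectl:latest
            command: ["kubectl", "scale", "deployment/llama-inference",
                      "--replicas=6", "-n", "production"]

Because the HPA may scale back down once its stabilization window passes, temporarily raising the HPA's minReplicas for the peak window is an equally valid alternative.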

What's the cost impact of over-provisioning versus under-provisioning?

Over-provisioning by 20% costs ~$1,000/month extra per A100 node but prevents latency spikes during traffic surges. Under-provisioning causes request queuing (P95 latency increases 3-10x) and potential timeouts. Balance through monitoring: target P95 latency <500ms and GPU utilization 60-75% for optimal provisioning. If utilization >85%, add capacity proactively. Use VPA to right-size deployments. For cost-sensitive workloads, tolerate brief latency degradation during scale-up rather than maintaining 30-40% overhead continuously.
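
For the right-sizing step, a recommendation-only VerticalPodAutoscaler is a low-risk starting point (VPA must be enabled on the cluster); updateMode "Off" surfaces suggested requests without evicting pods or fighting the HPA:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llama-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-inference
  updatePolicy:
    updateMode: "Off"                # recommendations only; read them with kubectl describe vpa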

Can I mix CPU and GPU node pools for different model sizes?

Yes. Deploy small models (Gemma 2B, Phi-2) on CPU-only node pools (n1-highmem-8 at ~$0.47/hour) for roughly 80% cost savings, reserving A100 GPU pools for large models (Llama 70B). Route pods with node selectors: GPU models target the cloud.google.com/gke-accelerator: nvidia-tesla-a100 label (or the gpu: a100 label used in this guide), while CPU models target a CPU-pool label and omit the GPU request and toleration entirely. Typical setup: 2-4 CPU nodes serving 70% of requests, 1-2 GPU nodes scaling to a maximum of 8 for the remaining 30% of traffic. This hybrid approach reduces total infrastructure costs by 40-60% versus GPU-only clusters.
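
A fragment showing how a CPU-only model deployment might pin itself to such a pool; the workload: cpu-inference label is an assumed label you would attach when creating the CPU node pool:

spec:
  template:
    spec:
      nodeSelector:
        workload: cpu-inference      # assumed label on the CPU-only node pool
      containers:
      - name: model-server
        resources:
          requests:
            cpu: "6"
            memory: "48Gi"
          # no nvidia.com/gpu request and no GPU toleration needed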