Configure AKS GPU Nodes for LLM Workloads

Deploy LLMs on Azure Kubernetes Service with native GPU support, dynamic autoscaling, and full Kubernetes control, using V100, A100, or H100 nodes to run custom inference frameworks at scale while optimizing costs with spot and reserved instances.

TL;DR

  • Managed AKS node pools handle Kubernetes upgrades and security patches automatically, with the cluster autoscaler adjusting node counts
  • Spot instances provide 60-80% discount for development and fault-tolerant workloads
  • NVIDIA DCGM Exporter tracks GPU utilization, memory, and temperature per pod
  • 1-year reserved instances save 40% for predictable production workloads

Introduction

Azure Kubernetes Service provides enterprise-grade Kubernetes with native GPU support for demanding LLM workloads. Deploy any model architecture without platform constraints. Scale dynamically from zero to hundreds of GPU nodes based on demand. Pay only for resources consumed. AKS eliminates cluster management complexity while maintaining full control over deployment configurations.

Managed Kubernetes includes automatic upgrades and security patches. GPU node pools support V100, A100, and H100 accelerators. Auto-scaling operates at both cluster and pod levels. Native Azure integration connects seamlessly with Container Registry, Key Vault, and Azure Monitor. Multi-tenancy support enables isolated environments for different teams. Cost optimization tools help control cloud spending.

This guide covers GPU node pool creation, LLM deployment configurations, horizontal and cluster autoscaling, persistent storage for models, GPU monitoring with DCGM exporter, and cost reduction strategies.

AKS works best when you need Kubernetes flexibility, run multiple models simultaneously, use custom inference frameworks, implement complex deployment patterns, support multi-team environments, or maintain hybrid cloud requirements. Organizations choose AKS for production LLM deployments requiring maximum control and customization compared to fully managed alternatives.

GPU Node Pool Configuration

Create dedicated GPU node pools optimized for different workload types.

Create AKS cluster with system node pool:

# Create resource group
az group create \
    --name ml-production \
    --location eastus

# Create cluster
az aks create \
    --resource-group ml-production \
    --name llm-cluster \
    --node-count 2 \
    --node-vm-size Standard_D4s_v3 \
    --enable-managed-identity \
    --generate-ssh-keys \
    --network-plugin azure \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 5

Add V100 node pool for development:

az aks nodepool add \
    --resource-group ml-production \
    --cluster-name llm-cluster \
    --name v100pool \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --enable-cluster-autoscaler \
    --min-count 0 \
    --max-count 3 \
    --node-taints sku=gpu:NoSchedule \
    --labels workload=ml hardware=gpu sku=v100

Add A100 node pool for production:

az aks nodepool add \
    --resource-group ml-production \
    --cluster-name llm-cluster \
    --name a100pool \
    --node-count 2 \
    --node-vm-size Standard_NC24ads_A100_v4 \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 10 \
    --node-taints sku=gpu:NoSchedule \
    --labels workload=ml hardware=gpu sku=a100

Install NVIDIA device plugin:

# Get credentials
az aks get-credentials \
    --resource-group ml-production \
    --name llm-cluster

# Install the NVIDIA device plugin (pinning a tagged release is safer than tracking main)
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml

# Verify GPUs detected
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
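
Before deploying models, it helps to confirm that a pod can actually be scheduled onto a GPU node. The manifest below is a minimal sketch: it requests one GPU, tolerates the GPU taint, and runs nvidia-smi once (the pod name, file name, and CUDA image tag are illustrative).

# gpu-test.yaml (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  nodeSelector:
    hardware: gpu
  tolerations:
  - key: sku
    operator: Equal
    value: gpu
    effect: NoSchedule
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

Apply it with kubectl apply -f gpu-test.yaml, check kubectl logs gpu-test for the GPU listing, then delete the pod.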

Model Deployment and Auto-Scaling

Deploy LLM with GPU resources and configure horizontal pod autoscaling.

# llama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
  namespace: ml-production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama-inference
  template:
    metadata:
      labels:
        app: llama-inference
    spec:
      nodeSelector:
        workload: ml
        sku: a100
      tolerations:
      - key: sku
        operator: Equal
        value: gpu
        effect: NoSchedule
      containers:
      - name: model-server
        image: your-acr.azurecr.io/llama-vllm:latest
        ports:
        - containerPort: 8000
        resources:
          requests:
            nvidia.com/gpu: 1
            cpu: "8"
            memory: "64Gi"
          limits:
            nvidia.com/gpu: 1
            cpu: "16"
            memory: "128Gi"
        env:
        - name: MODEL_PATH
          value: "/models/llama-70b"
        - name: GPU_MEMORY_UTILIZATION
          value: "0.95"
        volumeMounts:
        - name: model-storage
          mountPath: /models
          readOnly: true
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: llama-service
  namespace: ml-production
spec:
  selector:
    app: llama-inference
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer
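
Assuming the manifest above is saved as llama-deployment.yaml, apply it and confirm the pods land on A100 nodes (the ml-production namespace must exist before applying):

# Create the namespace and apply the manifests
kubectl create namespace ml-production
kubectl apply -f llama-deployment.yaml

# Verify scheduling and the external IP of the service
kubectl get pods -n ml-production -o wide
kubectl get service llama-service -n ml-production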

Configure horizontal pod autoscaler:

# llama-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-hpa
  namespace: ml-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-inference
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      selectPolicy: Max
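
With the HPA saved as llama-hpa.yaml, apply it and watch replica counts react to load:

kubectl apply -f llama-hpa.yaml
kubectl get hpa llama-hpa -n ml-production --watch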

Configure cluster autoscaler:

# Update node pool autoscaler
az aks nodepool update \
    --resource-group ml-production \
    --cluster-name llm-cluster \
    --name a100pool \
    --update-cluster-autoscaler \
    --min-count 1 \
    --max-count 15

# Configure scale-down parameters
az aks update \
    --resource-group ml-production \
    --name llm-cluster \
    --cluster-autoscaler-profile \
        scale-down-delay-after-add=10m \
        scale-down-unneeded-time=10m \
        scale-down-utilization-threshold=0.5
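
To see how the cluster autoscaler is behaving after these changes, inspect the status config map it publishes in kube-system (a quick check, assuming autoscaling is enabled as above):

# Inspect autoscaler status and recent scale decisions
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Watch GPU node counts change as pods scale
kubectl get nodes -l workload=ml --watch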

Storage and GPU Monitoring

Configure persistent storage for models and monitor GPU utilization.

Azure Files for model storage:

# model-storage.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-pv
spec:
  capacity:
    storage: 500Gi
  accessModes:
  - ReadOnlyMany
  storageClassName: azurefile-premium
  azureFile:
    secretName: azure-storage-secret
    shareName: llm-models
    readOnly: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
  namespace: ml-production
spec:
  accessModes:
  - ReadOnlyMany
  storageClassName: azurefile-premium
  resources:
    requests:
      storage: 500Gi
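
The PV above expects an existing llm-models file share and an azure-storage-secret holding the storage account credentials. A minimal sketch of creating both follows; the storage account name llmmodelstorage is a placeholder and must be globally unique:

# Premium storage account and file share for model weights
az storage account create \
    --name llmmodelstorage \
    --resource-group ml-production \
    --location eastus \
    --sku Premium_LRS \
    --kind FileStorage

az storage share-rm create \
    --storage-account llmmodelstorage \
    --name llm-models \
    --quota 500

# Secret with the account name and key, in the workload namespace
STORAGE_KEY=$(az storage account keys list \
    --account-name llmmodelstorage \
    --resource-group ml-production \
    --query "[0].value" -o tsv)

kubectl create secret generic azure-storage-secret \
    --namespace ml-production \
    --from-literal=azurestorageaccountname=llmmodelstorage \
    --from-literal=azurestorageaccountkey=$STORAGE_KEY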

NVIDIA DCGM Exporter for GPU monitoring:

# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm-exporter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  template:
    metadata:
      labels:
        app: nvidia-dcgm-exporter
    spec:
      nodeSelector:
        hardware: gpu
      tolerations:
      - key: sku
        operator: Equal
        value: gpu
        effect: NoSchedule
      containers:
      - name: nvidia-dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu20.04
        ports:
        - containerPort: 9400
          name: metrics
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
        volumeMounts:
        - name: pod-gpu-resources
          readOnly: true
          mountPath: /var/lib/kubelet/pod-resources
      volumes:
      - name: pod-gpu-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
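
The exporter serves Prometheus-format metrics such as DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED, and DCGM_FI_DEV_GPU_TEMP on port 9400. How you scrape them depends on your monitoring stack; the Service below is a sketch for setups that discover targets through prometheus.io annotations:

# dcgm-exporter-service.yaml (illustrative)
apiVersion: v1
kind: Service
metadata:
  name: nvidia-dcgm-exporter
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
spec:
  selector:
    app: nvidia-dcgm-exporter
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400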

Cost Optimization Strategies

Reduce GPU infrastructure costs through spot instances and reserved capacity.

Create spot instance node pool:

az aks nodepool add \
    --resource-group ml-production \
    --cluster-name llm-cluster \
    --name a100spot \
    --priority Spot \
    --eviction-policy Delete \
    --spot-max-price -1 \
    --node-vm-size Standard_NC24ads_A100_v4 \
    --enable-cluster-autoscaler \
    --min-count 0 \
    --max-count 5 \
    --node-taints kubernetes.azure.com/scalesetpriority=spot:NoSchedule \
    --labels priority=spot workload=ml

Spot instances typically provide a 60-80% discount. Pay-as-you-go pricing for a Standard_NC24ads_A100_v4 instance is about $3.67/hour, while the spot price runs around $1.10/hour (prices vary by region and over time). That saves roughly $2.57/hour, or about $1,875 per instance per month at 730 hours, for non-critical workloads.

Deploy workloads to spot nodes by adding the spot node selector and toleration to the deployment's pod template:

spec:
  template:
    spec:
      nodeSelector:
        priority: spot
      tolerations:
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule

Purchase reserved instances for predictable workloads:

# 1-year reservation saves roughly 40% (requires the Azure CLI 'reservations' extension;
# a full purchase also specifies billing scope and reserved resource type)
az reservations reservation-order purchase \
    --reservation-order-id "order-id" \
    --sku Standard_NC24ads_A100_v4 \
    --location eastus \
    --quantity 3 \
    --term P1Y

Scheduled scaling reduces costs during off-hours:

# scheduled-scaler.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-night
  namespace: ml-production
spec:
  schedule: "0 22 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cluster-scaler
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - kubectl scale deployment llama-inference -n ml-production --replicas=1
          restartPolicy: OnFailure
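
A companion job can restore capacity before business hours; the schedule and replica count here are assumptions to adjust for your traffic pattern, and it reuses the same cluster-scaler service account:

# scheduled-scaler-up.yaml (illustrative)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-morning
  namespace: ml-production
spec:
  schedule: "0 6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cluster-scaler
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - kubectl scale deployment llama-inference -n ml-production --replicas=2
          restartPolicy: OnFailure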

Instance type comparison helps optimize costs:

  • Standard_NC6s_v3 (V100 16GB): $3.06/hour, approximately 80 tokens/second. Good for 7B-13B models and development.
  • Standard_NC24ads_A100_v4 (A100 80GB): $3.67/hour, approximately 150 tokens/second. Best value for most production workloads running 13B-70B models.
  • Standard_ND96asr_v4 (8x A100): $27.20/hour, approximately 600 tokens/second. Required for 70B+ models needing high throughput.

Conclusion

AKS GPU configuration provides production-ready infrastructure for LLM deployments requiring Kubernetes flexibility and control. Dedicated GPU node pools isolate workloads by performance tier. Horizontal pod autoscaling and cluster autoscaling respond to traffic dynamically. Persistent storage enables efficient model loading. NVIDIA DCGM Exporter tracks GPU utilization and temperature. Spot instances reduce costs by 60-80% for fault-tolerant workloads. Reserved instances provide roughly 40% savings for predictable production workloads.

Organizations running multiple models or requiring custom deployment patterns choose AKS over fully managed alternatives. Start with an NC24ads_A100_v4 node pool for balanced cost-performance. Enable autoscaling with appropriate thresholds. Implement GPU monitoring for visibility. Use spot instances for development and reserved instances for production. Your Kubernetes deployment scales efficiently while optimizing infrastructure costs.