Deploy Production LLMs on OKE Kubernetes

TL;DR

  • Zero control plane costs versus $73/month on AWS EKS and Azure AKS
  • 3x VM.GPU.A10.1 nodes handle 200-300 requests/min at $3,395/month total
  • HPA scales pods on CPU/memory with 5-minute scale-down stabilization
  • NVIDIA DCGM Exporter tracks GPU utilization, temperature, and power draw

Deploy large language models on Oracle Kubernetes Engine with GPU support. This guide covers OKE cluster setup from zero to production, GPU node pool configuration with A10 and A100 instances, and production-grade deployments using vLLM.

Kubernetes provides the orchestration layer needed for scalable LLM inference. OKE delivers fully managed Kubernetes clusters with zero control plane costs and native OCI service integration. Deploy containerized LLM workloads that scale from 2 to 20+ GPU nodes based on traffic demand.

Learn production patterns including GPU device plugins for proper resource allocation, horizontal pod autoscaling based on utilization metrics, zero-downtime rolling deployments for model updates, and monitoring stacks using Prometheus and Grafana. OKE clusters handle 200-300 requests per minute on small deployments and scale to thousands of requests for enterprise workloads.

OKE Architecture Overview

Oracle Kubernetes Engine provides a fully managed Kubernetes service optimized for GPU workloads. OKE integrates with OCI networking, storage, and security services for enterprise-grade deployments.

Key Components:

  • Control Plane: Fully managed Kubernetes masters (free)
  • Worker Nodes: GPU-enabled compute instances
  • Container Registry (OCIR): Private Docker registry
  • Load Balancer: Managed ingress with SSL termination
  • Block Storage: Persistent volumes for model storage
  • File Storage: Shared NFS for multi-pod access

Architecture Benefits:

  • Zero control plane costs
  • Native integration with OCI services
  • Automatic OS patching and updates
  • Support for mixed CPU/GPU node pools
  • Regional and multi-AD deployments
  • Built-in pod security policies

Create Production OKE Cluster

Set up a production-ready Kubernetes cluster with high availability.
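
Before provisioning, it can help to confirm which Kubernetes versions OKE currently offers and which availability domains exist in your region. A quick check, assuming the OCI CLI is configured and $COMPARTMENT_ID is exported:

# List OKE-supported Kubernetes versions and the region's availability domains
oci ce cluster-options get --cluster-option-id all
oci iam availability-domain list --compartment-id $COMPARTMENT_ID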

# Create VCN for cluster
oci network vcn create \
    --compartment-id $COMPARTMENT_ID \
    --display-name llm-vcn \
    --cidr-blocks '["10.0.0.0/16"]' \
    --dns-label llmvcn

# Create subnets
oci network subnet create \
    --compartment-id $COMPARTMENT_ID \
    --vcn-id $VCN_ID \
    --display-name control-plane-subnet \
    --cidr-block 10.0.1.0/24 \
    --dns-label k8sapi

oci network subnet create \
    --compartment-id $COMPARTMENT_ID \
    --vcn-id $VCN_ID \
    --display-name worker-subnet \
    --cidr-block 10.0.10.0/24 \
    --dns-label workers

oci network subnet create \
    --compartment-id $COMPARTMENT_ID \
    --vcn-id $VCN_ID \
    --display-name loadbalancer-subnet \
    --cidr-block 10.0.20.0/24 \
    --dns-label loadbalancer

# Create OKE cluster
oci ce cluster create \
    --compartment-id $COMPARTMENT_ID \
    --name llm-production-cluster \
    --kubernetes-version v1.28.2 \
    --vcn-id $VCN_ID \
    --endpoint-subnet-id $CONTROL_PLANE_SUBNET_ID \
    --service-lb-subnet-ids "[$LB_SUBNET_ID]" \
    --cluster-pod-network-options '[{
        "cni-type": "FLANNEL_OVERLAY"
    }]' \
    --options '{
        "service-lb-config": {
            "subnet-ids": ["'$LB_SUBNET_ID'"]
        },
        "kubernetes-network-config": {
            "pods-cidr": "10.244.0.0/16",
            "services-cidr": "10.96.0.0/16"
        }
    }' \
    --wait-for-state ACTIVE

# Get kubeconfig
oci ce cluster create-kubeconfig \
    --cluster-id $CLUSTER_ID \
    --file ~/.kube/config \
    --region us-ashburn-1 \
    --token-version 2.0.0

# Verify cluster access
kubectl cluster-info
kubectl get nodes

Cluster creation time: 7-10 minutes

Configure GPU Node Pools

Add GPU-enabled worker nodes with optimized configurations.

Create GPU Node Pool:

# gpu-startup.sh - Node initialization script
cat << 'EOF' > gpu-startup.sh
#!/bin/bash
# Install NVIDIA drivers (skip if the node pool uses an OKE GPU image with drivers preinstalled)
dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
dnf module install -y nvidia-driver:535-dkms

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | tee /etc/yum.repos.d/nvidia-container-toolkit.repo
dnf install -y nvidia-container-toolkit
# Restart the container runtime so it picks up the toolkit
# (recent OKE worker images use CRI-O; fall back to Docker if your image runs it)
systemctl restart crio || systemctl restart docker

# Verify GPU
nvidia-smi
EOF

# Create node pool with A100 GPUs
oci ce node-pool create \
    --cluster-id $CLUSTER_ID \
    --compartment-id $COMPARTMENT_ID \
    --name gpu-a100-pool \
    --node-shape VM.GPU.A100.1 \
    --node-source-details '{
        "source-type": "IMAGE",
        "image-id": "'$IMAGE_ID'"
    }' \
    --size 3 \
    --placement-configs '[{
        "availability-domain": "US-ASHBURN-AD-1",
        "subnet-id": "'$WORKER_SUBNET_ID'"
    }]' \
    --node-config-details '{
        "size": 3,
        "placement-configs": [{
            "availability-domain": "US-ASHBURN-AD-1",
            "subnet-id": "'$WORKER_SUBNET_ID'"
        }],
        "node-metadata": {
            "user_data": "'$(base64 -w 0 gpu-startup.sh)'"
        }
    }' \
    --node-shape-config '{
        "ocpus": 15,
        "memory-in-gbs": 240
    }' \
    --wait-for-state ACTIVE

# Create auto-scaling node pool for A10 GPUs
oci ce node-pool create \
    --cluster-id $CLUSTER_ID \
    --compartment-id $COMPARTMENT_ID \
    --name gpu-a10-autoscale \
    --node-shape VM.GPU.A10.1 \
    --size 2 \
    --placement-configs '[{
        "availability-domain": "US-ASHBURN-AD-1",
        "subnet-id": "'$WORKER_SUBNET_ID'"
    }]' \
    --node-config-details '{
        "size": 2,
        "is-pv-encryption-in-transit-enabled": true
    }' \
    --node-shape-config '{
        "ocpus": 15,
        "memory-in-gbs": 240
    }'
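
Node pool provisioning takes several minutes. A quick way to confirm both pools exist and that the GPU nodes have registered with the cluster (using the same variables as above):

# List node pools and check that GPU workers have joined
oci ce node-pool list --compartment-id $COMPARTMENT_ID --cluster-id $CLUSTER_ID --output table
kubectl get nodes -o wide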

Install NVIDIA Device Plugin:

# Deploy NVIDIA device plugin
kubectl create -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
EOF

# Verify GPU nodes
# OKE sets this label to the node's shape name (e.g. VM.GPU.A10.1)
kubectl get nodes -L node.kubernetes.io/instance-type
kubectl describe nodes | grep -A 10 "Allocatable"
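
Before deploying the full model server, a throwaway pod that requests one GPU and runs nvidia-smi confirms the device plugin is advertising GPUs to the scheduler. A minimal sketch (pod name and image tag are illustrative):

# GPU smoke test: schedule a pod that requests one GPU and prints nvidia-smi output
kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Once the pod completes, the logs should show the nvidia-smi table
kubectl logs gpu-smoke-test
kubectl delete pod gpu-smoke-test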

Deploy LLM Workloads with vLLM

Production deployment of Llama 2 7B using vLLM for high-throughput inference.

Build Container Image:

# Dockerfile
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

WORKDIR /app

# Install Python, git, and curl (curl is used later for in-pod test requests)
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install vLLM and dependencies
RUN pip install --no-cache-dir \
    vllm==0.2.7 \
    transformers==4.36.0 \
    torch==2.1.0 \
    fastapi==0.109.0 \
    uvicorn==0.27.0

# Download model at build time (or mount it from storage at runtime instead).
# meta-llama models are gated on Hugging Face, so pass a token at build time:
#   docker build --build-arg HF_TOKEN=<token> ...
ARG HF_TOKEN
ENV HUGGING_FACE_HUB_TOKEN=$HF_TOKEN
RUN python3 -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
    AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf'); \
    AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')"

COPY serve.py /app/

EXPOSE 8000

CMD ["python3", "serve.py"]

serve.py:

from vllm import LLM, SamplingParams
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI()

# Initialize vLLM
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.95,
    max_num_batched_tokens=8192,
    max_num_seqs=256
)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.8
    top_p: float = 0.95

@app.post("/generate")
async def generate(request: GenerateRequest):
    try:
        sampling_params = SamplingParams(
            temperature=request.temperature,
            top_p=request.top_p,
            max_tokens=request.max_tokens
        )
        outputs = llm.generate([request.prompt], sampling_params)
        return {
            "text": outputs[0].outputs[0].text,
            "tokens": len(outputs[0].outputs[0].token_ids)
        }
    except Exception as e:
        logger.error(f"Generation error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Build and push to OCIR:

# Login to Oracle Container Registry
docker login us-ashburn-1.ocir.io -u $TENANCY_NAMESPACE/oracleidentitycloudservice/$USERNAME

# Build image (HF_TOKEN is required to download the gated Llama 2 weights)
docker build --build-arg HF_TOKEN=$HF_TOKEN -t llama-vllm:v1.0 .

# Tag and push
docker tag llama-vllm:v1.0 us-ashburn-1.ocir.io/$TENANCY_NAMESPACE/llama-vllm:v1.0
docker push us-ashburn-1.ocir.io/$TENANCY_NAMESPACE/llama-vllm:v1.0
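
The manifest below embeds the OCIR pull secret as a base64-encoded dockerconfigjson. The same secret can instead be created imperatively (remove the Secret object from the manifest in that case); $AUTH_TOKEN is an OCI auth token generated for your user:

# Create the namespace and OCIR pull secret up front
kubectl create namespace llm-inference
kubectl create secret docker-registry ocir-secret \
    --namespace llm-inference \
    --docker-server=us-ashburn-1.ocir.io \
    --docker-username="$TENANCY_NAMESPACE/oracleidentitycloudservice/$USERNAME" \
    --docker-password="$AUTH_TOKEN"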

Kubernetes Deployment:

# llama-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llm-inference
---
apiVersion: v1
kind: Secret
metadata:
  name: ocir-secret
  namespace: llm-inference
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded-docker-config>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-7b-inference
  namespace: llm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llama-7b
      version: v1
  template:
    metadata:
      labels:
        app: llama-7b
        version: v1
    spec:
      imagePullSecrets:
      - name: ocir-secret
      nodeSelector:
        # OKE sets this label to the node shape; change it to match your GPU pool
        node.kubernetes.io/instance-type: VM.GPU.A10.1
      containers:
      - name: vllm
        image: us-ashburn-1.ocir.io/namespace/llama-vllm:v1.0
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: 32Gi
            cpu: 8
          limits:
            nvidia.com/gpu: 1
            memory: 48Gi
            cpu: 12
        ports:
        - containerPort: 8000
          name: http
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 5
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: VLLM_LOGGING_LEVEL
          value: "INFO"
---
apiVersion: v1
kind: Service
metadata:
  name: llama-7b-service
  namespace: llm-inference
spec:
  selector:
    app: llama-7b
  ports:
  - port: 80
    targetPort: 8000
    protocol: TCP
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llama-7b-ingress
  namespace: llm-inference
  annotations:
    kubernetes.io/ingress.class: "nginx"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    - llm.example.com
    secretName: llm-tls-cert
  rules:
  - host: llm.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: llama-7b-service
            port:
              number: 80

Deploy to cluster:

# Apply configurations
kubectl apply -f llama-deployment.yaml

# Verify deployment
kubectl get pods -n llm-inference
kubectl get svc -n llm-inference
kubectl logs -n llm-inference -l app=llama-7b --tail=50

# Test inference
POD=$(kubectl get pod -n llm-inference -l app=llama-7b -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n llm-inference $POD -- curl -X POST http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "What is machine learning?", "max_tokens": 100}'

Horizontal Pod Autoscaling

Auto-scale deployments based on CPU and memory utilization. Scaling on GPU metrics additionally requires exposing them to the HPA as custom metrics, for example from the DCGM exporter set up in the monitoring section below.

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-7b-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-7b-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 2
        periodSeconds: 30
      selectPolicy: Max

Install Metrics Server:

# Deploy metrics server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Apply HPA
kubectl apply -f hpa.yaml

# Monitor autoscaling
kubectl get hpa -n llm-inference -w
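
To see the autoscaler react, a simple load generator inside the cluster can push CPU utilization past the 70% target. A rough sketch (the load-gen pod name and prompt are illustrative):

# Generate sustained load against the in-cluster service, then watch replicas scale
kubectl run load-gen -n llm-inference --image=curlimages/curl --restart=Never --command -- /bin/sh -c \
    'while true; do curl -s -X POST http://llama-7b-service/generate \
        -H "Content-Type: application/json" \
        -d "{\"prompt\": \"Explain Kubernetes in one paragraph\", \"max_tokens\": 128}" > /dev/null; done'

kubectl get hpa llama-7b-hpa -n llm-inference -w

# Clean up when finished
kubectl delete pod load-gen -n llm-inference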

Monitoring with Prometheus and Grafana

Track inference performance and GPU utilization.

Deploy Prometheus Stack:

# Add Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace monitoring \
    --create-namespace \
    --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
    --set grafana.adminPassword=admin123

# Install NVIDIA DCGM Exporter for GPU metrics
kubectl create -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # OKE sets this label to the node shape; list every GPU shape in the cluster
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["VM.GPU.A10.1", "VM.GPU.A100.1"]
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
        ports:
        - containerPort: 9400
          name: metrics
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: pod-gpu-resources
          readOnly: true
          mountPath: /var/lib/kubelet/pod-resources
      volumes:
      - name: pod-gpu-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
EOF

# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
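
The Prometheus instance installed above only scrapes the DCGM exporter if it is exposed through a Service matched by a ServiceMonitor, which the DaemonSet alone does not provide. A minimal sketch of both (object names are illustrative):

# Expose the DCGM exporter and let the Prometheus Operator discover it
kubectl create -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s
EOF

GPU metrics then appear under the DCGM_FI_* prefix and can be spot-checked against the Prometheus API (the service name below follows from the Helm release name used above):

kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=avg(DCGM_FI_DEV_GPU_UTIL)'
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=max(DCGM_FI_DEV_GPU_TEMP)'
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=sum(DCGM_FI_DEV_POWER_USAGE)'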

Cost Analysis

OKE cluster costs for LLM deployments.

Small Production (Llama 2 7B):

  • 3x VM.GPU.A10.1 nodes: $3,285/month
  • OKE control plane: $0 (free)
  • Load Balancer: $60/month
  • Block storage (1TB): $50/month
  • Total: $3,395/month
  • Handles: 200-300 requests/min
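  • Works out to roughly $0.31 per 1,000 requests at a sustained 250 requests/min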

Medium Production (Llama 2 13B):

  • 3x VM.GPU.A100.1 nodes: $6,459/month
  • OKE control plane: $0
  • Load Balancer: $60/month
  • Block storage (2TB): $100/month
  • Total: $6,619/month
  • Handles: 150-200 requests/min

Large Production (Llama 2 70B):

  • 2x BM.GPU.A100-v2.8 nodes: $34,456/month
  • OKE control plane: $0
  • Load Balancer (flexible): $150/month
  • Block storage (5TB): $250/month
  • Total: $34,856/month
  • Handles: 40-80 requests/min

Conclusion

Oracle Kubernetes Engine provides production-ready infrastructure for scalable LLM deployments. OKE eliminates control plane costs while delivering managed Kubernetes masters and automatic OS patching. GPU node pools with A10 and A100 instances support models from 7B to 70B+ parameters with horizontal scaling from 2 to 20+ nodes. Container-based deployments enable zero-downtime updates through rolling deployment strategies and rapid scaling with sub-5-minute node provisioning. Horizontal pod autoscaling adjusts replica counts based on CPU and memory utilization automatically. Monitor GPU utilization and inference performance using Prometheus and Grafana dashboards. Storage options including Block Volumes and File Storage optimize for different access patterns and pod startup times. Start with small 3-node clusters for development, then scale to multi-node production configurations as traffic grows.


Frequently Asked Questions

What are the key differences between deploying LLMs on OKE versus managed services like AWS EKS or Azure AKS, and how do costs compare?

Oracle Kubernetes Engine offers significant advantages for LLM deployments compared to AWS EKS and Azure AKS. The most substantial difference is zero control plane costs: OKE provides fully managed Kubernetes masters at no charge, while EKS costs $73/month per cluster and AKS charges $73/month for uptime SLA. For GPU compute, OKE pricing is 25-35% lower: VM.GPU.A100.1 runs $2,153/month versus $2,920 on AWS (p3.2xlarge) and $2,750 on Azure (NC6s v3). Network egress is dramatically cheaper on OCI at $0.0085/GB versus AWS $0.09/GB, critical for high-throughput LLM APIs serving millions of tokens daily. OKE integrates natively with OCI services: block volumes attach in 10 seconds versus 45-60 seconds on EKS, and OCIR (container registry) provides unlimited storage at no additional cost. Load balancer costs are comparable: OCI charges $60/month for 100 Mbps versus AWS ALB at $22/month plus $0.008/LCU-hour. However, EKS offers superior ecosystem integration with AWS services like SageMaker and Bedrock, while AKS provides better Azure OpenAI integration. OKE shines for cost-conscious deployments: a 3-node A100 cluster costs $6,619/month on OCI versus $9,133/month on AWS and $8,523/month on Azure, saving $22,848-$30,168 annually while delivering equivalent performance.

How do I implement zero-downtime deployments for LLM models on OKE when updating to newer model versions or changing inference configurations?

Implementing zero-downtime LLM updates on OKE requires rolling deployment strategies with careful resource management. Use Kubernetes Deployments with RollingUpdate strategy, setting maxSurge to 1 and maxUnavailable to 0, ensuring new pods start before old pods terminate. Configure readiness probes with 60-second initialDelaySeconds to allow model loading, preventing traffic routing to unready pods. Deploy new model versions using blue-green strategy: create parallel deployment with v2 label, verify functionality via internal testing service, then switch ingress traffic atomically by updating service selector from version: v1 to version: v2. Use init containers for model preloading: download model weights to a shared PersistentVolume before the serving container starts, reducing pod startup time from 180 seconds to 30 seconds. Implement canary deployments for gradual rollouts: route 10% traffic to new version using Istio or NGINX Ingress weighted routing, monitor error rates and latency P95 for 30 minutes, incrementally increase to 50% then 100% over 2 hours. For configuration updates like temperature or max_tokens, use ConfigMaps with automatic reload: mount ConfigMap as volume, implement file watcher in application code to reload settings without pod restart. Handle GPU memory efficiently during updates by temporarily over-provisioning node pool: scale up 1 additional GPU node before deployment, allowing new pods to schedule on fresh capacity, then drain old nodes gracefully using kubectl drain with 600-second timeout. Expected downtime: zero with proper configuration, though users may experience 50-100ms latency increase during rollover periods. Monitor deployment progress using kubectl rollout status and handle failures explicitly: set progressDeadlineSeconds to 600 so the Deployment is marked as failed if the new ReplicaSet does not become ready within 10 minutes, then run kubectl rollout undo from your deployment pipeline, since Kubernetes does not roll back automatically.
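
The core rollout mechanics described above can be applied to the deployment from this guide with a few kubectl commands; a hedged sketch (the v1.1 image tag is illustrative):

# Pin the surge/availability settings so a new pod starts before an old one stops
kubectl patch deployment llama-7b-inference -n llm-inference --type merge -p \
    '{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxSurge":1,"maxUnavailable":0}}}}'

# Roll out a new model image, watch progress, and undo manually if it stalls
kubectl set image deployment/llama-7b-inference vllm=us-ashburn-1.ocir.io/$TENANCY_NAMESPACE/llama-vllm:v1.1 -n llm-inference
kubectl rollout status deployment/llama-7b-inference -n llm-inference
kubectl rollout undo deployment/llama-7b-inference -n llm-inference   # only if the rollout fails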

What storage options work best for LLM model weights on OKE, and how do I optimize for fast pod startup times across multiple nodes?

Optimal storage strategy for LLM model weights on OKE combines Block Volumes for single-node deployments and File Storage for multi-node scenarios. For models under 50GB like Llama 2 7B, use OCI Block Volumes with Ultra High Performance tier (100 IOPS/GB, 480 KB/s per GB throughput), mounting as PersistentVolume with ReadWriteOnce access mode. This configuration loads 20GB model in 15-20 seconds versus 60-90 seconds with standard performance tier. Configure volumeClaimTemplate with 200GB capacity, enabling Kubernetes dynamic provisioning and automatic attachment to GPU nodes. For multi-replica deployments across multiple nodes, use OCI File Storage (NFS) with ReadWriteMany access mode, allowing simultaneous model access from all pods. Create File Storage with 100GB capacity in same availability domain as worker nodes, mounting at /models path. Performance: File Storage delivers 6.4 GB/s throughput for parallel reads, loading Llama 2 13B (26GB) in 8-12 seconds across 5 simultaneous pods. Implement init containers for model caching: download models from OCIR to local NVMe storage during pod initialization, then symlink to application path. Alternative: Use DaemonSet to pre-cache models on all GPU nodes, storing in hostPath volume at /data/models, reducing pod startup to 5 seconds by eliminating network transfers. For models exceeding 100GB like Llama 2 70B, use OCI Object Storage with intelligent tiering: store base model in Object Storage Standard ($0.0255/GB/month), maintain 3 cached copies on File Storage ($0.025/GB/month) across availability domains for rapid access. Implement cache warming strategy: deploy Kubernetes CronJob running hourly, reading first 1GB of each model file to keep data in File Storage cache. Cost comparison for 100GB Llama 2 70B deployment: Block Volumes cost $12.75/month per node ($38.25 for 3 nodes) versus File Storage $2.50/month (shared), saving $35.75/month while improving pod startup parallelism. Monitor storage performance using Prometheus: track volume_read_bytes_total and volume_read_latency_seconds metrics, alerting when P95 latency exceeds 100ms.
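
As a concrete illustration of the single-node Block Volume approach described above, a PersistentVolumeClaim against OKE's Block Volume CSI storage class might look like the following sketch (oci-bv is the default OKE storage class name; the size is illustrative). The claim can then be mounted into the deployment at /models and vLLM pointed at the local path instead of downloading weights at startup:

kubectl create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
  namespace: llm-inference
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: oci-bv
  resources:
    requests:
      storage: 200Gi
EOF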