Deploy Production LLMs on OKE Kubernetes

Deploy LLMs on Oracle Kubernetes Engine with GPU support. Complete guide covers OKE cluster setup, GPU nodes, vLLM deployments, auto-scaling, and monitoring patterns.

TL;DR

  • Zero control plane costs versus $73/month on AWS EKS and Azure AKS
  • 3x VM.GPU.A10.1 nodes handle 200-300 requests/min at $3,395/month total
  • HPA scales pods on CPU/memory with 5-minute scale-down stabilization
  • NVIDIA DCGM Exporter tracks GPU utilization, temperature, and power draw

This guide covers OKE cluster setup from zero to production, GPU node pool configuration with A10 and A100 instances, and production-grade deployments using vLLM.

Kubernetes provides the orchestration layer needed for scalable LLM inference. OKE delivers fully managed Kubernetes clusters with zero control plane costs and native OCI service integration. Deploy containerized LLM workloads that scale from 2 to 20+ GPU nodes based on traffic demand.

OKE with A10 and A100 GPU node pools running vLLM for 7B to 70B+ models.

Learn production patterns including:

  • GPU device plugins for proper resource allocation
  • Horizontal pod autoscaling based on utilization metrics
  • Zero-downtime rolling deployments for model updates
  • Monitoring stacks using Prometheus and Grafana

OKE clusters handle 200-300 requests per minute on small deployments and scale to thousands of requests per minute for enterprise workloads.

OKE Architecture Overview

Oracle Kubernetes Engine provides a fully managed Kubernetes service optimized for GPU workloads. OKE integrates with OCI networking, storage, and security services for enterprise-grade deployments.

Key Components:

  • Control Plane: Fully managed Kubernetes masters (free)
  • Worker Nodes: GPU-enabled compute instances
  • Container Registry (OCIR): Private Docker registry
  • Load Balancer: Managed ingress with SSL termination
  • Block Storage: Persistent volumes for model storage
  • File Storage: Shared NFS for multi-pod access

Architecture Benefits:

  • Zero control plane costs
  • Native integration with OCI services
  • Automatic OS patching and updates
  • Support for mixed CPU/GPU node pools
  • Regional and multi-AD deployments
  • Pod Security Admission for enforcing pod security standards (PodSecurityPolicy was removed in Kubernetes 1.25)

Create Production OKE Cluster

Set up a production-ready Kubernetes cluster with high availability.

# Create VCN for cluster and capture its OCID for the commands that follow
VCN_ID=$(oci network vcn create \
    --compartment-id $COMPARTMENT_ID \
    --display-name llm-vcn \
    --cidr-blocks '["10.0.0.0/16"]' \
    --dns-label llmvcn \
    --query 'data.id' --raw-output)

# Create subnets (control plane, workers, load balancer) and capture their OCIDs
CONTROL_PLANE_SUBNET_ID=$(oci network subnet create \
    --compartment-id $COMPARTMENT_ID \
    --vcn-id $VCN_ID \
    --display-name control-plane-subnet \
    --cidr-block 10.0.1.0/24 \
    --dns-label k8sapi \
    --query 'data.id' --raw-output)

WORKER_SUBNET_ID=$(oci network subnet create \
    --compartment-id $COMPARTMENT_ID \
    --vcn-id $VCN_ID \
    --display-name worker-subnet \
    --cidr-block 10.0.10.0/24 \
    --dns-label workers \
    --query 'data.id' --raw-output)

LB_SUBNET_ID=$(oci network subnet create \
    --compartment-id $COMPARTMENT_ID \
    --vcn-id $VCN_ID \
    --display-name loadbalancer-subnet \
    --cidr-block 10.0.20.0/24 \
    --dns-label loadbalancer \
    --query 'data.id' --raw-output)

# Create OKE cluster
oci ce cluster create \
    --compartment-id $COMPARTMENT_ID \
    --name llm-production-cluster \
    --kubernetes-version v1.28.2 \
    --vcn-id $VCN_ID \
    --endpoint-subnet-id $CONTROL_PLANE_SUBNET_ID \
    --service-lb-subnet-ids "[$LB_SUBNET_ID]" \
    --cluster-pod-network-options '[{
        "cni-type": "FLANNEL_OVERLAY"
    }]' \
    --options '{
        "service-lb-config": {
            "subnet-ids": ["'$LB_SUBNET_ID'"]
        },
        "kubernetes-network-config": {
            "pods-cidr": "10.244.0.0/16",
            "services-cidr": "10.96.0.0/16"
        }
    }' \
    --wait-for-state ACTIVE

# Get kubeconfig
oci ce cluster create-kubeconfig \
    --cluster-id $CLUSTER_ID \
    --file ~/.kube/config \
    --region us-ashburn-1 \
    --token-version 2.0.0

# Verify cluster access
kubectl cluster-info
kubectl get nodes

Cluster creation time: 7-10 minutes

Configure GPU Node Pools

Add GPU-enabled worker nodes with optimized configurations.

Create GPU Node Pool:

# gpu-startup.sh - Node initialization script
cat << 'EOF' > gpu-startup.sh
#!/bin/bash
# Install NVIDIA drivers
dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
dnf install -y nvidia-driver-535 nvidia-utils-535

# Install NVIDIA Container Toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | tee /etc/yum.repos.d/nvidia-container-toolkit.repo
dnf install -y nvidia-container-toolkit
# OKE worker nodes run CRI-O rather than Docker, so configure and restart CRI-O
nvidia-ctk runtime configure --runtime=crio
systemctl restart crio

# Verify GPU
nvidia-smi
EOF

# Create node pool with A100 GPUs
oci ce node-pool create \
    --cluster-id $CLUSTER_ID \
    --compartment-id $COMPARTMENT_ID \
    --name gpu-a100-pool \
    --node-shape VM.GPU.A100.1 \
    --node-source-details '{
        "source-type": "IMAGE",
        "image-id": "'$IMAGE_ID'"
    }' \
    --size 3 \
    --placement-configs '[{
        "availability-domain": "US-ASHBURN-AD-1",
        "subnet-id": "'$WORKER_SUBNET_ID'"
    }]' \
    --node-config-details '{
        "size": 3,
        "placement-configs": [{
            "availability-domain": "US-ASHBURN-AD-1",
            "subnet-id": "'$WORKER_SUBNET_ID'"
        }],
        "node-metadata": {
            "user_data": "'$(base64 -w 0 gpu-startup.sh)'"
        }
    }' \
    --node-shape-config '{
        "ocpus": 15,
        "memory-in-gbs": 240
    }' \
    --wait-for-state ACTIVE

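# Create node pool for A10 GPUs (fixed size of 2; node-level autoscaling
# additionally needs the Kubernetes Cluster Autoscaler, which is not shown here)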
oci ce node-pool create \
    --cluster-id $CLUSTER_ID \
    --compartment-id $COMPARTMENT_ID \
    --name gpu-a10-autoscale \
    --node-shape VM.GPU.A10.1 \
    --size 2 \
    --placement-configs '[{
        "availability-domain": "US-ASHBURN-AD-1",
        "subnet-id": "'$WORKER_SUBNET_ID'"
    }]' \
    --node-config-details '{
        "size": 2,
        "is-pv-encryption-in-transit-enabled": true
    }' \
    --node-shape-config '{
        "ocpus": 15,
        "memory-in-gbs": 240
    }'

Install NVIDIA Device Plugin:

# Deploy NVIDIA device plugin
kubectl create -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
EOF

# Verify GPU nodes (the instance-type label carries the OCI shape name, not "GPU")
kubectl get nodes -l node.kubernetes.io/instance-type=VM.GPU.A10.1
kubectl describe nodes | grep -A 10 "Allocatable"
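
The same check can be scripted. The sketch below is one possible approach, not part of the original guide: it parses the output of `kubectl get nodes -o json` and reports how many `nvidia.com/gpu` devices each node advertises as allocatable. The helper names `gpu_allocatable` and `fetch_nodes` are hypothetical, and `fetch_nodes` assumes `kubectl` is on the PATH and already pointed at the cluster.

```python
import json
import subprocess


def gpu_allocatable(node_list: dict) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count (0 if none advertised)."""
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Extended resources such as GPUs are reported as integer strings.
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


def fetch_nodes() -> dict:
    """Run kubectl and return the parsed node list (requires cluster access)."""
    completed = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)
```

Calling `gpu_allocatable(fetch_nodes())` after the device plugin DaemonSet is running should show a non-zero count for every GPU node; a zero usually means the plugin pod on that node has not registered the device yet.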

Deploy LLM Workloads with vLLM

Production deployment of Llama 2 7B using vLLM for high-throughput inference.

Build Container Image:

# Dockerfile
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

WORKDIR /app

# Install Python and dependencies
# curl is included because the inference smoke test runs it inside the pod
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install vLLM and dependencies
# (torch is pulled in by vLLM at the version it requires; a separate pin can conflict)
RUN pip install --no-cache-dir \
    vllm==0.2.7 \
    transformers==4.36.0 \
    fastapi==0.109.0 \
    uvicorn==0.27.0

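# Download model (or mount from storage)
# Note: Llama 2 weights are gated on Hugging Face, so this step needs authenticated access at build time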
RUN python3 -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
    AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf'); \
    AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')"

COPY serve.py /app/

EXPOSE 8000

CMD ["python3", "serve.py"]

serve.py:

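import logging
import uuid

import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI()

# Initialize vLLM's async engine. The synchronous LLM class would block the
# event loop (including /health) during generation and would not batch
# concurrent requests together.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="meta-llama/Llama-2-7b-chat-hf",
        tensor_parallel_size=1,
        gpu_memory_utilization=0.95,
        max_num_batched_tokens=8192,
        max_num_seqs=256,
    )
)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.8
    top_p: float = 0.95

@app.post("/generate")
async def generate(request: GenerateRequest):
    try:
        sampling_params = SamplingParams(
            temperature=request.temperature,
            top_p=request.top_p,
            max_tokens=request.max_tokens
        )
        # The engine yields partial outputs; keep the last one as the final result
        final_output = None
        async for output in engine.generate(
            request.prompt, sampling_params, uuid.uuid4().hex
        ):
            final_output = output
        completion = final_output.outputs[0]
        return {
            "text": completion.text,
            "tokens": len(completion.token_ids)
        }
    except Exception as e:
        logger.error(f"Generation error: {e}")
        raise HTTPException(status_code=500, detail=str(e))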

@app.get("/health")
async def health():
    return {"status": "healthy"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
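
The engine settings in serve.py (gpu_memory_utilization=0.95, max_num_seqs=256) interact with how much GPU memory remains for the KV cache once the weights are loaded. The following back-of-the-envelope sketch is an illustration only. It assumes fp16 weights, a nominal 24 GB A10, a nominal 7 billion parameters, and the Llama 2 7B shape of 32 layers with hidden size 4096; it ignores activations and CUDA overhead, so real headroom is smaller. The function names are hypothetical.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, hidden_size: int, dtype_bytes: int = 2) -> int:
    """Bytes of key + value cache one token occupies across all layers."""
    return 2 * num_layers * hidden_size * dtype_bytes


def kv_cache_token_budget(
    gpu_gib: float,
    gpu_memory_utilization: float,
    num_params: int,
    num_layers: int,
    hidden_size: int,
    dtype_bytes: int = 2,
) -> int:
    """Approximate number of tokens that fit in the KV cache after weights."""
    usable_bytes = gpu_gib * GIB * gpu_memory_utilization
    weight_bytes = num_params * dtype_bytes
    free_bytes = usable_bytes - weight_bytes
    return int(free_bytes // kv_bytes_per_token(num_layers, hidden_size, dtype_bytes))


# Assumed figures: 24 GB A10, 0.95 utilization, 7e9 params, 32 layers, hidden 4096.
tokens = kv_cache_token_budget(24, 0.95, 7_000_000_000, 32, 4096)
print(f"Approximate KV cache budget: {tokens} tokens")
print(f"Average tokens per sequence at 256 concurrent sequences: {tokens // 256}")
```

Under these assumptions the budget comes out at roughly 20,000 tokens, so 256 simultaneous sequences would average well under 100 tokens each. In practice vLLM queues requests that do not fit, which is why max_num_seqs acts as an upper bound rather than a guaranteed concurrency level.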

Build and push to OCIR:

# Login to Oracle Container Registry
docker login us-ashburn-1.ocir.io -u $TENANCY_NAMESPACE/oracleidentitycloudservice/$USERNAME

# Build image
docker build -t llama-vllm:v1.0 .

# Tag and push
docker tag llama-vllm:v1.0 us-ashburn-1.ocir.io/$TENANCY_NAMESPACE/llama-vllm:v1.0
docker push us-ashburn-1.ocir.io/$TENANCY_NAMESPACE/llama-vllm:v1.0

Kubernetes Deployment:

# llama-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llm-inference
---
apiVersion: v1
kind: Secret
metadata:
  name: ocir-secret
  namespace: llm-inference
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded-docker-config>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-7b-inference
  namespace: llm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llama-7b
      version: v1
  template:
    metadata:
      labels:
        app: llama-7b
        version: v1
    spec:
      imagePullSecrets:
      - name: ocir-secret
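      nodeSelector:
        # The instance-type label holds the OCI shape name, not a generic "GPU" value
        node.kubernetes.io/instance-type: VM.GPU.A10.1
      tolerations:
      # Harmless if the GPU nodes carry no nvidia.com/gpu taint
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule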
      containers:
      - name: vllm
        image: us-ashburn-1.ocir.io/namespace/llama-vllm:v1.0
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: 32Gi
            cpu: 8
          limits:
            nvidia.com/gpu: 1
            memory: 48Gi
            cpu: 12
        ports:
        - containerPort: 8000
          name: http
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 5
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: VLLM_LOGGING_LEVEL
          value: "INFO"
---
apiVersion: v1
kind: Service
metadata:
  name: llama-7b-service
  namespace: llm-inference
spec:
  selector:
    app: llama-7b
  ports:
  - port: 80
    targetPort: 8000
    protocol: TCP
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llama-7b-ingress
  namespace: llm-inference
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  # ingressClassName replaces the deprecated kubernetes.io/ingress.class annotation
  ingressClassName: nginx
  tls:
  - hosts:
    - llm.example.com
    secretName: llm-tls-cert
  rules:
  - host: llm.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: llama-7b-service
            port:
              number: 80

Deploy to cluster:

# Apply configurations
kubectl apply -f llama-deployment.yaml

# Verify deployment
kubectl get pods -n llm-inference
kubectl get svc -n llm-inference
kubectl logs -n llm-inference -l app=llama-7b --tail=50

# Test inference
POD=$(kubectl get pod -n llm-inference -l app=llama-7b -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n llm-inference $POD -- curl -X POST http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "What is machine learning?", "max_tokens": 100}'
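
For testing from outside the pod, a small client is often more convenient than `kubectl exec`. The sketch below uses only the Python standard library and targets the `/generate` endpoint defined in serve.py. It assumes the service has been made reachable locally, for example with `kubectl port-forward -n llm-inference svc/llama-7b-service 8080:80`; the local port and the helper names are illustrative assumptions.

```python
import json
import urllib.request


def build_generate_request(
    base_url: str, prompt: str, max_tokens: int = 100
) -> urllib.request.Request:
    """Build the POST request for the /generate endpoint without sending it."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(
    base_url: str, prompt: str, max_tokens: int = 100, timeout: float = 60.0
) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_generate_request(base_url, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the port-forward running, `generate("http://localhost:8080", "What is machine learning?")` should return a dictionary with the `text` and `tokens` keys that serve.py produces.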

Horizontal Pod Autoscaling

Auto-scale deployments based on CPU and memory metrics.

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-7b-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-7b-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 2
        periodSeconds: 30
      selectPolicy: Max

Install Metrics Server:

# Deploy metrics server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Apply HPA
kubectl apply -f hpa.yaml

# Monitor autoscaling
kubectl get hpa -n llm-inference -w

HPA for GPU workloads requires careful tuning for the traffic pattern. CPU and memory utilization are only proxies for inference load, because most of the work happens on the GPU.

Scale-up: the larger of 100% or 2 pods every 30s, after a 60s stabilization window (fast response). Scale-down: at most 50% every 60s with a 300s stabilization window (prevents thrashing).

Tuning checklist:

  • Configure HPA with CPU/memory thresholds – 70% CPU, 80% memory typical for LLM inference
  • Set stabilization windows – Prevent scale-down thrashing during traffic fluctuations
  • Install metrics server – Required for resource-based autoscaling
  • Implement custom metrics – Scale based on queue depth or request latency
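
To get a feel for how the hpa.yaml settings behave, the sketch below applies the documented HPA formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and then caps the result with the scale-up and scale-down policies from the manifest. It is a simplified model, not the controller's actual implementation: it ignores the default tolerance band, the stabilization windows, and the per-period bookkeeping, and the rounding of the scale-down limit is an assumption.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float) -> int:
    """Core HPA formula: scale proportionally to the metric ratio, rounding up."""
    return math.ceil(current * current_util / target_util)


def apply_behavior(
    current: int, desired: int, min_replicas: int = 2, max_replicas: int = 10
) -> int:
    """Cap one scaling step using the policies from hpa.yaml (simplified)."""
    if desired > current:
        # scaleUp: Percent 100 or Pods 2 per period, selectPolicy Max.
        ceiling = max(current * 2, current + 2)
        desired = min(desired, ceiling)
    elif desired < current:
        # scaleDown: remove at most 50% per period (rounding is an assumption).
        floor = math.floor(current * 0.5)
        desired = max(desired, floor)
    return max(min_replicas, min(max_replicas, desired))


# 3 replicas at 95% CPU against the 70% target.
step_up = apply_behavior(3, desired_replicas(3, 95, 70))
# 8 replicas at 20% CPU: the formula asks for 3, the policy allows only half.
step_down = apply_behavior(8, desired_replicas(8, 20, 70))
print(step_up, step_down)
```

The second case shows why the 50% policy matters: even when load drops sharply, the replica count steps down gradually, which keeps warm GPU pods available if traffic returns.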

Monitoring with Prometheus and Grafana

Track inference performance and GPU utilization.

OKE LLM dashboard: 45 req/s, GPU utilization A100 78% / A10 82%, p95 latency 320ms, 4 active replicas, GPU memory usage trend.

Deploy Prometheus Stack:

# Add Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
# Set GRAFANA_ADMIN_PASSWORD to a strong secret first; never use a default such as admin123
helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace monitoring \
    --create-namespace \
    --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
    --set grafana.adminPassword="$GRAFANA_ADMIN_PASSWORD"

# Install NVIDIA DCGM Exporter for GPU metrics
kubectl create -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
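      # Match GPU nodes by OCI shape name (the instance-type label is never just "GPU")
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["VM.GPU.A10.1", "VM.GPU.A100.1"]
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule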
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
        ports:
        - containerPort: 9400
          name: metrics
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: pod-gpu-resources
          readOnly: true
          mountPath: /var/lib/kubelet/pod-resources
      volumes:
      - name: pod-gpu-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
EOF

# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
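
GPU metrics can also be read straight from the Prometheus HTTP API, which is useful for quick checks or custom alert scripts. The sketch below queries the DCGM metric `DCGM_FI_DEV_GPU_UTIL` through the `/api/v1/query` endpoint. It assumes Prometheus is already scraping the exporter and has been port-forwarded to localhost:9090; the `Hostname` and `gpu` label names are those DCGM Exporter normally attaches, but treat them, and the helper names, as assumptions to verify against your own metrics.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_gpu_utilization(response: dict) -> dict:
    """Map (hostname, gpu index) -> utilization percent from a vector result."""
    if response.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {response}")
    utilization = {}
    for series in response["data"]["result"]:
        labels = series["metric"]
        key = (labels.get("Hostname", ""), labels.get("gpu", ""))
        # Each sample is [unix_timestamp, "value-as-string"].
        utilization[key] = float(series["value"][1])
    return utilization


def gpu_utilization(base_url: str = "http://localhost:9090") -> dict:
    """Fetch current GPU utilization for every GPU Prometheus knows about."""
    url = instant_query_url(base_url, "DCGM_FI_DEV_GPU_UTIL")
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_gpu_utilization(json.load(response))
```

An empty result from `gpu_utilization()` usually means Prometheus has no scrape target for the exporter yet, rather than that the GPUs are idle.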

Cost Analysis

OKE cluster costs for LLM deployments.

Configuration | Node Type | Nodes | Node Cost (monthly, all nodes) | OKE Control Plane | Load Balancer | Storage | Total Monthly | Requests/Min
Small Production (Llama 2 7B) | VM.GPU.A10.1 | 3 | $3,285 | $0 | $60 | $50 (1 TB) | $3,395 | 200-300
Medium Production (Llama 2 13B) | VM.GPU.A100.1 | 3 | $6,459 | $0 | $60 | $100 (2 TB) | $6,619 | 150-200
Large Production (Llama 2 70B) | BM.GPU.A100-v2.8 | 2 | $34,456 | $0 | $150 | $250 (5 TB) | $34,856 | 40-80
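
The totals in the table are simple sums, and dividing them by throughput gives a rough cost per request. The sketch below reproduces the arithmetic using the table's figures. The cost-per-million calculation assumes a 30-day month and sustained traffic at the quoted rate around the clock, which is an optimistic assumption; real utilization is lower and the effective cost per request correspondingly higher.

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # assumes a 30-day month

# Monthly figures in USD, copied from the cost table.
CONFIGS = {
    "small": {"nodes": 3285, "control_plane": 0, "load_balancer": 60, "storage": 50},
    "medium": {"nodes": 6459, "control_plane": 0, "load_balancer": 60, "storage": 100},
    "large": {"nodes": 34456, "control_plane": 0, "load_balancer": 150, "storage": 250},
}


def monthly_total(config: dict) -> int:
    """Sum every monthly cost component of one configuration."""
    return sum(config.values())


def cost_per_million_requests(total_monthly: float, requests_per_minute: float) -> float:
    """USD per one million requests at a sustained request rate."""
    requests_per_month = requests_per_minute * MINUTES_PER_MONTH
    return total_monthly / requests_per_month * 1_000_000


for name, config in CONFIGS.items():
    print(f"{name}: ${monthly_total(config):,} per month")

# Small configuration at the top of its 200-300 requests/min range.
small_best_case = cost_per_million_requests(monthly_total(CONFIGS["small"]), 300)
print(f"small, 300 req/min sustained: ${small_best_case:.2f} per million requests")
```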

Conclusion

Oracle Kubernetes Engine provides production-ready infrastructure for scalable LLM deployments. OKE eliminates control plane costs while delivering managed Kubernetes masters and automatic OS patching. GPU node pools with A10 and A100 instances support models from 7B to 70B+ parameters with horizontal scaling from 2 to 20+ nodes.

Container-based deployments enable zero-downtime updates through rolling deployment strategies and rapid scaling with sub-5-minute node provisioning. Horizontal pod autoscaling adjusts replica counts based on CPU and memory utilization automatically.

Feature | Benefit
Control plane cost | $0 (free)
Node provisioning | Sub-5 minutes
Horizontal scaling | 2 to 20+ nodes
GPU support | A10 and A100 instances
Model support | 7B to 70B+ parameters
Deployment strategy | Zero-downtime rolling updates
Autoscaling | Horizontal Pod Autoscaler (CPU, memory metrics)

Monitor GPU utilization and inference performance using Prometheus and Grafana dashboards. Block Volumes and File Storage suit different access patterns and pod startup times. Start with a small 3-node cluster for development, then scale out to larger production configurations as traffic grows.


Frequently Asked Questions

1. OKE vs. AWS EKS vs. Azure AKS – Key Differences & Costs

Category | OKE (OCI) | AWS EKS | Azure AKS
Control plane | $0 | $73/month | $73/month
A100 GPU node | $2,153/month | $2,920/month | $2,750/month
Network egress | $0.0085/GB | $0.09/GB | ~$0.08/GB
Volume attach time | 10 seconds | 45-60 seconds | 30-45 seconds
Container registry | Free (unlimited) | $0.10/GB storage | $0.10/GB storage
Load balancer | $60/month | $22/month + variable | ~$20/month

3-node A100 cluster monthly cost:

  • OKE: $6,619
  • AWS: $9,133
  • Azure: $8,523

Annual savings vs. AWS: $30,168
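
The annual savings figure follows directly from the monthly totals listed above. A short sketch of the arithmetic, using only the numbers quoted in this comparison (the function name is illustrative):

```python
# Monthly totals in USD for a 3-node A100 cluster, as quoted above.
MONTHLY_TOTAL = {"OKE": 6619, "AWS": 9133, "Azure": 8523}


def annual_savings(cheaper: str, pricier: str) -> int:
    """Twelve months of the difference between two monthly totals."""
    return (MONTHLY_TOTAL[pricier] - MONTHLY_TOTAL[cheaper]) * 12


print(f"OKE vs. AWS:   ${annual_savings('OKE', 'AWS'):,} per year")
print(f"OKE vs. Azure: ${annual_savings('OKE', 'Azure'):,} per year")
```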

Verdict: OKE wins on cost. EKS/AKS win on ecosystem (SageMaker, Bedrock, Azure OpenAI).

2. Zero-Downtime Deployments for LLMs

Strategy | Configuration | Best For
Rolling update | maxSurge: 1, maxUnavailable: 0 | Minor version updates
Blue-green | Parallel v1/v2, atomic service selector switch | Major model upgrades
Canary | Weighted routing (10% → 50% → 100%) | Riskier changes

Key requirements:

  • Readiness probe: initialDelaySeconds: 60 (model loading time)
  • Init container to pre-download weights to shared PV (180s → 30s startup)
  • ConfigMap + file watcher for config reloads (no restart)
  • Over-provision 1 GPU node → drain old nodes with kubectl drain --timeout=600s
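  • Failure detection: progressDeadlineSeconds: 600 marks a stalled rollout as failed; Kubernetes does not roll back on its own, so follow up with kubectl rollout undo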

Result: Zero downtime (50-100ms latency increase during rollover)
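
The rolling-update settings listed above map onto a few fields of the Deployment spec. As a sketch, the script below builds a merge patch containing those fields and prints it as JSON. The field names are the standard Deployment API fields; the values are the ones quoted in this section, and the function name is hypothetical.

```python
import json


def rolling_update_patch(
    max_surge: int = 1,
    max_unavailable: int = 0,
    progress_deadline_seconds: int = 600,
) -> dict:
    """Build a Deployment merge patch for a surge-first rolling update."""
    return {
        "spec": {
            # After this many seconds without progress the rollout is marked failed.
            "progressDeadlineSeconds": progress_deadline_seconds,
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {
                    # Start one new pod before any old pod is removed.
                    "maxSurge": max_surge,
                    "maxUnavailable": max_unavailable,
                },
            },
        }
    }


print(json.dumps(rolling_update_patch()))
```

Saved as, say, rollout_patch.py, the output could be applied with `kubectl patch deployment llama-7b-inference -n llm-inference --type merge -p "$(python3 rollout_patch.py)"`. Because maxSurge: 1 schedules an extra GPU pod during the rollout, one spare GPU must be available, which is the reason for the over-provisioned node mentioned in the list.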

3. Storage for LLM Model Weights – Fast Pod Startup

Scenario | Storage | Performance
Single node, model <50GB | Block Volume (Ultra High Performance) | 20GB in 15-20s
Multi-node, shared access | File Storage (NFS, ReadWriteMany) | 26GB in 8-12s (5 pods)
Max performance | DaemonSet pre-cache to hostPath (NVMe) | Startup in 5s
Model >100GB | Object Storage + File Storage cache | Hourly cache warming

Cost comparison (100GB model):

  • Block Volumes (3 nodes): $38.25/month
  • File Storage (shared): $2.50/month (saves $35.75)

Implementation tips:

  • Init container for model caching from OCIR to local NVMe
  • CronJob to warm cache (read first 1GB of each model file hourly)
  • Monitor with Prometheus: volume_read_latency_seconds – alert if P95 > 100ms
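
The cache-warming tip can be implemented as a small script that a CronJob runs hourly. The sketch below reads up to the first 1 GiB of every file under a model directory so the data stays in the storage cache. The directory path in the usage note, the chunk size, and the function names are assumptions for illustration.

```python
from pathlib import Path

ONE_GIB = 1 << 30


def warm_file(path: Path, max_bytes: int = ONE_GIB, chunk_size: int = 8 << 20) -> int:
    """Read up to max_bytes from the start of a file; return bytes actually read."""
    read_so_far = 0
    with open(path, "rb") as handle:
        while read_so_far < max_bytes:
            chunk = handle.read(min(chunk_size, max_bytes - read_so_far))
            if not chunk:  # end of file reached before the limit
                break
            read_so_far += len(chunk)
    return read_so_far


def warm_directory(root: str, max_bytes: int = ONE_GIB) -> dict:
    """Warm every regular file under root; map file path -> bytes read."""
    return {
        str(path): warm_file(path, max_bytes)
        for path in sorted(Path(root).rglob("*"))
        if path.is_file()
    }
```

A CronJob container could call `warm_directory("/models")`, where `/models` is a hypothetical mount point for the shared File Storage volume, and log the returned byte counts to confirm that each file was touched.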

Best practice: File Storage for multi-node + DaemonSet caching for 5-second pod startup
