Scale LLMs Serverlessly on Container Apps

Deploy LLMs on Azure Container Apps with serverless scale-to-zero, KEDA autoscaling, and blue-green deployments to cut costs by up to 80%, eliminate cluster management, and pay only for actual usage in event-driven and variable workloads.

TL;DR

  • Scale-to-zero cuts costs by up to 80% compared to always-on AKS deployments
  • KEDA autoscaling responds to HTTP requests, queues, or custom metrics automatically
  • Blue-green deployments through revision management enable zero-downtime updates
  • A 7B model active 100 hours per month costs roughly $27, versus about $150 for an equivalent always-on AKS node

Introduction

Container Apps provides serverless container hosting with powerful scaling capabilities specifically designed for event-driven workloads. Deploy any containerized model without managing infrastructure. Scale automatically based on load including scale-to-zero when idle. Pay only when processing requests. Container Apps eliminates cluster management complexity while providing production-ready features.

The platform offers KEDA-based autoscaling triggered by HTTP requests, queue messages, or custom metrics. Built-in ingress and load balancing distribute traffic automatically, and managed certificates provide HTTPS without configuration. Deep integration with the Azure ecosystem simplifies authentication and monitoring.

Cost advantages become significant for variable workloads. Container Apps costs nothing while idle, whereas AKS requires roughly $73 per month for a Standard-tier control plane plus node costs that can exceed $1,000 per month. Organizations typically save 50-80% on workloads with variable traffic patterns.

This guide covers deployment to a Container Apps environment, advanced scaling configurations, blue-green deployments through revision management, cost optimization with scale-to-zero, and integration with Azure services. Container Apps works best for variable traffic patterns, batch processing workloads, development and staging environments, cost-sensitive deployments, teams without Kubernetes expertise, and event-driven inference scenarios.

Deployment Configuration

Deploy an LLM inference service to a Container Apps environment.

Create Container Apps environment:

# Create resource group
az group create \
--name ml-serverless \
--location eastus

# Create environment
az containerapp env create \
--name llm-environment \
--resource-group ml-serverless \
--location eastus \
--logs-destination log-analytics \
--logs-workspace-id "/subscriptions/.../workspaces/ml-logs"

Build and push container image:

# Dockerfile for vLLM inference
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3-pip
RUN pip3 install vllm transformers torch

COPY inference_server.py /app/
WORKDIR /app

EXPOSE 8000

CMD ["python3", "inference_server.py"]
# Build and push
docker build -t llama-inference:latest .
docker tag llama-inference:latest myregistry.azurecr.io/llama-inference:latest
az acr login --name myregistry
docker push myregistry.azurecr.io/llama-inference:latest
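
The Dockerfile copies an inference_server.py that this article does not show. A minimal sketch of what it might contain, assuming a plain standard-library HTTP server wrapped around vLLM's offline LLM API (the /generate route and JSON fields are illustrative):

import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

from vllm import LLM, SamplingParams

# Matches the environment variables passed at deployment time.
MODEL_PATH = os.environ.get("MODEL_PATH", "/models/llama-7b")
GPU_MEMORY_UTILIZATION = float(os.environ.get("GPU_MEMORY_UTILIZATION", "0.95"))

llm = LLM(model=MODEL_PATH, gpu_memory_utilization=GPU_MEMORY_UTILIZATION)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health endpoint used by smoke tests and readiness probes.
        if self.path == "/health":
            self._reply(200, {"status": "ok"})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        if self.path != "/generate":
            self._reply(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        params = SamplingParams(max_tokens=body.get("max_tokens", 256))
        outputs = llm.generate([body["prompt"]], params)
        self._reply(200, {"text": outputs[0].outputs[0].text})

    def _reply(self, status, payload):
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), Handler).serve_forever()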

Deploy container app:

az containerapp create \
--name llama-inference \
--resource-group ml-serverless \
--environment llm-environment \
--image myregistry.azurecr.io/llama-inference:latest \
--registry-server myregistry.azurecr.io \
--registry-identity system \
--target-port 8000 \
--ingress external \
--min-replicas 0 \
--max-replicas 10 \
--cpu 4 \
--memory 16Gi \
--env-vars \
MODEL_PATH="/models/llama-7b" \
GPU_MEMORY_UTILIZATION="0.95"
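
Once the app is deployed, a quick smoke test confirms the endpoint responds. The FQDN below is a placeholder (retrieve yours with az containerapp show -n llama-inference -g ml-serverless --query properties.configuration.ingress.fqdn), and the route matches the inference_server.py sketch above:

import json
import urllib.request

URL = "https://llama-inference.<environment-default-domain>/generate"
payload = json.dumps({"prompt": "Hello, world", "max_tokens": 64}).encode()
request = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})

# The first call after an idle period hits a cold start while a replica is
# scheduled and the model loads, so allow a generous timeout.
with urllib.request.urlopen(request, timeout=300) as response:
    print(json.loads(response.read()))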

Advanced Scaling Configuration

Configure autoscaling with multiple trigger types for responsive scaling behavior.

HTTP-based scaling responds to concurrent request load:

az containerapp update \
--name llama-inference \
--resource-group ml-serverless \
--min-replicas 0 \
--max-replicas 20 \
--scale-rule-name http-rule \
--scale-rule-type http \
--scale-rule-http-concurrency 5

Scaling behavior follows the concurrency target. With 1-5 concurrent requests the system runs 1 replica, with 6-10 requests it scales to 2 replicas, with 11-15 requests it scales to 3, and so on up to the configured maximum. Because min-replicas is 0, the app scales all the way to zero after an idle period with no traffic. Scaling up and down happens automatically as load changes; the sketch below illustrates the replica math.
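
A minimal sketch of how a concurrency-based HTTP rule maps load to replica counts (illustrative only; actual KEDA behavior also applies polling intervals and cooldown windows):

import math

def desired_replicas(concurrent_requests: int, target: int = 5,
                     min_replicas: int = 0, max_replicas: int = 20) -> int:
    # One replica per `target` concurrent requests, bounded by the replica limits.
    if concurrent_requests == 0:
        return min_replicas  # scale to zero when idle
    return min(max(math.ceil(concurrent_requests / target), 1), max_replicas)

for load in (3, 8, 14, 120):
    print(load, "concurrent ->", desired_replicas(load), "replicas")
# 3 -> 1, 8 -> 2, 14 -> 3, 120 -> 20 (capped by max-replicas)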

Queue-based scaling processes inference requests from Azure Storage Queue:

az containerapp update \
--name llama-batch \
--resource-group ml-serverless \
--min-replicas 0 \
--max-replicas 50 \
--scale-rule-name queue-rule \
--scale-rule-type azure-queue \
--scale-rule-metadata \
queueName=inference-requests \
queueLength=10 \
accountName=mystorageaccount \
--scale-rule-auth connection=storage-connection
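
With queueLength=10, KEDA targets roughly one replica per 10 pending messages, up to the 50-replica cap. A minimal sketch of a client that feeds the queue (assumes the azure-storage-queue package; the STORAGE_CONNECTION_STRING variable and message shape are illustrative):

import json
import os
import uuid

from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"],
    queue_name="inference-requests",
)

# Every pending message counts toward the queueLength target, so a burst of
# requests quickly raises the desired replica count.
for prompt in ["Summarize this report.", "Translate to French: good morning"]:
    queue.send_message(json.dumps({"id": str(uuid.uuid4()), "prompt": prompt}))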

Multi-criteria scaling combines HTTP and CPU triggers:

# HTTP scaling
az containerapp update \
--name llama-inference \
--resource-group ml-serverless \
--scale-rule-name http-rule \
--scale-rule-type http \
--scale-rule-http-concurrency 10

# CPU scaling
az containerapp update \
--name llama-inference \
--resource-group ml-serverless \
--scale-rule-name cpu-rule \
--scale-rule-type cpu \
--scale-rule-metadata type=Utilization value=70

Combined behavior scales up when either rule triggers and scales down only when all rules fall back below their thresholds, which makes the system more responsive to traffic spikes. The sketch below shows how the most demanding rule wins.
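
Conceptually, each rule proposes a replica count and the autoscaler takes the largest one. A rough illustration (hypothetical numbers; real KEDA/HPA scaling also smooths decisions over evaluation windows):

import math

def http_replicas(concurrent_requests: int, target: int = 10) -> int:
    return math.ceil(concurrent_requests / target)

def cpu_replicas(current_replicas: int, cpu_utilization: float, target_utilization: float = 70) -> int:
    # HPA-style proportional rule: desired = ceil(current * currentUtil / targetUtil)
    return math.ceil(current_replicas * cpu_utilization / target_utilization)

# 42 concurrent requests and 85% CPU across 3 replicas -> max(5, 4) = 5 replicas
print(max(http_replicas(42), cpu_replicas(3, 85.0)))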

Revision Management and Deployments

Deploy updates with zero downtime using the Container Apps revision system. Traffic splitting between revisions requires the app to run in multiple-revision mode (set once with az containerapp revision set-mode --mode multiple).

Create new revision:

# Deploy new version
az containerapp update \
--name llama-inference \
--resource-group ml-serverless \
--image myregistry.azurecr.io/llama-inference:v2 \
--revision-suffix v2

# Split traffic for canary deployment
az containerapp ingress traffic set \
--name llama-inference \
--resource-group ml-serverless \
--revision-weight llama-inference--v1=90 llama-inference--v2=10

Blue-green deployment pattern:

# Deploy green version
az containerapp update \
--name llama-inference \
--resource-group ml-serverless \
--image myregistry.azurecr.io/llama-inference:v2 \
--revision-suffix green

# Test the green revision directly via its revision-specific FQDN (placeholder shown)
curl https://llama-inference--green.<environment-default-domain>/health

# Switch all traffic to green
az containerapp ingress traffic set \
--name llama-inference \
--resource-group ml-serverless \
--revision-weight llama-inference--green=100

# Deactivate blue revision
az containerapp revision deactivate \
--name llama-inference \
--resource-group ml-serverless \
--revision llama-inference--blue
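
Before cutting traffic over, it can help to verify the green revision automatically. A minimal sketch that polls the health endpoint (the revision FQDN is a placeholder; a cold replica may take a while to load the model):

import time
import urllib.request

GREEN_URL = "https://llama-inference--green.<environment-default-domain>/health"

def wait_healthy(url: str, attempts: int = 30, delay: float = 10.0) -> bool:
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # revision may still be starting or loading weights
        time.sleep(delay)
    return False

if not wait_healthy(GREEN_URL):
    raise SystemExit("green revision never became healthy; keep traffic on the current revision")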

Cost Optimization Strategies

Maximize savings with serverless architecture and scale-to-zero capability.

Enable scale-to-zero for development environment:

az containerapp update \
--name llama-dev \
--resource-group ml-serverless \
--min-replicas 0 \
--max-replicas 5 \
--scale-rule-name http-rule \
--scale-rule-type http \
--scale-rule-http-concurrency 3

A sample savings calculation: a traditional AKS deployment running 24/7 costs about $1,500 per month, while Container Apps with scale-to-zero costs about $300 per month at 20% utilization. That is a saving of $1,200 per month, an 80% cost reduction.

Right-size resources based on model requirements (allocations beyond 4 vCPUs and 8 GiB per replica require a Dedicated workload profile rather than the Consumption plan):

# Small instances for 7B models
az containerapp update \
--name llama-7b \
--resource-group ml-serverless \
--cpu 2 \
--memory 8Gi

# Larger instances for 70B models
az containerapp update \
--name llama-70b \
--resource-group ml-serverless \
--cpu 8 \
--memory 32Gi

Container Apps consumption pricing charges $0.000024 per vCPU-second, $0.000003 per GiB-second, and $0.40 per million requests after the first 2 million free. For a 7B model allocated 2 vCPUs and 8 GiB of memory, active 100 hours per month and serving 5 million requests, that works out to $17.28 for vCPU, $8.64 for memory, and $1.20 for requests, a total of $27.12 per month. An AKS node of similar size running 24/7 costs roughly $150 per month, so the savings come to $122.88, or about 82%.
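
A small calculator that reproduces the estimate above, using the consumption prices quoted in this article (check the Azure pricing page for current rates and free grants):

VCPU_PER_SECOND = 0.000024      # dollars per vCPU-second
GIB_PER_SECOND = 0.000003       # dollars per GiB-second
PER_MILLION_REQUESTS = 0.40     # dollars per million requests beyond the first 2M

def monthly_cost(vcpus: float, gib: float, active_hours: float, million_requests: float):
    seconds = active_hours * 3600
    cpu = vcpus * seconds * VCPU_PER_SECOND
    mem = gib * seconds * GIB_PER_SECOND
    reqs = max(million_requests - 2, 0) * PER_MILLION_REQUESTS
    return round(cpu, 2), round(mem, 2), round(reqs, 2), round(cpu + mem + reqs, 2)

# 2 vCPUs, 8 GiB, active 100 hours, 5M requests -> (17.28, 8.64, 1.2, 27.12)
print(monthly_cost(2, 8, 100, 5))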

Azure Service Integration

Connect Container Apps to Azure ecosystem for storage, queuing, and security.

Mount Azure Files for model storage. The file share is registered on the environment first; the volume and its /models mount path are then added to the app definition (for example through the app's YAML):

az containerapp env storage set \
--name llm-environment \
--resource-group ml-serverless \
--storage-name models-share \
--azure-file-account-name mystorageaccount \
--azure-file-account-key "storage-key" \
--azure-file-share-name models-share \
--access-mode ReadOnly

Azure Queue integration for batch processing:

import json
import os

from azure.storage.queue import QueueClient

# Connection string and queue names are illustrative; the connection string is
# read from an environment variable here.
connection_string = os.environ["STORAGE_CONNECTION_STRING"]

queue_client = QueueClient.from_connection_string(
    connection_string,
    queue_name="inference-requests"
)
result_queue = QueueClient.from_connection_string(
    connection_string,
    queue_name="inference-results"
)

while True:
    messages = queue_client.receive_messages(messages_per_page=10)

    for message in messages:
        request = json.loads(message.content)

        # Process inference; `model` is the LLM loaded elsewhere in the worker
        result = model.generate(request["prompt"])

        # Store result
        result_queue.send_message(json.dumps({
            "request_id": request["id"],
            "result": result
        }))

        # Delete processed message
        queue_client.delete_message(message)

Enable managed identity for secure access:

# Enable managed identity
az containerapp identity assign \
--name llama-inference \
--resource-group ml-serverless \
--system-assigned

# Grant Key Vault permissions
az keyvault set-policy \
--name myvault \
--object-id <identity-principal-id> \
--secret-permissions get list
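
From inside the app, the managed identity can then fetch secrets without any stored credentials. A short sketch (assumes the azure-identity and azure-keyvault-secrets packages; vault and secret names are illustrative):

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential resolves to the container app's system-assigned identity at runtime.
credential = DefaultAzureCredential()
client = SecretClient(vault_url="https://myvault.vault.azure.net", credential=credential)

storage_connection = client.get_secret("storage-connection").value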

Monitor with Application Insights:

import logging

from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)  # the default WARNING level would drop the info record below
logger.addHandler(AzureLogHandler(
    connection_string="InstrumentationKey=your-key"
))

# Log inference metrics
logger.info("Inference completed", extra={
    "custom_dimensions": {
        "latency_ms": latency,
        "tokens_generated": tokens,
        "model": "llama-7b"
    }
})

View logs with Azure CLI:

az containerapp logs show \
--name llama-inference \
--resource-group ml-serverless \
--follow \
--tail 100

Conclusion

Azure Container Apps delivers serverless LLM deployment with automatic scaling and zero infrastructure management. Scale-to-zero eliminates costs during idle periods, KEDA autoscaling responds to HTTP, queue, and custom metrics, and revision management enables blue-green deployments without downtime. Organizations achieve 50-80% cost savings compared to traditional Kubernetes for variable workloads.

Container Apps works best for CPU-based inference with small to medium models, development environments, and batch processing. Start with basic HTTP scaling, enable scale-to-zero for cost optimization, implement queue-based scaling for batch workloads, and monitor with Application Insights. Your serverless deployment scales automatically while minimizing costs.