Automate LLM Deployments with Azure DevOps
Automate LLM deployments with Azure DevOps MLOps pipelines: CI/CD workflows, blue-green releases, load testing, and approval gates cut deployment time by 60%, reduce production errors by 75%, and ship reliable models to production with zero downtime.
TL;DR
- MLOps automation reduces deployment time by 60% while cutting production errors by 75%
- Blue-green deployments eliminate downtime with instant rollback capability
- Multi-stage pipelines with manual approval gates prevent bad releases to production
- Load testing validates P95 latency and 99% success rate before traffic switch
Introduction
Manual deployments break at scale. Teams waste hours on repetitive tasks while quality suffers. Azure DevOps brings DevOps practices to machine learning with automated pipelines that test, deploy, and monitor LLM models in production. This guide shows you how to build production-grade CI/CD workflows that automate training, testing, and deployment while maintaining quality and reliability.
Azure DevOps provides complete MLOps capabilities integrated with Azure ML: version control with Git, automated YAML pipelines, artifact management, and release gates. Track experiments, monitor models in production, and roll back failed deployments. Organizations using MLOps reduce deployment time by 60% while cutting production errors by 75%. This implementation guide covers build pipelines, multi-stage deployments, automated testing, and continuous monitoring for LLM workloads.
Pipeline Architecture and Repository Structure
A complete MLOps workflow spans source control, build, test, and deployment stages. The architecture includes Git repositories for code and configuration, build pipelines that test code and validate models, release pipelines that deploy to dev, staging, and production, a model registry in Azure ML for versioning, and monitoring systems that track deployment health.
Organize your repository for CI/CD automation:
llm-deployment/
├── models/
│   ├── llama-7b/
│   │   ├── config.json
│   │   └── requirements.txt
│   └── phi-4/
├── src/
│   ├── inference/
│   │   ├── server.py
│   │   └── utils.py
│   └── tests/
│       ├── test_inference.py
│       └── test_preprocessing.py
├── deployment/
│   ├── azure-ml/
│   │   ├── endpoint.yml
│   │   └── deployment.yml
│   └── docker/
│       └── Dockerfile
├── pipelines/
│   ├── azure-pipelines.yml
│   ├── build-pipeline.yml
│   └── deploy-pipeline.yml
└── tests/
    ├── integration/
    └── load/
This structure separates concerns and enables parallel pipeline execution. Model configurations stay isolated. Infrastructure code lives separately from application code. Test suites run independently.
Build and Test Pipeline
Validate code quality before deployment with automated testing and container builds.
# azure-pipelines.yml
trigger:
  branches:
    include:
      - main
      - develop
  paths:
    include:
      - src/**
      - models/**

variables:
  azureSubscription: 'Azure-ML-Service-Connection'
  resourceGroup: 'ml-production'
  workspace: 'ml-workspace'

stages:
  - stage: Build
    jobs:
      - job: Test
        pool:
          vmImage: 'ubuntu-latest'
        steps:
          - task: UsePythonVersion@0
            inputs:
              versionSpec: '3.10'
          - script: |
              pip install -r requirements.txt
              pip install pytest pytest-cov
            displayName: 'Install dependencies'
          - script: |
              pytest tests/ --cov=src --cov-report=xml --junitxml=test-results.xml
            displayName: 'Run tests'
          - task: PublishTestResults@2
            condition: succeededOrFailed()
            inputs:
              testResultsFormat: 'JUnit'
              testResultsFiles: '**/test-results.xml'

      - job: BuildContainer
        dependsOn: Test
        pool:
          vmImage: 'ubuntu-latest'
        steps:
          - task: Docker@2
            inputs:
              containerRegistry: $(acrName)  # Docker registry service connection, supplied as a pipeline variable
              repository: 'llama-inference'
              command: 'buildAndPush'
              Dockerfile: 'deployment/docker/Dockerfile'
              tags: |
                $(Build.BuildId)
                latest
Integration tests validate endpoint behavior:
# tests/integration/test_inference.py
import time

import pytest
import requests


def test_inference_latency(endpoint_url, api_key):
    """Test that inference completes within SLA."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "inputs": ["Explain quantum computing"],
        "parameters": {"max_tokens": 100}
    }

    start = time.time()
    response = requests.post(endpoint_url, headers=headers, json=payload)
    latency = (time.time() - start) * 1000

    assert response.status_code == 200
    assert latency < 1000, f"Latency {latency:.0f}ms exceeds SLA"
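The endpoint_url and api_key arguments are pytest fixtures. A minimal conftest.py sketch that supplies them, assuming the URL comes from a command-line option and the key from an environment variable populated by the pipeline (the option name and variable names are illustrative, not part of the pipeline above):
# tests/integration/conftest.py -- hypothetical fixture wiring for the test above
import os

import pytest


def pytest_addoption(parser):
    # Allow the scoring URL to be passed on the command line or via ENDPOINT_URL
    parser.addoption("--endpoint-url", action="store", default=os.environ.get("ENDPOINT_URL"))


@pytest.fixture
def endpoint_url(request):
    url = request.config.getoption("--endpoint-url")
    if not url:
        pytest.skip("No endpoint URL configured")
    return url


@pytest.fixture
def api_key():
    # Assumes the pipeline exports the key (e.g. from Key Vault) as AZURE_ML_API_KEY
    key = os.environ.get("AZURE_ML_API_KEY")
    if not key:
        pytest.skip("No API key configured")
    return key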
Multi-Stage Deployment Pipeline
Deploy through dev, staging, and production with automated gates and validation at each stage.
# deploy-pipeline.yml
trigger: none  # Manual trigger only

variables:
  azureSubscription: 'Azure-ML-Service-Connection'
  resourceGroup: 'ml-production'
  workspace: 'ml-workspace'

stages:
  - stage: DeployDev
    jobs:
      - deployment: DeployDevEndpoint
        environment: 'development'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureCLI@2
                  displayName: 'Deploy to Dev'
                  inputs:
                    azureSubscription: $(azureSubscription)
                    scriptType: 'bash'
                    scriptLocation: 'inlineScript'
                    inlineScript: |
                      az ml online-deployment create \
                        --file deployment/azure-ml/deployment.yml \
                        --endpoint-name llama-dev \
                        --resource-group $(resourceGroup) \
                        --workspace-name $(workspace) \
                        --set instance_count=1

  - stage: DeployStaging
    dependsOn: DeployDev
    condition: succeeded()
    jobs:
      - deployment: DeployStagingEndpoint
        environment: 'staging'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureCLI@2
                  displayName: 'Deploy to Staging'
                  inputs:
                    azureSubscription: $(azureSubscription)
                    scriptType: 'bash'
                    scriptLocation: 'inlineScript'
                    inlineScript: |
                      az ml online-deployment create \
                        --file deployment/azure-ml/deployment.yml \
                        --endpoint-name llama-staging \
                        --resource-group $(resourceGroup) \
                        --workspace-name $(workspace) \
                        --set instance_count=2
                - task: AzureCLI@2
                  displayName: 'Run Load Tests'
                  inputs:
                    azureSubscription: $(azureSubscription)
                    scriptType: 'bash'
                    scriptLocation: 'inlineScript'
                    inlineScript: |
                      python tests/load/load_test.py \
                        --endpoint llama-staging \
                        --duration 300

  - stage: DeployProduction
    dependsOn: DeployStaging
    jobs:
      - deployment: DeployProductionEndpoint
        environment: 'production'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureCLI@2
                  displayName: 'Blue-Green Deployment'
                  inputs:
                    azureSubscription: $(azureSubscription)
                    scriptType: 'bash'
                    scriptLocation: 'inlineScript'
                    inlineScript: |
                      # Deploy green version; new deployments receive no traffic by default
                      az ml online-deployment create \
                        --file deployment/azure-ml/deployment.yml \
                        --endpoint-name llama-prod \
                        --name green-$(Build.BuildId) \
                        --resource-group $(resourceGroup) \
                        --workspace-name $(workspace)
      - job: ApproveTrafficSwitch
        dependsOn: DeployProductionEndpoint
        pool: server  # ManualValidation only runs in agentless jobs
        timeoutInMinutes: 1440
        steps:
          - task: ManualValidation@0
            displayName: 'Approve Traffic Switch'
            inputs:
              notifyUsers: 'ml-ops@company.com'
              onTimeout: 'reject'
      - job: SwitchTraffic
        dependsOn: ApproveTrafficSwitch
        steps:
          - task: AzureCLI@2
            displayName: 'Switch Traffic'
            inputs:
              azureSubscription: $(azureSubscription)
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                az ml online-endpoint update \
                  --name llama-prod \
                  --resource-group $(resourceGroup) \
                  --workspace-name $(workspace) \
                  --traffic "green-$(Build.BuildId)=100"
Blue-green deployments eliminate downtime. Deploy new version with zero traffic. Run smoke tests against green deployment. Switch traffic gradually or all at once. Keep blue deployment running for instant rollback.
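Smoke tests can target the green deployment while it still receives 0% of live traffic by setting the azureml-model-deployment header, which Azure ML managed online endpoints use to route a request to a specific deployment. A minimal sketch; the script path, environment variables, and prompt are assumptions:
# tests/smoke/smoke_test_green.py -- hypothetical smoke test run before the traffic switch
import os
import sys

import requests


def smoke_test(endpoint_url: str, api_key: str, deployment_name: str) -> bool:
    """Send one request routed to a named deployment that holds 0% of endpoint traffic."""
    response = requests.post(
        endpoint_url,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            # Route to the green deployment even though traffic still points at blue
            "azureml-model-deployment": deployment_name,
        },
        json={"inputs": ["ping"], "parameters": {"max_tokens": 16}},
        timeout=30,
    )
    return response.status_code == 200


if __name__ == "__main__":
    # Assumes the scoring URL and key are exported by the pipeline (e.g. from Key Vault)
    ok = smoke_test(
        endpoint_url=os.environ["ENDPOINT_URL"],
        api_key=os.environ["AZURE_ML_API_KEY"],
        deployment_name=sys.argv[1],  # e.g. green-<build id>
    )
    sys.exit(0 if ok else 1)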
Load Testing and Performance Validation
Validate performance under realistic load before production release.
# tests/load/load_test.py
import concurrent.futures
import statistics
import time

import requests


class LoadTester:
    def __init__(self, endpoint_url, api_key):
        self.endpoint_url = endpoint_url
        self.api_key = api_key
        self.results = []

    def make_request(self):
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }
        payload = {
            "inputs": ["Explain machine learning"],
            "parameters": {"max_tokens": 100}
        }
        start = time.time()
        try:
            response = requests.post(
                self.endpoint_url,
                headers=headers,
                json=payload,
                timeout=30
            )
            latency = (time.time() - start) * 1000
            return {
                "status": response.status_code,
                "latency": latency,
                "success": response.status_code == 200
            }
        except requests.RequestException:
            return {"status": 0, "latency": 0, "success": False}

    def run_load_test(self, duration, concurrent_users):
        end_time = time.time() + duration
        with concurrent.futures.ThreadPoolExecutor(max_workers=concurrent_users) as executor:
            while time.time() < end_time:
                futures = [executor.submit(self.make_request) for _ in range(concurrent_users)]
                self.results.extend([f.result() for f in futures])
        self.print_results()

    def print_results(self):
        total = len(self.results)
        successful = sum(1 for r in self.results if r["success"])
        latencies = sorted(r["latency"] for r in self.results if r["success"])
        p95 = latencies[int(len(latencies) * 0.95)]

        print(f"Total requests: {total}")
        print(f"Successful: {successful} ({successful/total*100:.1f}%)")
        print(f"Mean latency: {statistics.mean(latencies):.2f}ms")
        print(f"P95 latency: {p95:.2f}ms")

        # Validate SLAs
        assert successful / total > 0.99, "Success rate below 99%"
        assert statistics.median(latencies) < 500, "Median latency exceeds 500ms"
        assert p95 < 1000, f"P95 latency {p95:.2f}ms exceeds 1000ms"
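The staging stage invokes this script as python tests/load/load_test.py --endpoint llama-staging --duration 300. A minimal command-line entry point sketch to match that call; the --concurrency option, the environment variables, and the URL fallback are assumptions, since the pipeline passes an endpoint name rather than a scoring URL:
# Hypothetical entry point for load_test.py, matching the pipeline invocation above
import argparse
import os

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Load test an Azure ML online endpoint")
    parser.add_argument("--endpoint", required=True, help="Endpoint name, e.g. llama-staging")
    parser.add_argument("--duration", type=int, default=300, help="Test duration in seconds")
    parser.add_argument("--concurrency", type=int, default=10, help="Concurrent users")
    args = parser.parse_args()

    # Assumes the pipeline exports the scoring URL and key (e.g. from Key Vault);
    # alternatively resolve them with `az ml online-endpoint show` / `get-credentials`.
    endpoint_url = os.environ.get(
        "ENDPOINT_URL",
        f"https://{args.endpoint}.example-region.inference.ml.azure.com/score",  # placeholder URL
    )
    api_key = os.environ["AZURE_ML_API_KEY"]

    tester = LoadTester(endpoint_url, api_key)
    tester.run_load_test(duration=args.duration, concurrent_users=args.concurrency)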
Production Monitoring and Secrets Management
Track model performance post-deployment with scheduled monitoring pipelines and secure secret handling.
# monitoring-pipeline.yml
schedules:
  - cron: "0 */6 * * *"  # Every 6 hours
    displayName: 'Model Monitoring'
    branches:
      include:
        - main

jobs:
  - job: MonitorPerformance
    pool:
      vmImage: 'ubuntu-latest'
    steps:
      - task: AzureCLI@2
        displayName: 'Check Metrics'
        inputs:
          azureSubscription: 'Azure-ML-Service-Connection'
          scriptType: 'bash'  # AzureCLI@2 has no Python script type; run Python from bash
          scriptLocation: 'inlineScript'
          inlineScript: |
            python - <<'EOF'
            from azure.ai.ml import MLClient
            from azure.identity import DefaultAzureCredential

            ml_client = MLClient(
                DefaultAzureCredential(),
                subscription_id="<subscription-id>",
                resource_group_name="ml-production",
                workspace_name="ml-workspace",
            )
            # Assumes a metrics helper; in practice endpoint metrics are surfaced through Azure Monitor
            metrics = ml_client.online_endpoints.get_metrics(endpoint_name="llama-prod")
            p95_latency = metrics["RequestLatency"]["p95"]
            error_rate = metrics["ErrorRate"]["average"]
            if p95_latency > 1000:
                print(f"##vso[task.logissue type=warning]P95 latency {p95_latency}ms exceeds SLA")
            EOF
Store secrets in Azure Key Vault:
steps:
  - task: AzureKeyVault@2
    inputs:
      azureSubscription: 'Azure-ML-Service-Connection'
      KeyVaultName: 'ml-secrets'
      SecretsFilter: 'AZURE-ML-API-KEY'
      RunAsPreJob: true
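The task maps each secret to a pipeline variable (here $(AZURE-ML-API-KEY)), which later steps pass to test scripts through an env: mapping. For local runs outside the pipeline, the same secret can be read directly with the Key Vault SDK; a minimal sketch, assuming the default vault URL for the ml-secrets vault:
# Hypothetical local fallback: read the API key straight from Key Vault
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Vault URL derived from the KeyVaultName used in the pipeline task
client = SecretClient(
    vault_url="https://ml-secrets.vault.azure.net",
    credential=DefaultAzureCredential(),
)
api_key = client.get_secret("AZURE-ML-API-KEY").value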
The recommended branching strategy follows GitFlow: main for production, develop for integration, and feature branches for new work. Trigger builds on pull requests to develop, deploy to dev on merge to develop, and deploy to production on merge to main. This workflow provides quality gates at each stage.
Conclusion
Azure DevOps MLOps pipelines automate LLM deployment workflows from code commit to production monitoring. Automated testing catches issues early. Multi-stage deployments with manual gates prevent bad releases. Blue-green deployments eliminate downtime. Load testing validates performance at scale. Organizations implementing these patterns reduce deployment time by 60% while improving reliability. Start with the build pipeline, add automated tests, then implement multi-stage deployments with monitoring. Your team ships faster with fewer production incidents.