Monitor LLM Performance with Azure Insights
Learn how to monitor production LLM deployments on Azure using Azure Monitor, Log Analytics, and Application Insights to detect issues faster, reduce downtime, track GPU and latency metrics, and maintain reliable, high-performance AI services at scale.
TL;DR
- Complete monitoring reduces downtime by 75% and resolves issues 65% faster
- Log Analytics queries identify slowest requests, error patterns, and GPU trends
- Application Insights provides distributed tracing across tokenization, inference, and decoding
- Budget alerts notify teams when spending exceeds 80% of monthly threshold
Introduction
Production LLM deployments require robust monitoring to maintain reliability and performance. Latency spikes impact user experience. Error rates increase without warning. GPU memory leaks crash instances. Without proactive monitoring, you discover problems only when users complain and business impact accumulates.
Azure Monitor provides end-to-end observability designed for Azure ML and AKS deployments. Track performance metrics collected automatically from managed endpoints, monitor custom application metrics that matter to your business, centralize logs in Log Analytics for troubleshooting, trace requests across services with Application Insights, and set up real-time alerting with action groups that notify teams immediately. Built-in dashboards and workbooks visualize system health at a glance.
This guide covers integration points including Azure ML endpoints, AKS clusters, Container Apps, GPU metrics, application telemetry, and cost tracking. Organizations implementing these monitoring patterns detect issues 80% faster, reduce mean time to resolution by 65%, and achieve 99.9% uptime for production LLM services. Essential capabilities include automatic metric collection, custom application metrics, Log Analytics queries, Application Insights tracing, alert rules with thresholds, and actionable dashboards.
Metrics Collection and Configuration
Azure Monitor automatically collects metrics from Azure ML managed endpoints and AKS clusters. Enable diagnostic settings to capture detailed telemetry.
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="your-subscription-id",
    resource_group_name="your-resource-group",
    workspace_name="your-workspace",
)
# Enable diagnostic settings
diagnostic_config = {
"logs": [
{
"category": "AmlOnlineEndpointTrafficLog",
"enabled": True
},
{
"category": "AmlOnlineEndpointConsoleLog",
"enabled": True
}
],
"metrics": [
{
"category": "AllMetrics",
"enabled": True
}
],
"workspaceId": "/subscriptions/.../workspaces/ml-logs"
}
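The dictionary above mirrors the ARM shape of a diagnostic setting. One way to apply it is through the azure-mgmt-monitor SDK; a minimal sketch, with the endpoint resource ID and setting name as placeholders:
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

monitor_client = MonitorManagementClient(
    DefaultAzureCredential(), "your-subscription-id"
)

# Resource ID of the online endpoint to monitor (placeholder path)
endpoint_id = (
    "/subscriptions/your-subscription-id/resourceGroups/your-resource-group"
    "/providers/Microsoft.MachineLearningServices/workspaces/your-workspace"
    "/onlineEndpoints/llama-endpoint"
)

# Create or update the diagnostic setting using the config defined above
monitor_client.diagnostic_settings.create_or_update(
    resource_uri=endpoint_id,
    name="llm-endpoint-diagnostics",
    parameters={
        "logs": diagnostic_config["logs"],
        "metrics": diagnostic_config["metrics"],
        "workspace_id": diagnostic_config["workspaceId"],
    },
)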
Built-in endpoint metrics include:
- RequestLatency: end-to-end request duration
- RequestsPerMinute: traffic volume
- RequestHttpStatusCode: success and error rates
- CpuUtilization: CPU usage percentage
- MemoryUtilization: memory usage percentage
- GpuUtilization: GPU usage percentage, when applicable
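The same platform metrics can be pulled programmatically, which is useful for automated health checks. A sketch using the azure-monitor-query package; the endpoint_id resource ID reuses the placeholder defined earlier:
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

metrics_client = MetricsQueryClient(DefaultAzureCredential())

# Average request latency over the last hour, in 5-minute buckets
response = metrics_client.query_resource(
    endpoint_id,  # resource ID of the online endpoint (defined above)
    metric_names=["RequestLatency"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.AVERAGE],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.average)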
Track business-specific metrics with custom instrumentation:
from opencensus.ext.azure import metrics_exporter
from opencensus.stats import aggregation, measure, stats, view
from opencensus.tags import tag_map

# Set up the Azure Monitor metrics exporter
exporter = metrics_exporter.new_metrics_exporter(
    connection_string="InstrumentationKey=your-key"
)

# Define custom measures
tokens_generated = measure.MeasureInt(
    "tokens_generated",
    "Number of tokens generated",
    "tokens",
)
quality_score = measure.MeasureFloat(
    "quality_score",
    "Response quality score",
    "score",
)

# Use the shared stats singleton so the recorder and view manager stay in sync
stats_recorder = stats.stats.stats_recorder
view_manager = stats.stats.view_manager

# Create views and register them; measures without a view are never exported
tokens_view = view.View(
    "tokens_generated_view",
    "Total tokens generated",
    ["model", "environment"],
    tokens_generated,
    aggregation.SumAggregation(),  # sum matches "total tokens generated"
)
quality_view = view.View(
    "quality_score_view",
    "Latest response quality score",
    ["model", "environment"],
    quality_score,
    aggregation.LastValueAggregation(),
)
view_manager.register_view(tokens_view)
view_manager.register_view(quality_view)
view_manager.register_exporter(exporter)

# Record metrics for a single inference call
def track_inference(model_name, tokens, quality):
    mmap = stats_recorder.new_measurement_map()
    tmap = tag_map.TagMap()
    tmap.insert("model", model_name)
    tmap.insert("environment", "production")
    mmap.measure_int_put(tokens_generated, tokens)
    mmap.measure_float_put(quality_score, quality)
    mmap.record(tmap)
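With the views registered, each call to track_inference flows measurements to Azure Monitor on the exporter's interval. A usage example with illustrative values:
# Hypothetical model name and per-request values
track_inference(model_name="llama-3-8b", tokens=256, quality=0.92)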
Monitor GPU utilization in AKS deployments:
| where Namespace == "container.azm.ms/gpu"
| where Name == "gpu_utilization"
| summarize avg(Val) by bin(TimeGenerated, 5m), Computer
| render timechart
Dashboards and Visualization
Create dashboards that visualize key metrics for quick health assessment.
Azure Workbook for LLM monitoring:
"version": "Notebook/1.0",
"items": [
{
"type": 3,
"content": {
"version": "KqlItem/1.0",
"query": "AmlOnlineEndpointTrafficLog\n| where ResponseCode < 400\n| summarize AvgLatency=avg(TotalLatencyMs),\n P95Latency=percentile(TotalLatencyMs, 95),\n P99Latency=percentile(TotalLatencyMs, 99)\n by bin(TimeGenerated, 5m)\n| render timechart",
"title": "Request Latency P50, P95, P99",
"queryType": 0
}
}
]
}
Create custom metrics dashboard:
dashboard_properties = {
"lenses": {
"0": {
"order": 0,
"parts": {
"0": {
"position": {"x": 0, "y": 0, "colSpan": 6, "rowSpan": 4},
"metadata": {
"type": "Extension/HubsExtension/PartType/MonitorChartPart",
"settings": {
"content": {
"metrics": [{
"resourceId": "/subscriptions/.../endpoints/llama",
"name": "RequestLatency",
"aggregation": "Average"
}]
}
}
}
}
}
}
}
}
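Portal dashboards are ARM resources of type Microsoft.Portal/dashboards, so the definition above can be deployed with the generic resources API in azure-mgmt-resource. A sketch; the resource group, dashboard name, and API version are assumptions:
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

resource_client = ResourceManagementClient(
    DefaultAzureCredential(), "your-subscription-id"
)

# Dashboards are plain ARM resources under Microsoft.Portal/dashboards
poller = resource_client.resources.begin_create_or_update(
    resource_group_name="your-resource-group",
    resource_provider_namespace="Microsoft.Portal",
    parent_resource_path="",
    resource_type="dashboards",
    resource_name="llm-monitoring-dashboard",
    api_version="2020-09-01-preview",  # assumed API version
    parameters={
        "location": "eastus",
        "properties": dashboard_properties,
    },
)
dashboard = poller.result()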
Alerting and Automated Response
Configure alerts that notify teams when metrics exceed thresholds.
High latency alert:
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertAction,
    MetricAlertMultipleResourceMultipleMetricCriteria,
    MetricAlertResource,
    MetricCriteria,
)

alert_rule = MetricAlertResource(
    location="global",
    description="Alert when endpoint latency exceeds 1 second",
    severity=2,
    enabled=True,
    scopes=["/subscriptions/.../onlineEndpoints/llama-endpoint"],
    evaluation_frequency="PT1M",
    window_size="PT5M",
    criteria=MetricAlertMultipleResourceMultipleMetricCriteria(
        all_of=[
            MetricCriteria(
                name="HighLatency",
                metric_name="RequestLatency",
                operator="GreaterThan",
                threshold=1000,
                time_aggregation="Average",
            )
        ]
    ),
    actions=[
        MetricAlertAction(
            action_group_id="/subscriptions/.../actionGroups/ops-alerts"
        )
    ],
)
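Submit the rule with the monitor management client; the resource group and rule name below are placeholders:
monitor_client = MonitorManagementClient(
    DefaultAzureCredential(), "your-subscription-id"
)

# Create or update the metric alert in the target resource group
monitor_client.metric_alerts.create_or_update(
    resource_group_name="your-resource-group",
    rule_name="llm-endpoint-high-latency",
    parameters=alert_rule,
)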
Error rate alert based on log queries:
location="eastus",
description="Alert when error rate exceeds 5%",
enabled=True,
source=Source(
query="""
AmlOnlineEndpointTrafficLog
| summarize TotalRequests=count(),
Errors=countif(ResponseCode >= 400)
| extend ErrorRate = (Errors * 100.0) / TotalRequests
| where ErrorRate > 5
""",
data_source_id="/subscriptions/.../workspaces/ml-logs"
),
schedule=Schedule(
frequency_in_minutes=5,
time_window_in_minutes=10
),
action=Action(
severity="2",
azns_action=AzNsActionGroup(
action_group=["/subscriptions/.../actionGroups/critical-alerts"]
),
trigger=TriggerCondition(
threshold_operator="GreaterThan",
threshold=0
)
)
)
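Log-based rules go through the scheduled query rules API rather than metric alerts; a brief sketch with a placeholder rule name:
monitor_client.scheduled_query_rules.create_or_update(
    resource_group_name="your-resource-group",
    rule_name="llm-endpoint-error-rate",
    parameters=error_rate_alert,
)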
Alert on GPU temperature:
location="global",
description="Alert when GPU temperature exceeds 80°C",
severity=1,
enabled=True,
scopes=["/subscriptions/.../clusters/llm-cluster"],
evaluation_frequency="PT1M",
window_size="PT5M",
criteria=MetricAlertMultipleResourceMultipleMetricCriteria(
all_of=[
MetricCriteria(
name="HighGPUTemp",
metric_name="gpu_temperature",
operator="GreaterThan",
threshold=80,
time_aggregation="Average"
)
]
)
)
Log Analytics and Troubleshooting
Query logs to identify patterns and debug production issues.
Analyze slowest requests:
AmlOnlineEndpointTrafficLog
| where ResponseCode < 400
| where TotalLatencyMs > 1000
| project TimeGenerated,
TotalLatencyMs,
ModelLatencyMs,
RequestPayloadSize=RequestPayloadSizeInBytes
| order by TotalLatencyMs desc
| take 20
Detect error patterns:
| where LogLevel == "Error"
| extend ErrorType = extract("(\\w+Error|\\w+Exception)", 1, Message)
| summarize ErrorCount=count() by ErrorType, bin(TimeGenerated, 5m)
| render columnchart
Track token usage:
| where name == "tokens_generated"
| extend Model = tostring(customDimensions.model)
| summarize TotalTokens=sum(value),
AvgTokensPerRequest=avg(value)
by Model, bin(timestamp, 1h)
| render timechart
GPU utilization trends:
| where Namespace == "container.azm.ms/gpu"
| where Name in ("gpu_utilization", "gpu_memory_used_bytes")
| extend MetricValue = iif(Name == "gpu_memory_used_bytes", Val / 1024 / 1024 / 1024, Val)
| summarize avg(MetricValue) by Name, Computer, bin(TimeGenerated, 5m)
| render timechart
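These queries can also run from Python with the azure-monitor-query package, which is handy for scheduled health checks. A sketch; the workspace ID is a placeholder:
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

logs_client = LogsQueryClient(DefaultAzureCredential())

query = """
InsightsMetrics
| where Namespace == "container.azm.ms/gpu"
| where Name == "gpu_utilization"
| summarize AvgGpu=avg(Val) by Computer
"""

response = logs_client.query_workspace(
    workspace_id="your-workspace-id",
    query=query,
    timespan=timedelta(hours=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)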
Application Insights and Cost Monitoring
Implement distributed tracing for complex workflows and track infrastructure spending.
Enable Application Insights tracing:
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.trace.samplers import ProbabilitySampler
from opencensus.trace.tracer import Tracer

tracer = Tracer(
    exporter=AzureExporter(
        connection_string="InstrumentationKey=your-key"
    ),
    sampler=ProbabilitySampler(1.0),  # trace every request
)
def generate_response(prompt):
with tracer.span(name="inference") as span:
span.add_attribute("prompt_length", len(prompt))
with tracer.span(name="tokenize"):
tokens = tokenizer(prompt)
with tracer.span(name="model_generate"):
output = model.generate(tokens)
with tracer.span(name="decode"):
response = tokenizer.decode(output)
span.add_attribute("response_length", len(response))
return response
Track costs with Azure Monitor (the query below assumes billing events are exported to the AzureDiagnostics table with a CostInUSD field):
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.MACHINELEARNINGSERVICES"
| where Category == "ComputeInstanceEvent"
| summarize TotalCost=sum(todouble(CostInUSD)) by bin(TimeGenerated, 1d)
| render columnchart
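For authoritative spend figures, query the Cost Management API directly rather than logs. A sketch using azure-mgmt-costmanagement; the subscription scope is a placeholder:
from azure.identity import DefaultAzureCredential
from azure.mgmt.costmanagement import CostManagementClient
from azure.mgmt.costmanagement.models import (
    QueryAggregation,
    QueryDataset,
    QueryDefinition,
)

cost_client = CostManagementClient(DefaultAzureCredential())

scope = "/subscriptions/your-subscription-id"

# Daily actual cost for the current month
definition = QueryDefinition(
    type="ActualCost",
    timeframe="MonthToDate",
    dataset=QueryDataset(
        granularity="Daily",
        aggregation={"totalCost": QueryAggregation(name="Cost", function="Sum")},
    ),
)

result = cost_client.query.usage(scope=scope, parameters=definition)
for row in result.rows:
    print(row)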
Create budget alerts:
from datetime import datetime

from azure.mgmt.consumption.models import (
    Budget,
    BudgetTimePeriod,
    Notification,
)

budget = Budget(
    category="Cost",
    amount=10000,
    time_grain="Monthly",
    time_period=BudgetTimePeriod(start_date=datetime(2025, 1, 1)),
    notifications={
        "Actual_GreaterThan_80_Percent": Notification(
            enabled=True,
            operator="GreaterThan",
            threshold=80,
            contact_emails=["ops@company.com"],
        )
    },
)
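Budgets are created at a scope such as a subscription. A minimal sketch; the budget name is a placeholder:
from azure.identity import DefaultAzureCredential
from azure.mgmt.consumption import ConsumptionManagementClient

consumption_client = ConsumptionManagementClient(
    DefaultAzureCredential(), "your-subscription-id"
)

# Scope the budget to the whole subscription
consumption_client.budgets.create_or_update(
    scope="/subscriptions/your-subscription-id",
    budget_name="llm-monthly-budget",
    parameters=budget,
)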
Correlate logs across services using correlation IDs:
import logging
import uuid

import requests

logger = logging.getLogger(__name__)

def handle_request(request):
    correlation_id = str(uuid.uuid4())
    logger.info("Request received", extra={
        "correlation_id": correlation_id,
        "user_id": request.user_id,
    })
    # Propagate the ID downstream so every service logs the same value
    response = requests.post(
        "https://llama-service/generate",
        headers={"X-Correlation-ID": correlation_id},
        json=request.data,
    )
    return response
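On the receiving service, reuse the incoming header instead of minting a new ID so both sides log the same value. A framework-agnostic sketch; the request object's headers attribute is an assumption:
def handle_generate(request):
    # Reuse the caller's ID; fall back to a fresh one if the header is missing
    correlation_id = request.headers.get("X-Correlation-ID", str(uuid.uuid4()))
    logger.info("Generation started", extra={"correlation_id": correlation_id})
    # ... run inference and log completion with the same correlation_id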
Query across services:
search "{correlation_id}"
| project TimeGenerated, ServiceName=tostring(split(_ResourceId, "/")[-1]), Message
| order by TimeGenerated asc
Conclusion
Azure Monitor provides production-grade observability for LLM deployments running on Azure ML and AKS. Automatic metric collection captures endpoint performance without instrumentation, custom metrics track business-specific KPIs, Log Analytics enables powerful querying for troubleshooting, Application Insights traces requests across distributed services, and alerts notify teams before issues impact users. Organizations implementing complete monitoring reduce downtime by 75% and resolve issues 65% faster.
Start with diagnostic settings and built-in metrics. Add custom metrics for business KPIs. Create dashboards for visibility. Configure alerts for critical thresholds. Your production deployment stays healthy and performant.