18.1 Cloud Native Monitoring
Monitoring Machine Learning systems requires a paradigm shift from traditional DevOps monitoring. In standard software, if the HTTP response code is 200 OK and latency is low, the service is “healthy.” In ML systems, a model can return 200 OK with sub-millisecond latency while serving garbage predictions that cost the business millions.
This section covers the foundational layer: System and Application Monitoring using the native tools provided by the major clouds (AWS and GCP). We will explore the separation of concerns between Infrastructure Monitoring (L1) and Application Monitoring (L2), and how to properly instrument an inference container.
1. The Pyramid of Observability
We can view ML observability as a three-layer stack. You cannot fix L3 if L1 is broken.
- Infrastructure (L1): Is the server running? Is the GPU overheating?
- Metrics: CPU, RAM, Disk I/O, Network I/O, GPU Temperature, GPU Utilization.
- Tools: CloudWatch, Stackdriver, Node Exporter.
- Application (L2): Is the inference server healthy?
- Metrics: Latency (P50/P99), Throughput (RPS), Error Rate (HTTP 5xx), Queue Depth, Batch Size.
- Tools: Application Logs, Prometheus Custom Metrics.
- Data & Model (L3): Is the math correct?
- Metrics: Prediction Drift, Feature Skew, Confidence Distribution, Fairness.
- Tools: SageMaker Model Monitor, Vertex AI Monitoring, Evidently AI. (Covered in 18.3)
2. AWS CloudWatch: The Deep Dive
Amazon CloudWatch is the pervasive observability fabric of AWS. It is often misunderstood as “just a place where logs go,” but it is a powerful metric aggregation engine.
2.1. Metrics, Namespaces, and Dimensions
Understanding the data model is critical to avoiding high costs and confusing dashboards.
- Namespace: A container for metrics (e.g., `AWS/SageMaker` or `MyApp/Production`).
- Metric Name: The variable being measured (e.g., `ModelLatency`).
- Dimension: Name/value pairs used to filter the metric (e.g., `EndpointName = 'fraud-detector-v1'`, `Variant = 'Production'`).
The Cardinality Trap: A common MLOps mistake is to include high-cardinality data in dimensions.
- Bad Idea: Including `UserID` or `RequestID` as a dimension.
- Result: CloudWatch creates a separate metric series for every single user. Your bill will explode, and the dashboard will be unreadable.
- Rule: Dimensions are for Infrastructure Topology (Region, InstanceType, ModelVersion), not for data content.
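A minimal sketch of the rule with boto3's `put_metric_data` (the namespace, metric, and dimension values are illustrative): topology stays in dimensions; per-request identifiers never do.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Good: low-cardinality, topology-only dimensions.
cloudwatch.put_metric_data(
    Namespace="MyApp/Production",
    MetricData=[{
        "MetricName": "InferenceLatency",
        "Dimensions": [
            {"Name": "ModelVersion", "Value": "v2.1"},
            {"Name": "Region", "Value": "us-east-1"},
        ],
        "Value": 42.0,
        "Unit": "Milliseconds",
    }],
)

# Bad (do NOT do this): one dimension value per user means one metric series per user.
# {"Name": "UserID", "Value": "user-8812349"}  # <- cardinality explosion
```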
2.2. Embedded Metric Format (EMF)
Emitting custom metrics usually involves an API call (PutMetricData), which is slow (HTTP request) and expensive.
EMF allows you to emit metrics as logs. The CloudWatch agent parses the logs asynchronously and creates the metrics for you.
Implementation in Python:
import time

from aws_embedded_metrics import metric_scope

@metric_scope
def inference_handler(event, context, metrics):
    metrics.set_namespace("MLOps/FraudDetection")
    metrics.put_dimensions({"ModelVersion": "v2.1"})

    start_time = time.time()
    prediction = run_inference(event)  # ... Run Inference (placeholder for your model call) ...
    latency = (time.time() - start_time) * 1000
    probability = prediction[0]

    # Emit metrics (written as structured EMF log lines; CloudWatch extracts them asynchronously)
    metrics.put_metric("InferenceLatency", latency, "Milliseconds")
    metrics.put_metric("FraudProbability_Sum", probability, "None")

    # Also log high-cardinality data as properties (Not Dimensions!)
    metrics.set_property("RequestId", context.aws_request_id)
    metrics.set_property("UserId", event['user_id'])

    return {"probability": probability}
2.3. Standard SageMaker Metrics
When you deploy a standard SageMaker Endpoint, AWS automatically emits invocation metrics to the AWS/SageMaker namespace (and instance metrics such as CPU and memory to /aws/sagemaker/Endpoints):
| Metric | Meaning | Debugging Use Case |
|---|---|---|
| ModelLatency | Time taken by your container code (Flask/TorchServe). | If high, optimize your model (Chapter 11) or code. |
| OverheadLatency | Time added by AWS (Network + Auth + Queuing). | If high (>100 ms) while ModelLatency is low, requests are queuing in front of your container, large payloads are saturating the network, or you are hitting the throughput limit of the instance type. |
| Invocations | Total requests. | Sudden drop to zero? Check upstream client health. |
| Invocation5XX | Server-side errors (Code Crash). | Check logs for stack traces. |
| Invocation4XX | Client-side errors (Bad payload). | Check if client is sending image/png when model expects application/json. |
| CPUUtilization / MemoryUtilization | Compute health. | If Memory > 90%, you are at risk of OOM Kill. |
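A small boto3 sketch for splitting “my code is slow” from “AWS is slow” using the table above (the endpoint name is illustrative; remember both latency metrics are reported in microseconds):

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

def p99_microseconds(metric_name, endpoint_name="fraud-detector-prod"):
    """P99 of a SageMaker latency metric (microseconds) over the last hour."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=3600,
        ExtendedStatistics=["p99"],
    )
    points = resp["Datapoints"]
    return points[0]["ExtendedStatistics"]["p99"] if points else None

# Compare where the time is going: inside your container vs. AWS overhead.
model = p99_microseconds("ModelLatency")
overhead = p99_microseconds("OverheadLatency")
print(f"p99 ModelLatency: {model} µs, p99 OverheadLatency: {overhead} µs")
```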
2.4. Infrastructure as Code: Alerting (Terraform)
You should define your alerts in code, not in the console.
resource "aws_cloudwatch_metric_alarm" "high_latency" {
alarm_name = "High_Latency_Alarm_FraudModel"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "3"
metric_name = "ModelLatency"
namespace = "AWS/SageMaker"
period = "60"
statistic = "p99"
threshold = "500000" # 500ms (SageMaker invokes are in microseconds!)
alarm_description = "This metric monitors endpoint latency"
dimensions = {
EndpointName = "fraud-detector-prod"
VariantName = "AllTraffic"
}
alarm_actions = [aws_sns_topic.pagerduty.arn]
}
3. GCP Cloud Monitoring (Stackdriver)
Google Cloud Operations Suite (formerly Stackdriver) integrates deeply with GKE and Vertex AI.
3.1. The Google SRE “Golden Signals”
Google SRE methodology emphasizes four signals that define service health. Every dashboard should be anchored on these.
- Latency: The time it takes to service a request.
  - Metric: `request_latency_seconds_bucket` (Histogram).
  - Visualization: Heatmaps are better than averages.
- Traffic: A measure of how much demand is being placed on the system.
  - Metric: `requests_per_second`.
- Errors: The rate of requests that fail.
  - Metric: `response_status` codes.
  - Crucial: Distinguish between “Explicit” errors (500) and “Implicit” errors (200 OK but content is empty).
- Saturation: How “full” is your service?
  - Metric: GPU Duty Cycle, Memory Usage, or Thread Pool queue depth.
  - Action: Saturation metrics drive Auto-scaling triggers.
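As a concrete starting point, the four signals can be kept as dashboard-as-code queries. A sketch in Python; the metric names (`request_latency_seconds_bucket`, `http_requests_total`, `gpu_duty_cycle`) are placeholders for whatever your exporters actually emit:

```python
# Golden-signal queries for a Grafana / Cloud Monitoring dashboard, as PromQL strings.
GOLDEN_SIGNAL_QUERIES = {
    "latency_p99": (
        "histogram_quantile(0.99, "
        "sum(rate(request_latency_seconds_bucket[5m])) by (le))"
    ),
    "traffic_rps": "sum(rate(http_requests_total[5m]))",
    "error_ratio": (
        'sum(rate(http_requests_total{status=~"5.."}[5m])) '
        "/ sum(rate(http_requests_total[5m]))"
    ),
    "saturation_gpu": "avg(gpu_duty_cycle)",
}

for name, promql in GOLDEN_SIGNAL_QUERIES.items():
    print(f"{name}: {promql}")
```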
3.2. Practical GKE Monitoring: The Sidecar Pattern
Model servers like TensorFlow Serving (TFS) or TorchServe can expose Prometheus-formatted metrics (TFS needs the monitoring config shown below). How do we get them into GCP Monitoring?
- Pattern: Run a “Prometheus Sidecar” in the same Pod as the inference container.
Kubernetes Deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tf-serving
  template:
    metadata:
      labels:
        app: tf-serving
    spec:
      containers:
      # 1. The Inference Container
      - name: tf-serving
        image: tensorflow/serving:latest-gpu
        ports:
        - containerPort: 8501   # REST
        - containerPort: 8502   # Monitoring
        env:
        - name: MONITORING_CONFIG
          value: "/config/monitoring_config.txt"
      # 2. The Sidecar (OpenTelemetry Collector)
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib:latest
        args: ["--config=/etc/otel-collector-config.yaml"]
        volumeMounts:
        - name: otel-config
          mountPath: /etc/otel-collector-config.yaml
          subPath: config.yaml
      volumes:
      - name: otel-config
        configMap:
          name: otel-collector-config   # ConfigMap holding the sidecar config below
Sidecar Config (otel-config.yaml):
receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'tf-serving'
        scrape_interval: 10s
        static_configs:
        - targets: ['localhost:8502']

exporters:
  googlecloud:
    project: my-gcp-project

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [googlecloud]
3.3. Distributed Tracing (AWS X-Ray / Cloud Trace)
When you have a chain of models (Pipeline), metrics are not enough. You need traces.
- Scenario: User uploads image -> Preprocessing (Lambda) -> Embedding Model (SageMaker) -> Vector Search (OpenSearch) -> Re-ranking (SageMaker) -> Response.
- Problem: Total latency is 2s. Who is slow?
- Solution: Pass a `Trace-ID` header through every hop.
Python Middleware Example:
from flask import Flask, jsonify
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware

app = Flask(__name__)

# Instruments the Flask app to accept 'X-Amzn-Trace-Id' headers
XRayMiddleware(app, xray_recorder)

@app.route('/predict', methods=['POST'])
def predict():
    # Start a subsegment for the expensive part
    with xray_recorder.in_subsegment('ModelInference'):
        model_output = run_heavy_inference_code()  # placeholder for your model call
    return jsonify(model_output)
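If the same handler also calls other AWS services (e.g., the SageMaker runtime hosting the embedding model), the X-Ray SDK can patch those client libraries so each downstream hop shows up as its own subsegment in the trace. A minimal sketch; the endpoint name is illustrative:

```python
import boto3
from aws_xray_sdk.core import patch_all

# Patch supported libraries (boto3/botocore, requests, ...) so their calls
# are recorded as subsegments under the current trace.
patch_all()

runtime = boto3.client("sagemaker-runtime")

def call_embedding_model(payload: bytes):
    # Runs inside a request handled by the Flask middleware above,
    # so this InvokeEndpoint call appears as its own subsegment.
    return runtime.invoke_endpoint(
        EndpointName="embedding-model-prod",  # illustrative name
        ContentType="application/json",
        Body=payload,
    )
```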
4. Dashboarding Methodology
The goal of a dashboard is to answer questions, not to look pretty.
4.1. The “Morning Coffee” Dashboard
Audience: Managers / Lead Engineers. Scope: High level health.
- Global Traffic: Total RPS across all regions.
- Global Valid Request Rate: % of 200 OK.
- Cost: Estimated daily spend (GPU hours).
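This dashboard can itself live in code. A hedged sketch using boto3's `put_dashboard`; the dashboard name, endpoint name, and region are illustrative, and the cost widget (Cost Explorer or a billing metric) is omitted here:

```python
import json

import boto3

cloudwatch = boto3.client("cloudwatch")

# Two widgets: global traffic and error counts for a sample endpoint.
# (The full CloudWatch metric names carry an "Errors" suffix.)
dashboard_body = {
    "widgets": [
        {
            "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Global Traffic (Invocations)",
                "metrics": [["AWS/SageMaker", "Invocations",
                             "EndpointName", "fraud-detector-prod",
                             "VariantName", "AllTraffic"]],
                "stat": "Sum", "period": 300, "region": "us-east-1",
            },
        },
        {
            "type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Errors (4xx / 5xx)",
                "metrics": [
                    ["AWS/SageMaker", "Invocation4XXErrors",
                     "EndpointName", "fraud-detector-prod", "VariantName", "AllTraffic"],
                    ["AWS/SageMaker", "Invocation5XXErrors",
                     "EndpointName", "fraud-detector-prod", "VariantName", "AllTraffic"],
                ],
                "stat": "Sum", "period": 300, "region": "us-east-1",
            },
        },
    ]
}

cloudwatch.put_dashboard(
    DashboardName="MorningCoffee-ML",
    DashboardBody=json.dumps(dashboard_body),
)
```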
4.2. The “Debug” Dashboard
Audience: On-call Engineers. Scope: Per-instance granularity.
- Latency Heatmap: Visualize the distribution of latency. Can you see a bi-modal distribution? (Fast cache hits vs slow DB lookups).
- Memory Leak Tracker: Slope of Memory Usage over 24 hours.
- Thread Count: Is the application blocked on I/O?
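For the memory-leak tracker, one rough way to compute “slope of memory usage over 24 hours” is to pull the datapoints from CloudWatch and fit a line with numpy. A sketch: the /aws/sagemaker/Endpoints namespace hosts the instance-level metrics, while the endpoint name and the 0.5%/hour threshold are illustrative.

```python
from datetime import datetime, timedelta

import boto3
import numpy as np

cloudwatch = boto3.client("cloudwatch")

def memory_slope_per_hour(endpoint_name="fraud-detector-prod"):
    """Linear slope of MemoryUtilization over the last 24 hours (% per hour)."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="/aws/sagemaker/Endpoints",
        MetricName="MemoryUtilization",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        StartTime=datetime.utcnow() - timedelta(hours=24),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    if len(points) < 2:
        return 0.0
    hours = np.array([(p["Timestamp"] - points[0]["Timestamp"]).total_seconds() / 3600
                      for p in points])
    values = np.array([p["Average"] for p in points])
    slope, _ = np.polyfit(hours, values, 1)  # percentage points of memory per hour
    return slope

slope = memory_slope_per_hour()
if slope > 0.5:  # illustrative threshold: +0.5% memory per hour
    print(f"Possible memory leak: memory grows {slope:.2f}%/hour")
```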
5. Alerting Strategies: Signals vs. Noise
The goal of alerting is Actionability. If an alert fires and the engineer just deletes the email, that alert is technical debt.
5.1. Symptom-based Alerting
Alert on the symptom (User pain), not the cause.
- Bad Alert: “CPU usage > 90%”.
- Why? High CPU may simply mean the hardware is being used efficiently. If latency is fine, 90% CPU is good ROI.
- Good Alert: “P99 Latency > 500ms”.
- Why? The user is suffering. Now the engineer investigates why (maybe it’s CPU, maybe it’s Network).
5.2. Low Throughput Anomaly (The Silent Failure)
What if the system stops receiving requests?
- A standard threshold alert (`InvocationCount < 10`) fails because low traffic is normal at 3 AM.
- Solution: CloudWatch Anomaly Detection.
  - It uses a Random Cut Forest (ML) algorithm to learn the daily/weekly seasonality of your metric.
  - It creates a dynamic “band” of expected values.
  - Alert: “Invocations is outside the expected band” (lower than expected for this time of day).
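Such an alarm can also be defined in code with a metric-math expression. A hedged boto3 sketch; the endpoint name, SNS ARN, and the two-standard-deviation band width are illustrative:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when Invocations drops below the lower edge of the learned band.
cloudwatch.put_metric_alarm(
    AlarmName="FraudDetector-Invocations-AnomalouslyLow",
    ComparisonOperator="LessThanLowerThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="band",
    Metrics=[
        {
            "Id": "band",
            "Expression": "ANOMALY_DETECTION_BAND(invocations, 2)",
            "Label": "Expected Invocations (2 std dev)",
            "ReturnData": True,
        },
        {
            "Id": "invocations",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SageMaker",
                    "MetricName": "Invocations",
                    "Dimensions": [
                        {"Name": "EndpointName", "Value": "fraud-detector-prod"},
                        {"Name": "VariantName", "Value": "AllTraffic"},
                    ],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": True,
        },
    ],
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:pagerduty"],  # illustrative ARN
)
```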
5.3. Severity Levels
- SEV-1 (PagerDuty): The system is down or hurting customers.
- Examples: Endpoint 5xx rate > 1%, Latency P99 > 2s, OOM Loop.
- Response: Immediate wake up (24/7).
- SEV-2 (Ticket): The system is degrading or showing signs of future failure.
- Examples: Single GPU failure (redundancy handling it), Disk 80% full, Latency increasing slowly.
- Response: Fix during business hours.
In the next section, we dig deeper into the specific hardware metrics of the GPU that drive those Saturation signals.
6. Complete Prometheus Setup for ML Services
6.1. Instrumenting a Python Inference Server
# inference_server.py
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import numpy as np
# Define metrics
INFERENCE_COUNTER = Counter(
'ml_inference_requests_total',
'Total inference requests',
['model_version', 'status']
)
INFERENCE_LATENCY = Histogram(
'ml_inference_latency_seconds',
'Inference latency in seconds',
['model_version'],
buckets=[0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
MODEL_CONFIDENCE = Histogram(
'ml_model_confidence_score',
'Model prediction confidence',
['model_version'],
buckets=[0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1.0]
)
ACTIVE_REQUESTS = Gauge(
'ml_active_requests',
'Number of requests currently being processed'
)
class MLInferenceServer:
def __init__(self, model, model_version="v1.0"):
self.model = model
self.model_version = model_version
def predict(self, input_data):
ACTIVE_REQUESTS.inc()
try:
start_time = time.time()
# Run inference
prediction = self.model.predict(input_data)
confidence = float(np.max(prediction))
# Record metrics
latency = time.time() - start_time
INFERENCE_LATENCY.labels(model_version=self.model_version).observe(latency)
MODEL_CONFIDENCE.labels(model_version=self.model_version).observe(confidence)
INFERENCE_COUNTER.labels(
model_version=self.model_version,
status='success'
).inc()
return {
'prediction': prediction.tolist(),
'confidence': confidence,
'latency_ms': latency * 1000
}
except Exception as e:
INFERENCE_COUNTER.labels(
model_version=self.model_version,
status='error'
).inc()
raise
finally:
ACTIVE_REQUESTS.dec()
# Start Prometheus metrics endpoint
if __name__ == "__main__":
start_http_server(8000) # Metrics available at :8000/metrics
print("Prometheus metrics server started on :8000")
# ... rest of Flask/FastAPI server code
6.2. Prometheus Scrape Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'ml-prod-us-east-1'
scrape_configs:
# SageMaker endpoints
- job_name: 'sagemaker-endpoints'
static_configs:
- targets:
- 'fraud-detector:8080'
- 'recommendation-engine:8080'
relabel_configs:
- source_labels: [__address__]
target_label: endpoint_name
regex: '([^:]+):.*'
# Custom model servers
- job_name: 'custom-ml-servers'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- ml-production
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: ml-inference-server
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod_name
- source_labels: [__meta_kubernetes_pod_label_model_version]
target_label: model_version
# Node exporter (infrastructure metrics)
- job_name: 'node-exporter'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
# Alert rules
rule_files:
- '/etc/prometheus/alerts/*.yml'
6.3. Alert Rules
# alerts/ml_service_alerts.yml
groups:
- name: ml_inference_alerts
interval: 30s
rules:
# High error rate
- alert: HighInferenceErrorRate
expr: |
rate(ml_inference_requests_total{status="error"}[5m])
/
rate(ml_inference_requests_total[5m])
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on {{ $labels.model_version }}"
description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
# High latency
- alert: HighInferenceLatency
expr: |
histogram_quantile(0.99,
rate(ml_inference_latency_seconds_bucket[5m])
) > 1.0
for: 10m
labels:
severity: warning
annotations:
summary: "High P99 latency on {{ $labels.model_version }}"
description: "P99 latency is {{ $value }}s"
# Low confidence predictions
- alert: LowModelConfidence
expr: |
histogram_quantile(0.50,
rate(ml_model_confidence_score_bucket[1h])
) < 0.7
for: 30m
labels:
severity: warning
annotations:
summary: "Model confidence degrading on {{ $labels.model_version }}"
description: "Median confidence is {{ $value }}, may indicate drift"
# Service down
- alert: InferenceServiceDown
expr: up{job="custom-ml-servers"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Inference service {{ $labels.pod_name }} is down"
description: "Pod has been down for more than 5 minutes"
7. CloudWatch Insights Query Library
7.1. Finding Slowest Requests
# CloudWatch Logs Insights query
# Find the slowest 10 requests in the last hour
fields @timestamp, requestId, modelLatency, userId
| filter modelLatency > 1000 # more than 1 second (modelLatency logged in ms)
| sort modelLatency desc
| limit 10
7.2. Error Rate by Model Version
fields @timestamp, modelVersion, statusCode
| stats count() as total,
sum(statusCode >= 500) as errors
by modelVersion
| fields modelVersion,
errors / total * 100 as error_rate_percent
| sort error_rate_percent desc
7.3. Latency Percentiles Over Time
fields @timestamp, modelLatency
| filter modelVersion = "v2.1"
| stats pct(modelLatency, 50) as p50,
pct(modelLatency, 90) as p90,
pct(modelLatency, 99) as p99
by bin(5m)
7.4. Anomaly Detection Query
# Find hours where request count deviated >2 stddev from average
# Step 1: request count per hour
fields @timestamp
| stats count() as request_count by bin(1h)

# Step 2 (run separately): Logs Insights cannot compare a per-bin value with an
# aggregate over all bins in a single pass, so compute avg/stddev over the Step 1
# output and flag hours where abs(request_count - avg_requests) > 2 * stddev_requests
8. Defining SLIs and SLOs for ML Systems
8.1. SLI (Service Level Indicators) Examples
| SLI | Query | Good Target |
|---|---|---|
| Availability | sum(successful_requests) / sum(total_requests) | 99.9% |
| Latency | P99(inference_latency_ms) | < 200ms |
| Freshness | now() - last_model_update_timestamp | < 7 days |
| Quality | avg(model_confidence) | > 0.85 |
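The availability and latency SLIs fall straight out of the metrics defined in Section 6.1; freshness needs one extra gauge. A sketch with prometheus_client (the metric and function names here are our own, not a standard):

```python
import time

from prometheus_client import Gauge

# Freshness SLI: when was the serving model last updated?
MODEL_LAST_UPDATE = Gauge(
    "ml_model_last_update_timestamp_seconds",
    "Unix timestamp of the last model deployment",
    ["model_version"],
)

def record_model_deployment(model_version: str):
    # Call this from your deployment hook when a new model goes live.
    MODEL_LAST_UPDATE.labels(model_version=model_version).set(time.time())

# The SLI is then a query over this gauge, e.g. in PromQL:
#   time() - ml_model_last_update_timestamp_seconds < 7 * 24 * 3600
record_model_deployment("v2.1")
```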
8.2. SLO Definition (YAML)
# slo_definitions.yml
apiVersion: monitoring.google.com/v1
kind: ServiceLevelObjective
metadata:
name: fraud-detector-availability
spec:
displayName: "Fraud Detector 99.9% Availability"
serviceLevelIndicator:
requestBased:
goodTotalRatio:
goodServiceFilter: |
metric.type="custom.googleapis.com/inference/requests"
metric.label.status="success"
totalServiceFilter: |
metric.type="custom.googleapis.com/inference/requests"
goal: 0.999
rollingPeriod: 2592000s # 30 days
8.3. Error Budget Calculation
# error_budget_calculator.py
from dataclasses import dataclass
from datetime import datetime, timedelta
@dataclass
class SLO:
name: str
target: float # e.g., 0.999 for 99.9%
window_days: int
class ErrorBudgetCalculator:
def __init__(self, slo: SLO):
self.slo = slo
def calculate_budget(self, total_requests: int, failed_requests: int):
"""
Calculate remaining error budget.
"""
# Current availability
current_availability = (total_requests - failed_requests) / total_requests
# Allowed failures
allowed_failures = total_requests * (1 - self.slo.target)
# Budget remaining
budget_remaining = allowed_failures - failed_requests
budget_percent = (budget_remaining / allowed_failures) * 100
# Time to exhaustion
failure_rate = failed_requests / total_requests
if failure_rate > (1 - self.slo.target):
# Burning budget
time_to_exhaustion = self.estimate_exhaustion_time(
budget_remaining,
failure_rate,
total_requests
)
else:
time_to_exhaustion = None
return {
'slo_target': self.slo.target,
'current_availability': current_availability,
'budget_remaining': budget_remaining,
'budget_percent': budget_percent,
'status': 'healthy' if budget_percent > 10 else 'critical',
'time_to_exhaustion_hours': time_to_exhaustion
}
def estimate_exhaustion_time(self, budget_remaining, failure_rate, total_requests):
# Simplified linear projection
failures_per_hour = failure_rate * (total_requests / self.slo.window_days / 24)
return budget_remaining / failures_per_hour if failures_per_hour > 0 else None
# Usage
slo = SLO(name="Fraud Detector", target=0.999, window_days=30)
calculator = ErrorBudgetCalculator(slo)
result = calculator.calculate_budget(
total_requests=1000000,
failed_requests=1500
)
print(f"Error budget remaining: {result['budget_percent']:.1f}%")
if result['time_to_exhaustion_hours']:
print(f"⚠️ Budget will be exhausted in {result['time_to_exhaustion_hours']:.1f} hours!")
9. Incident Response Playbooks
9.1. Runbook: High Latency Incident
# Runbook: ML Inference High Latency
## Trigger
- P99 latency > 500ms for 10+ minutes
- Alert: `HighInferenceLatency` fires
## Severity
**SEV-2** (Degraded service, users experiencing slowness)
## Investigation Steps
### 1. Check if it's a global issue
```bash
# CloudWatch: P99 ModelLatency for the endpoint over the last hour
aws cloudwatch get-metric-statistics \
  --namespace AWS/SageMaker \
  --metric-name ModelLatency \
  --dimensions Name=EndpointName,Value=fraud-detector-prod Name=VariantName,Value=AllTraffic \
  --statistics Average \
  --extended-statistics p99 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300
```
- If all regions are slow → likely a model issue or infrastructure.
- If a single region is slow → network or regional infrastructure.
### 2. Check for deployment changes
# Check recent deployments in the last 2 hours
aws sagemaker list-endpoint-configs \
  --creation-time-after $(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --sort-by CreationTime

Recent deployment? → Potential regression, consider rollback.
### 3. Check instance health
# CPU/Memory utilization (instance metrics live in the /aws/sagemaker/Endpoints namespace)
aws cloudwatch get-metric-statistics \
  --namespace /aws/sagemaker/Endpoints \
  --metric-name CPUUtilization \
  --dimensions Name=EndpointName,Value=fraud-detector-prod Name=VariantName,Value=AllTraffic \
  --statistics Average \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300

- CPU > 90%? → Scale out (increase instance count).
- Memory > 85%? → Risk of OOM, check for a memory leak.
### 4. Check input data characteristics
# Sample recent captured requests, check the input size distribution
import json
import boto3

s3 = boto3.client('s3')
# recent_keys / bucket: S3 keys and bucket of the last 100 captured requests (listed elsewhere)
for key in recent_keys:
    obj = s3.get_object(Bucket=bucket, Key=key)
    request = json.loads(obj['Body'].read())
    print(f"Input size: {len(request['features'])} features")

Unusual input sizes? → May indicate upstream data corruption.
## Mitigation Options

**Option A: Scale Out (Increase Instances)**
aws sagemaker update-endpoint \
  --endpoint-name fraud-detector-prod \
  --endpoint-config-name fraud-detector-config-scaled

ETA: 5-10 minutes. Risk: Low.

**Option B: Rollback to Previous Version**
aws sagemaker update-endpoint \
  --endpoint-name fraud-detector-prod \
  --endpoint-config-name fraud-detector-config-v1.9-stable \
  --retain-deployment-config

ETA: 3-5 minutes. Risk: Medium (may reintroduce old bugs).
**Option C: Enable Caching**
If latency is due to repeated, similar requests, add a Redis cache in front of SageMaker (see the sketch below).

ETA: ~30 minutes (code deploy). Risk: Medium (cache invalidation complexity).
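A hypothetical sketch of that cache wrapper (the Redis host, endpoint name, and TTL are illustrative; the TTL must be shorter than the rate at which the underlying features change):

```python
# cache_wrapper.py -- sketch of Option C
import hashlib
import json

import boto3
import redis

r = redis.Redis(host="ml-cache.internal", port=6379)  # illustrative host
runtime = boto3.client("sagemaker-runtime")
CACHE_TTL_SECONDS = 300

def cached_predict(payload: dict) -> dict:
    # Key on a stable hash of the request payload.
    key = "pred:" + hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()

    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    response = runtime.invoke_endpoint(
        EndpointName="fraud-detector-prod",
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    result = json.loads(response["Body"].read())
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(result))
    return result
```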
## Post-Incident Review
- Document the root cause
- Update alerts if this was a false positive
- Add monitoring for the specific failure mode
9.2. Runbook: Model Accuracy Degradation
# Runbook: Model Accuracy Degradation
## Trigger
- Business metrics show increased fraud escapes
- Median prediction confidence < 0.7
## Investigation
### 1. Compare recent vs baseline predictions
```python
from scipy.stats import ks_2samp

# Pull samples from production and the offline validation baseline
# (get_predictions / load_validation_set_predictions are helpers defined elsewhere)
recent_predictions = get_predictions(hours=24)
baseline_predictions = load_validation_set_predictions()

# Compare confidence distributions
statistic, p_value = ks_2samp(
    recent_predictions['confidence'],
    baseline_predictions['confidence']
)
if p_value < 0.05:
    print("⚠️ Significant distribution shift detected")
```
### 2. Check for data drift
→ See Chapter 18.3 for detailed drift analysis.
### 3. Check model version
# Verify the correct model image is deployed
aws sagemaker describe-endpoint --endpoint-name fraud-detector-prod \
  | jq '.ProductionVariants[0].DeployedImages[0].SpecifiedImage'
## Mitigation
- Trigger the retraining pipeline
- Deploy a shadow model trained on recent data
- Consider falling back to a rule-based system temporarily
10. Monitoring Automation Scripts
10.1. Auto-Scaling Based on Queue Depth
```python
# autoscaler.py
import time
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client('cloudwatch')
sagemaker = boto3.client('sagemaker')

def get_queue_depth(endpoint_name):
    # OverheadLatency (time spent outside the container) is used here
    # as a rough proxy for requests queuing in front of the endpoint.
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/SageMaker',
        MetricName='OverheadLatency',
        Dimensions=[{'Name': 'EndpointName', 'Value': endpoint_name}],
        StartTime=datetime.utcnow() - timedelta(minutes=5),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Average']
    )
    return response['Datapoints'][0]['Average'] if response['Datapoints'] else 0

def get_current_instance_count(endpoint_name):
    endpoint = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    return endpoint['ProductionVariants'][0]['CurrentInstanceCount']

def scale_endpoint(endpoint_name, target_instance_count):
    # Get current config
    endpoint = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    current_config = endpoint['EndpointConfigName']

    # Create new config with updated instance count
    new_config_name = f"{endpoint_name}-scaled-{int(time.time())}"
    # ... create new endpoint config with target_instance_count ...

    # Update endpoint
    sagemaker.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=new_config_name
    )
    print(f"Scaling {endpoint_name} to {target_instance_count} instances")

def autoscale_loop():
    while True:
        queue_depth = get_queue_depth('fraud-detector-prod')
        current_count = get_current_instance_count('fraud-detector-prod')

        if queue_depth > 100:  # Queue building up
            scale_endpoint('fraud-detector-prod', current_count + 1)
        elif queue_depth < 10 and current_count > 1:  # Under-utilized
            scale_endpoint('fraud-detector-prod', current_count - 1)

        time.sleep(60)  # Check every minute

if __name__ == "__main__":
    autoscale_loop()
```
10.2. Health Check Daemon
# health_checker.py
import requests
import time
from datetime import datetime
ENDPOINTS = [
{'name': 'fraud-detector', 'url': 'https://api.company.com/v1/fraud/predict'},
{'name': 'recommendation', 'url': 'https://api.company.com/v1/recommend'},
]
def health_check(endpoint):
try:
start = time.time()
response = requests.post(
endpoint['url'],
json={'dummy': 'data'},
timeout=5
)
latency = (time.time() - start) * 1000
return {
'endpoint': endpoint['name'],
'status': 'healthy' if response.status_code == 200 else 'unhealthy',
'latency_ms': latency,
'timestamp': datetime.utcnow().isoformat()
}
except Exception as e:
return {
'endpoint': endpoint['name'],
'status': 'error',
'error': str(e),
'timestamp': datetime.utcnow().isoformat()
}
def monitor_loop():
while True:
for endpoint in ENDPOINTS:
result = health_check(endpoint)
# Push to monitoring system
publish_metric(result)
if result['status'] != 'healthy':
send_alert(result)
time.sleep(30) # Check every 30 seconds
if __name__ == "__main__":
monitor_loop()
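`publish_metric` and `send_alert` are left as hooks in the listing above. One way to fill in `publish_metric` is to push the result into CloudWatch as custom metrics (the namespace and metric names below are our own choice); `send_alert` could post to an SNS topic in the same spirit:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_metric(result: dict):
    # Push the health-check outcome as custom CloudWatch metrics.
    cloudwatch.put_metric_data(
        Namespace="MLOps/HealthChecks",
        MetricData=[
            {
                "MetricName": "EndpointUp",
                "Dimensions": [{"Name": "Endpoint", "Value": result["endpoint"]}],
                "Value": 1.0 if result["status"] == "healthy" else 0.0,
                "Unit": "Count",
            },
            {
                "MetricName": "HealthCheckLatency",
                "Dimensions": [{"Name": "Endpoint", "Value": result["endpoint"]}],
                "Value": result.get("latency_ms", 0.0),
                "Unit": "Milliseconds",
            },
        ],
    )
```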
11. Cost Monitoring for ML Infrastructure
11.1. Cost Attribution by Model
# cost_tracker.py
import boto3
from datetime import datetime, timedelta
ce = boto3.client('ce') # Cost Explorer
def get_ml_costs(start_date, end_date):
response = ce.get_cost_and_usage(
TimePeriod={
'Start': start_date,
'End': end_date
},
Granularity='DAILY',
Filter={
'Dimensions': {
'Key': 'SERVICE',
'Values': ['Amazon SageMaker']
}
},
Metrics=['UnblendedCost'],
GroupBy=[
{'Type': 'TAG', 'Key': 'ModelName'},
{'Type': 'DIMENSION', 'Key': 'INSTANCE_TYPE'}
]
)
costs = {}
for result in response['ResultsByTime']:
date = result['TimePeriod']['Start']
for group in result['Groups']:
model_name = group['Keys'][0]
instance_type = group['Keys'][1]
cost = float(group['Metrics']['UnblendedCost']['Amount'])
if model_name not in costs:
costs[model_name] = {}
costs[model_name][instance_type] = costs[model_name].get(instance_type, 0) + cost
return costs
# Generate weekly report
costs = get_ml_costs(
(datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d'),
datetime.now().strftime('%Y-%m-%d')
)
print("Weekly ML Infrastructure Costs:")
for model, instances in costs.items():
total = sum(instances.values())
print(f"\n{model}: ${total:.2f}")
for instance, cost in instances.items():
print(f" {instance}: ${cost:.2f}")
12. Conclusion
Monitoring ML systems is fundamentally different from monitoring traditional software. The metrics that matter most—model quality, prediction confidence, data drift—are domain-specific and require custom instrumentation.
Key takeaways:
- Layer your observability: Infrastructure → Application → Model
- Alert on symptoms, not causes: Users don’t care if CPU is high, they care if latency is high
- Automate everything: From alerts to scaling to incident response
- Monitor costs: GPU time is expensive, track it like you track errors
In the next section, we go even deeper into GPU-specific observability, exploring DCGM and how to truly understand what’s happening on the silicon.