18.1 Cloud Native Monitoring

Monitoring Machine Learning systems requires a paradigm shift from traditional DevOps monitoring. In standard software, if the HTTP response code is 200 OK and latency is low, the service is “healthy.” In ML systems, a model can return 200 OK with sub-millisecond latency while serving predictions that are complete garbage and cost the business millions.

This section covers the foundational layer: System and Application Monitoring using the native tools provided by the major clouds (AWS and GCP). We will explore the separation of concerns between Infrastructure Monitoring (L1) and Application Monitoring (L2), and how to properly instrument an inference container.


1. The Pyramid of Observability

We can view ML observability as a three-layer stack. You cannot fix L3 if L1 is broken.

  1. Infrastructure (L1): Is the server running? Is the GPU overheating?
    • Metrics: CPU, RAM, Disk I/O, Network I/O, GPU Temperature, GPU Utilization.
    • Tools: CloudWatch, Stackdriver, Node Exporter.
  2. Application (L2): Is the inference server healthy?
    • Metrics: Latency (P50/P99), Throughput (RPS), Error Rate (HTTP 5xx), Queue Depth, Batch Size.
    • Tools: Application Logs, Prometheus Custom Metrics.
  3. Data & Model (L3): Is the math correct?
    • Metrics: Prediction Drift, Feature Skew, Confidence Distribution, Fairness.
    • Tools: SageMaker Model Monitor, Vertex AI Monitoring, Evidently AI. (Covered in 18.3)

2. AWS CloudWatch: The Deep Dive

Amazon CloudWatch is the pervasive observability fabric of AWS. It is often misunderstood as “just a place where logs go,” but it is a powerful metric aggregation engine.

2.1. Metrics, Namespaces, and Dimensions

Understanding the data model is critical to avoiding high costs and confusing dashboards.

  • Namespace: A container for metrics (e.g., AWS/SageMaker or MyApp/Production).
  • Metric Name: The quantity being measured (e.g., ModelLatency).
  • Dimension: Name/Value pairs used to filter the metric (e.g., EndpointName = 'fraud-detector-v1', Variant = 'Production').

The Cardinality Trap: A common MLOps mistake is to include high-cardinality data in dimensions.

  • Bad Idea: Including UserID or RequestID as a dimension.
  • Result: CloudWatch creates a separate metric series for every single user. Your bill will explode, and the dashboard will be unreadable.
  • Rule: Dimensions are for Infrastructure Topology (Region, InstanceType, ModelVersion), not for data content.
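
To make the rule concrete, here is a minimal sketch using boto3's put_metric_data (the namespace, dimension values, and metric value are illustrative placeholders): topology goes into Dimensions, request-level detail stays out.

import boto3

cloudwatch = boto3.client('cloudwatch')

# Good: dimensions describe topology, so the number of metric series stays bounded
cloudwatch.put_metric_data(
    Namespace='MyApp/Production',          # hypothetical namespace
    MetricData=[{
        'MetricName': 'InferenceLatency',
        'Dimensions': [
            {'Name': 'ModelVersion', 'Value': 'v2.1'},
            {'Name': 'InstanceType', 'Value': 'ml.g5.xlarge'},
        ],
        'Value': 42.0,
        'Unit': 'Milliseconds',
    }],
)

# Bad (do NOT do this): adding {'Name': 'UserID', 'Value': user_id} as a dimension
# would create a separate metric series for every user.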

2.2. Embedded Metric Format (EMF)

Emitting custom metrics usually involves an API call (PutMetricData), which is slow (HTTP request) and expensive. EMF allows you to emit metrics as logs. The CloudWatch agent parses the logs asynchronously and creates the metrics for you.

Implementation in Python:

import time

from aws_embedded_metrics import metric_scope

@metric_scope
def inference_handler(event, context, metrics):
    metrics.set_namespace("MLOps/FraudDetection")
    metrics.put_dimensions({"ModelVersion": "v2.1"})
    
    start_time = time.time()
    # ... Run Inference ... (the model call goes here and produces `prediction`)
    latency = (time.time() - start_time) * 1000
    
    probability = prediction[0]
    
    # Emit Metrics
    metrics.put_metric("InferenceLatency", latency, "Milliseconds")
    metrics.put_metric("FraudProbability_Sum", probability, "None")
    
    # Also logs high-cardinality data as properties (Not Dimensions!)
    metrics.set_property("RequestId", context.aws_request_id)
    metrics.set_property("UserId", event['user_id'])
    
    return {"probability": probability}

2.3. Standard SageMaker Metrics

When you deploy a standard SageMaker Endpoint, AWS automatically emits critical metrics: invocation metrics to the AWS/SageMaker namespace and instance metrics to /aws/sagemaker/Endpoints:

  • ModelLatency: Time spent inside your container code (Flask/TorchServe). If high, optimize the model (Chapter 11) or your serving code.
  • OverheadLatency: Time added by AWS (network, auth, queuing). If high (>100 ms) while ModelLatency is low, you have a client-side network issue or you are hitting the TPS limit of the instance type (network saturation).
  • Invocations: Total requests. Sudden drop to zero? Check upstream client health.
  • Invocation5XXErrors: Server-side errors (code crash). Check the logs for stack traces.
  • Invocation4XXErrors: Client-side errors (bad payload). Check whether the client is sending image/png when the model expects application/json.
  • CPUUtilization / MemoryUtilization: Compute health. If memory > 90%, you are at risk of an OOM kill.

2.4. Infrastructure as Code: Alerting (Terraform)

You should define your alerts in code, not in the console.

resource "aws_cloudwatch_metric_alarm" "high_latency" {
  alarm_name          = "High_Latency_Alarm_FraudModel"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "3"
  metric_name         = "ModelLatency"
  namespace           = "AWS/SageMaker"
  period              = "60"
  extended_statistic  = "p99"
  threshold           = "500000" # 500 ms (ModelLatency is reported in microseconds)
  alarm_description   = "This metric monitors endpoint latency"
  
  dimensions = {
    EndpointName = "fraud-detector-prod"
    VariantName  = "AllTraffic"
  }

  alarm_actions = [aws_sns_topic.pagerduty.arn]
}

3. GCP Cloud Monitoring (Stackdriver)

Google Cloud Operations Suite (formerly Stackdriver) integrates deeply with GKE and Vertex AI.

3.1. The Google SRE “Golden Signals”

Google SRE methodology emphasizes four signals that define service health. Every dashboard should be anchored on these.

  1. Latency: The time it takes to service a request.
    • Metric: request_latency_seconds_bucket (Histogram).
    • Visualization: Heatmaps are better than averages.
  2. Traffic: A measure of how much demand is being placed on the system.
    • Metric: requests_per_second.
  3. Errors: The rate of requests that fail.
    • Metric: response_status codes.
    • Crucial: Distinguish between “Explicit” errors (500) and “Implicit” errors (200 OK but the content is empty); see the sketch after this list.
  4. Saturation: How “full” is your service?
    • Metric: GPU Duty Cycle, Memory Usage, or Thread Pool queue depth.
    • Action: Saturation metrics drive Auto-scaling triggers.
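
To make the “implicit error” case measurable, here is a minimal sketch using prometheus_client; the metric and label names are illustrative, not a standard.

from prometheus_client import Counter

# Outcome-labelled request counter: 'implicit_error' marks 200 OK responses
# whose payload is empty or unusable.
REQUEST_OUTCOMES = Counter(
    'inference_request_outcomes_total',
    'Request outcomes split into ok / explicit_error / implicit_error',
    ['outcome']
)

def record_outcome(status_code, payload):
    if status_code >= 500:
        REQUEST_OUTCOMES.labels(outcome='explicit_error').inc()
    elif status_code == 200 and not payload:   # 200 OK but content is empty
        REQUEST_OUTCOMES.labels(outcome='implicit_error').inc()
    else:
        REQUEST_OUTCOMES.labels(outcome='ok').inc()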

3.2. Practical GKE Monitoring: The Sidecar Pattern

Model servers like TensorFlow Serving (TFS) or TorchServe can expose Prometheus-formatted metrics. How do we get them into GCP Monitoring?

  • Pattern: Run a “Prometheus Sidecar” in the same Pod as the inference container.

Kubernetes Deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tf-serving
  template:
    metadata:
      labels:
        app: tf-serving
    spec:
      containers:
      # 1. The Inference Container
      - name: tf-serving
        image: tensorflow/serving:latest-gpu
        ports:
        - containerPort: 8501 # REST
        - containerPort: 8502 # Monitoring
        env:
        - name: MONITORING_CONFIG
          value: "/config/monitoring_config.txt"

      # 2. The Sidecar (OpenTelemetry Collector)
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib:latest
        args: ["--config=/etc/otel-collector-config.yaml"]
        volumeMounts:
        - name: otel-config
          mountPath: /etc/otel-collector-config.yaml
          subPath: config.yaml
      volumes:
      - name: otel-config
        configMap:
          name: otel-collector-config  # ConfigMap holding the config shown below

Sidecar Config (otel-config.yaml):

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'tf-serving'
          scrape_interval: 10s
          static_configs:
            - targets: ['localhost:8502']

exporters:
  googlecloud:
    project: my-gcp-project

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [googlecloud]

3.3. Distributed Tracing (AWS X-Ray / Cloud Trace)

When you have a chain of models (Pipeline), metrics are not enough. You need traces.

  • Scenario: User uploads image -> Preprocessing (Lambda) -> Embedding Model (SageMaker) -> Vector Search (OpenSearch) -> Re-ranking (SageMaker) -> Response.
  • Problem: Total latency is 2s. Who is slow?
  • Solution: Pass a Trace-ID header through every hop.

Python Middleware Example:

from flask import Flask, jsonify
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware

app = Flask(__name__)

# Instruments the Flask app to accept 'X-Amzn-Trace-Id' headers
XRayMiddleware(app, xray_recorder)

@app.route('/predict', methods=['POST'])
def predict():
    # Start a subsegment for the expensive part
    with xray_recorder.in_subsegment('ModelInference'):
        model_output = run_heavy_inference_code()  # placeholder for the actual model call
        
    return jsonify(model_output)

4. Dashboarding Methodology

The goal of a dashboard is to answer questions, not to look pretty.

4.1. The “Morning Coffee” Dashboard

Audience: Managers / Lead Engineers. Scope: High level health.

  1. Global Traffic: Total RPS across all regions.
  2. Global Valid Request Rate: % of 200 OK.
  3. Cost: Estimated daily spend (GPU hours).
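
Dashboards, like alerts, belong in code. A minimal sketch using boto3's put_dashboard; the widget layout, endpoint name, and region are illustrative placeholders.

import json
import boto3

cloudwatch = boto3.client('cloudwatch')

dashboard_body = {
    "widgets": [
        {   # Global traffic
            "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Global Traffic (Invocations)",
                "metrics": [["AWS/SageMaker", "Invocations",
                             "EndpointName", "fraud-detector-prod",
                             "VariantName", "AllTraffic"]],
                "stat": "Sum", "period": 300, "region": "us-east-1",
            },
        },
        {   # Server-side errors
            "type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Server Errors (5XX)",
                "metrics": [["AWS/SageMaker", "Invocation5XXErrors",
                             "EndpointName", "fraud-detector-prod",
                             "VariantName", "AllTraffic"]],
                "stat": "Sum", "period": 300, "region": "us-east-1",
            },
        },
    ]
}

cloudwatch.put_dashboard(
    DashboardName="morning-coffee",
    DashboardBody=json.dumps(dashboard_body),
)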

4.2. The “Debug” Dashboard

Audience: On-call Engineers. Scope: Per-instance granularity.

  1. Latency Heatmap: Visualize the distribution of latency. Can you see a bi-modal distribution? (Fast cache hits vs slow DB lookups).
  2. Memory Leak Tracker: Slope of Memory Usage over 24 hours.
  3. Thread Count: Is the application blocked on I/O?
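
The Memory Leak Tracker above is just a slope estimate. A rough sketch, assuming endpoint instance metrics in the /aws/sagemaker/Endpoints namespace and using numpy for the linear fit:

import boto3
import numpy as np
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

def memory_slope_percent_per_hour(endpoint_name, hours=24):
    """Fit a line through MemoryUtilization and return the slope in % per hour."""
    resp = cloudwatch.get_metric_statistics(
        Namespace='/aws/sagemaker/Endpoints',
        MetricName='MemoryUtilization',
        Dimensions=[{'Name': 'EndpointName', 'Value': endpoint_name},
                    {'Name': 'VariantName', 'Value': 'AllTraffic'}],
        StartTime=datetime.utcnow() - timedelta(hours=hours),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=['Average'],
    )
    points = sorted(resp['Datapoints'], key=lambda p: p['Timestamp'])
    if len(points) < 2:
        return 0.0
    x = [(p['Timestamp'] - points[0]['Timestamp']).total_seconds() / 3600 for p in points]
    y = [p['Average'] for p in points]
    slope, _ = np.polyfit(x, y, 1)
    return float(slope)

# A steadily positive slope over 24 hours is the classic memory-leak signature
print(memory_slope_percent_per_hour('fraud-detector-prod'))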

5. Alerting Strategies: Signals vs. Noise

The goal of alerting is Actionability. If an alert fires and the engineer just deletes the email, that alert is technical debt.

5.1. Symptom-based Alerting

Alert on the symptom (User pain), not the cause.

  • Bad Alert: “CPU usage > 90%”.
    • Why? Maybe the instance is simply using its resources efficiently. If latency is fine, 90% CPU is good ROI.
  • Good Alert: “P99 Latency > 500ms”.
    • Why? The user is suffering. Now the engineer investigates why (maybe it’s CPU, maybe it’s Network).

5.2. Low Throughput Anomaly (The Silent Failure)

What if the system stops receiving requests?

  • A standard “Threshold” alert (Invocations < 10) fails because low traffic is normal at 3 AM.
  • Solution: CloudWatch Anomaly Detection.
    • It uses a Random Cut Forest (ML algorithm) to learn the daily/weekly seasonality of your metric.
    • It creates a dynamic “band” of expected values.
    • Alert: “If Invocations is outside the expected band” (Lower than expected for this time of day).
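
A sketch of wiring this up with boto3 (the endpoint name and SNS topic ARN are placeholders): the alarm watches only the lower edge of the anomaly band, so it fires when traffic is lower than the learned seasonality expects.

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='Invocations_Lower_Than_Expected',
    ComparisonOperator='LessThanLowerThreshold',   # only the lower band edge
    EvaluationPeriods=3,
    ThresholdMetricId='band',
    Metrics=[
        {
            'Id': 'm1',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/SageMaker',
                    'MetricName': 'Invocations',
                    'Dimensions': [
                        {'Name': 'EndpointName', 'Value': 'fraud-detector-prod'},
                        {'Name': 'VariantName', 'Value': 'AllTraffic'},
                    ],
                },
                'Period': 300,
                'Stat': 'Sum',
            },
            'ReturnData': True,
        },
        {
            'Id': 'band',
            'Expression': 'ANOMALY_DETECTION_BAND(m1, 2)',  # band width = 2 std devs
            'ReturnData': True,
        },
    ],
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:pagerduty'],  # placeholder ARN
    TreatMissingData='breaching',  # receiving no data at all should also alert
)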

5.3. Severity Levels

  1. SEV-1 (PagerDuty): The system is down or hurting customers.
    • Examples: Endpoint 5xx rate > 1%, Latency P99 > 2s, OOM Loop.
    • Response: Immediate wake up (24/7).
  2. SEV-2 (Ticket): The system is degrading or showing signs of future failure.
    • Examples: Single GPU failure (redundancy handling it), Disk 80% full, Latency increasing slowly.
    • Response: Fix during business hours.

In the next section, we dig deeper into the specific hardware metrics of the GPU that drive those Saturation signals.


6. Complete Prometheus Setup for ML Services

6.1. Instrumenting a Python Inference Server

# inference_server.py
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import numpy as np

# Define metrics
INFERENCE_COUNTER = Counter(
    'ml_inference_requests_total',
    'Total inference requests',
    ['model_version', 'status']
)

INFERENCE_LATENCY = Histogram(
    'ml_inference_latency_seconds',
    'Inference latency in seconds',
    ['model_version'],
    buckets=[0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

MODEL_CONFIDENCE = Histogram(
    'ml_model_confidence_score',
    'Model prediction confidence',
    ['model_version'],
    buckets=[0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1.0]
)

ACTIVE_REQUESTS = Gauge(
    'ml_active_requests',
    'Number of requests currently being processed'
)

class MLInferenceServer:
    def __init__(self, model, model_version="v1.0"):
        self.model = model
        self.model_version = model_version
        
    def predict(self, input_data):
        ACTIVE_REQUESTS.inc()
        
        try:
            start_time = time.time()
            
            # Run inference
            prediction = self.model.predict(input_data)
            confidence = float(np.max(prediction))
            
            # Record metrics
            latency = time.time() - start_time
            INFERENCE_LATENCY.labels(model_version=self.model_version).observe(latency)
            MODEL_CONFIDENCE.labels(model_version=self.model_version).observe(confidence)
            INFERENCE_COUNTER.labels(
                model_version=self.model_version,
                status='success'
            ).inc()
            
            return {
                'prediction': prediction.tolist(),
                'confidence': confidence,
                'latency_ms': latency * 1000
            }
            
        except Exception as e:
            INFERENCE_COUNTER.labels(
                model_version=self.model_version,
                status='error'
            ).inc()
            raise
            
        finally:
            ACTIVE_REQUESTS.dec()

# Start Prometheus metrics endpoint
if __name__ == "__main__":
    start_http_server(8000)  # Metrics available at :8000/metrics
    print("Prometheus metrics server started on :8000")
    # ... rest of Flask/FastAPI server code
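
To complete the “rest of Flask/FastAPI server code” stub, here is a minimal sketch continuing inference_server.py with a Flask route wired to the instrumented class above; DummyModel is a stand-in so the example runs end to end.

from flask import Flask, request, jsonify

app = Flask(__name__)

class DummyModel:
    """Stand-in model so the sketch is self-contained."""
    def predict(self, input_data):
        return np.array([0.1, 0.9])

server = MLInferenceServer(model=DummyModel(), model_version="v1.0")

@app.route('/predict', methods=['POST'])
def predict():
    payload = request.get_json()
    return jsonify(server.predict(payload['features']))

# app.run(host="0.0.0.0", port=8080) serves predictions on :8080, while the
# Prometheus endpoint started above keeps serving metrics on :8000/metrics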

6.2. Prometheus Scrape Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'ml-prod-us-east-1'

scrape_configs:
  # SageMaker endpoints
  - job_name: 'sagemaker-endpoints'
    static_configs:
      - targets:
        - 'fraud-detector:8080'
        - 'recommendation-engine:8080'
    relabel_configs:
      - source_labels: [__address__]
        target_label: endpoint_name
        regex: '([^:]+):.*'

  # Custom model servers
  - job_name: 'custom-ml-servers'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - ml-production
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: ml-inference-server
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_label_model_version]
        target_label: model_version

  # Node exporter (infrastructure metrics)
  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

# Alert rules
rule_files:
  - '/etc/prometheus/alerts/*.yml'

6.3. Alert Rules

# alerts/ml_service_alerts.yml
groups:
  - name: ml_inference_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighInferenceErrorRate
        expr: |
          sum by (model_version) (rate(ml_inference_requests_total{status="error"}[5m]))
          /
          sum by (model_version) (rate(ml_inference_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.model_version }}"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

      # High latency
      - alert: HighInferenceLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le, model_version) (rate(ml_inference_latency_seconds_bucket[5m]))
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.model_version }}"
          description: "P99 latency is {{ $value }}s"

      # Low confidence predictions
      - alert: LowModelConfidence
        expr: |
          histogram_quantile(0.50,
            sum by (le, model_version) (rate(ml_model_confidence_score_bucket[1h]))
          ) < 0.7
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Model confidence degrading on {{ $labels.model_version }}"
          description: "Median confidence is {{ $value }}, may indicate drift"

      # Service down
      - alert: InferenceServiceDown
        expr: up{job="custom-ml-servers"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Inference service {{ $labels.pod_name }} is down"
          description: "Pod has been down for more than 5 minutes"

7. CloudWatch Insights Query Library

7.1. Finding Slowest Requests

# CloudWatch Logs Insights query
# Find the slowest 10 requests in the last hour
fields @timestamp, requestId, modelLatency, userId
| filter modelLatency > 1000  # More than 1 second
| sort modelLatency desc
| limit 10

7.2. Error Rate by Model Version

fields @timestamp, modelVersion, statusCode
| stats count() as total,
        sum(statusCode >= 500) as errors
        by modelVersion
| fields modelVersion, 
         errors / total * 100 as error_rate_percent
| sort error_rate_percent desc

7.3. Latency Percentiles Over Time

fields @timestamp, modelLatency
| filter modelVersion = "v2.1"
| stats pct(modelLatency, 50) as p50,
        pct(modelLatency, 90) as p90,
        pct(modelLatency, 99) as p99
        by bin(5m)

7.4. Anomaly Detection Query

# Hourly request counts: look for hours that deviate sharply from the norm.
# (Logs Insights cannot compare each bin to the overall mean/stddev in a single
#  query; for automated detection use CloudWatch Anomaly Detection, Section 5.2.)
fields @timestamp
| stats count() as request_count by bin(1h)
| sort request_count asc

8. Defining SLIs and SLOs for ML Systems

8.1. SLI (Service Level Indicators) Examples

  • Availability: sum(successful_requests) / sum(total_requests). Good target: 99.9%.
  • Latency: P99(inference_latency_ms). Good target: < 200 ms.
  • Freshness: now() - last_model_update_timestamp. Good target: < 7 days.
  • Quality: avg(model_confidence). Good target: > 0.85.

8.2. SLO Definition (YAML)

# slo_definitions.yml
apiVersion: monitoring.google.com/v1
kind: ServiceLevelObjective
metadata:
  name: fraud-detector-availability
spec:
  displayName: "Fraud Detector 99.9% Availability"
  serviceLevelIndicator:
    requestBased:
      goodTotalRatio:
        goodServiceFilter: |
          metric.type="custom.googleapis.com/inference/requests"
          metric.label.status="success"
        totalServiceFilter: |
          metric.type="custom.googleapis.com/inference/requests"
  goal: 0.999
  rollingPeriod: 2592000s  # 30 days

8.3. Error Budget Calculation

# error_budget_calculator.py
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class SLO:
    name: str
    target: float  # e.g., 0.999 for 99.9%
    window_days: int

class ErrorBudgetCalculator:
    def __init__(self, slo: SLO):
        self.slo = slo
        
    def calculate_budget(self, total_requests: int, failed_requests: int):
        """
        Calculate remaining error budget.
        """
        # Current availability
        current_availability = (total_requests - failed_requests) / total_requests
        
        # Allowed failures
        allowed_failures = total_requests * (1 - self.slo.target)
        
        # Budget remaining
        budget_remaining = allowed_failures - failed_requests
        budget_percent = (budget_remaining / allowed_failures) * 100
        
        # Time to exhaustion
        failure_rate = failed_requests / total_requests
        if failure_rate > (1 - self.slo.target):
            # Burning budget
            time_to_exhaustion = self.estimate_exhaustion_time(
                budget_remaining,
                failure_rate,
                total_requests
            )
        else:
            time_to_exhaustion = None
        
        return {
            'slo_target': self.slo.target,
            'current_availability': current_availability,
            'budget_remaining': budget_remaining,
            'budget_percent': budget_percent,
            'status': 'healthy' if budget_percent > 10 else 'critical',
            'time_to_exhaustion_hours': time_to_exhaustion
        }
    
    def estimate_exhaustion_time(self, budget_remaining, failure_rate, total_requests):
        # Simplified linear projection
        failures_per_hour = failure_rate * (total_requests / self.slo.window_days / 24)
        return budget_remaining / failures_per_hour if failures_per_hour > 0 else None

# Usage
slo = SLO(name="Fraud Detector", target=0.999, window_days=30)
calculator = ErrorBudgetCalculator(slo)

result = calculator.calculate_budget(
    total_requests=1000000,
    failed_requests=1500
)

print(f"Error budget remaining: {result['budget_percent']:.1f}%")
if result['time_to_exhaustion_hours']:
    print(f"⚠️  Budget will be exhausted in {result['time_to_exhaustion_hours']:.1f} hours!")

9. Incident Response Playbooks

9.1. Runbook: High Latency Incident

Trigger:
  • P99 latency > 500 ms for 10+ minutes
  • Alert: HighInferenceLatency fires

Severity: SEV-2 (degraded service, users experiencing slowness)

Investigation Steps

1. Check if it's a global issue

# CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace AWS/SageMaker \
  --metric-name ModelLatency \
  --dimensions Name=EndpointName,Value=fraud-detector-prod \
  --extended-statistics p99 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300

If all regions are slow → likely a model or infrastructure issue.
If only a single region is slow → a network or regional infrastructure issue.

2. Check for deployment changes

# Check recent deployments in last 2 hours
aws sagemaker list-endpoint-configs \
  --creation-time-after $(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --sort-by CreationTime

Recent deployment? → Potential regression, consider rollback

3. Check instance health

# CPU/Memory utilization (instance metrics live in the /aws/sagemaker/Endpoints namespace)
aws cloudwatch get-metric-statistics \
  --namespace /aws/sagemaker/Endpoints \
  --metric-name CPUUtilization \
  --dimensions Name=EndpointName,Value=fraud-detector-prod Name=VariantName,Value=AllTraffic \
  --statistics Average \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300

CPU > 90%? → Scale out (increase the instance count).
Memory > 85%? → Risk of OOM; check for a memory leak.

4. Check input data characteristics

# Sample recent requests, check input size distribution
import json

import boto3

s3 = boto3.client('s3')

# Download the last captured requests ('bucket' and 'recent_keys' point at
# the data-capture location and are placeholders here)
for key in recent_keys:
    request = json.load(s3.get_object(Bucket=bucket, Key=key)['Body'])
    print(f"Input size: {len(request['features'])} features")

Unusual input sizes? → May indicate upstream data corruption

Mitigation Options

Option A: Scale Out (Increase Instances)

aws sagemaker update-endpoint \
  --endpoint-name fraud-detector-prod \
  --endpoint-config-name fraud-detector-config-scaled

ETA: 5-10 minutes. Risk: Low.

Option B: Rollback to Previous Version

aws sagemaker update-endpoint \
  --endpoint-name fraud-detector-prod \
  --endpoint-config-name fraud-detector-config-v1.9-stable \
  --retain-deployment-config

ETA: 3-5 minutes. Risk: Medium (may reintroduce old bugs).

Option C: Enable Caching

# If latency is due to repeated similar requests
# Add Redis cache in front of SageMaker

ETA: 30 minutes (code deploy). Risk: Medium (cache invalidation complexity).

Post-Incident Review

  • Document root cause
  • Update alerts if false positive
  • Add monitoring for specific failure mode

9.2. Runbook: Model Accuracy Degradation

Trigger:
  • Business metrics show increased fraud escapes
  • Median prediction confidence < 0.7

Investigation

1. Compare recent vs baseline predictions

# Pull samples from production
recent_predictions = get_predictions(hours=24)
baseline_predictions = load_validation_set_predictions()

# Compare distributions
from scipy.stats import ks_2samp
statistic, p_value = ks_2samp(
    recent_predictions['confidence'],
    baseline_predictions['confidence']
)

if p_value < 0.05:
    print("⚠️  Significant distribution shift detected")

2. Check for data drift

→ See Chapter 18.3 for detailed drift analysis

3. Check model version

# Verify correct model is deployed
aws sagemaker describe-endpoint --endpoint-name fraud-detector-prod \
  | jq '.ProductionVariants[0].DeployedImages[0].SpecifiedImage'

Mitigation

  • Trigger retraining pipeline
  • Deploy shadow model with recent data
  • Consider fallback to rule-based system temporarily

10. Monitoring Automation Scripts

10.1. Auto-Scaling Based on Queue Depth

# autoscaler.py
import boto3
import time
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')
sagemaker = boto3.client('sagemaker')

def get_queue_depth(endpoint_name):
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/SageMaker',
        MetricName='OverheadLatency',
        Dimensions=[{'Name': 'EndpointName', 'Value': endpoint_name}],
        StartTime=datetime.utcnow() - timedelta(minutes=5),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Average']
    )
    return response['Datapoints'][0]['Average'] if response['Datapoints'] else 0

def scale_endpoint(endpoint_name, target_instance_count):
    # Get current config
    endpoint = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    current_config = endpoint['EndpointConfigName']
    
    # Create new config with updated instance count
    new_config_name = f"{endpoint_name}-scaled-{int(time.time())}"
    
    # ... create new endpoint config with target_instance_count ...
    
    # Update endpoint
    sagemaker.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=new_config_name
    )
    
    print(f"Scaling {endpoint_name} to {target_instance_count} instances")

def autoscale_loop():
    while True:
        # OverheadLatency (in microseconds) is used here as a proxy for request queuing
        queue_depth = get_queue_depth('fraud-detector-prod')

        # Read the current instance count from the endpoint description
        endpoint = sagemaker.describe_endpoint(EndpointName='fraud-detector-prod')
        current_count = endpoint['ProductionVariants'][0]['CurrentInstanceCount']

        if queue_depth > 100_000:  # > 100 ms of overhead: queue is building up
            scale_endpoint('fraud-detector-prod', current_count + 1)
        elif queue_depth < 10_000 and current_count > 1:  # < 10 ms: under-utilized
            scale_endpoint('fraud-detector-prod', current_count - 1)

        time.sleep(60)  # Check every minute

if __name__ == "__main__":
    autoscale_loop()

10.2. Health Check Daemon

# health_checker.py
import requests
import time
from datetime import datetime

ENDPOINTS = [
    {'name': 'fraud-detector', 'url': 'https://api.company.com/v1/fraud/predict'},
    {'name': 'recommendation', 'url': 'https://api.company.com/v1/recommend'},
]

def health_check(endpoint):
    try:
        start = time.time()
        response = requests.post(
            endpoint['url'],
            json={'dummy': 'data'},
            timeout=5
        )
        latency = (time.time() - start) * 1000
        
        return {
            'endpoint': endpoint['name'],
            'status': 'healthy' if response.status_code == 200 else 'unhealthy',
            'latency_ms': latency,
            'timestamp': datetime.utcnow().isoformat()
        }
    except Exception as e:
        return {
            'endpoint': endpoint['name'],
            'status': 'error',
            'error': str(e),
            'timestamp': datetime.utcnow().isoformat()
        }

def monitor_loop():
    while True:
        for endpoint in ENDPOINTS:
            result = health_check(endpoint)
            
            # Push to monitoring system
            publish_metric(result)
            
            if result['status'] != 'healthy':
                send_alert(result)
        
        time.sleep(30)  # Check every 30 seconds

if __name__ == "__main__":
    monitor_loop()

11. Cost Monitoring for ML Infrastructure

11.1. Cost Attribution by Model

# cost_tracker.py
import boto3
from datetime import datetime, timedelta

ce = boto3.client('ce')  # Cost Explorer

def get_ml_costs(start_date, end_date):
    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': start_date,
            'End': end_date
        },
        Granularity='DAILY',
        Filter={
            'Dimensions': {
                'Key': 'SERVICE',
                'Values': ['Amazon SageMaker']
            }
        },
        Metrics=['UnblendedCost'],
        GroupBy=[
            {'Type': 'TAG', 'Key': 'ModelName'},
            {'Type': 'DIMENSION', 'Key': 'INSTANCE_TYPE'}
        ]
    )
    
    costs = {}
    for result in response['ResultsByTime']:
        date = result['TimePeriod']['Start']
        for group in result['Groups']:
            model_name = group['Keys'][0]
            instance_type = group['Keys'][1]
            cost = float(group['Metrics']['UnblendedCost']['Amount'])
            
            if model_name not in costs:
                costs[model_name] = {}
            costs[model_name][instance_type] = costs[model_name].get(instance_type, 0) + cost
    
    return costs

# Generate weekly report
costs = get_ml_costs(
    (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d'),
    datetime.now().strftime('%Y-%m-%d')
)

print("Weekly ML Infrastructure Costs:")
for model, instances in costs.items():
    total = sum(instances.values())
    print(f"\n{model}: ${total:.2f}")
    for instance, cost in instances.items():
        print(f"  {instance}: ${cost:.2f}")

12. Conclusion

Monitoring ML systems is fundamentally different from monitoring traditional software. The metrics that matter most—model quality, prediction confidence, data drift—are domain-specific and require custom instrumentation.

Key takeaways:

  1. Layer your observability: Infrastructure → Application → Model
  2. Alert on symptoms, not causes: users don’t care whether CPU is high; they care whether latency is high
  3. Automate everything: From alerts to scaling to incident response
  4. Monitor costs: GPU time is expensive, track it like you track errors

In the next section, we go even deeper into GPU-specific observability, exploring DCGM and how to truly understand what’s happening on the silicon.