19.2 Cloud Tools: SageMaker Clarify & Vertex AI Explainable AI
While open-source libraries like shap and lime are excellent for experimentation, running them at scale in production presents significant challenges.
- Compute Cost: Calculating SHAP values for millions of predictions requires massive CPU/GPU resources.
- Latency: In-line explanation calculation can add hundreds of milliseconds to an inference call.
- Governance: Storing explanatory artifacts (e.g., “Why was this loan denied?”) for 7 years for regulatory auditing requires a robust data lifecycle solution, not just a bunch of JSON files on a laptop.
- Bias Monitoring: Explainability is half the battle; Fairness is the other half. Monitoring for disparate impact requires specialized statistical tooling.
The major cloud providers have wrapped these open-source standards into fully managed services: AWS SageMaker Clarify and Google Cloud Vertex AI Explainable AI. This chapter bridges the gap between the algorithms of 19.1 and the infrastructure of the Enterprise.
1. AWS SageMaker Clarify
SageMaker Clarify is a specialized processing container provided by AWS that calculates Bias Metrics and Feature Attribution (SHAP). It integrates deeply with SageMaker Data Wrangler, Model Monitor, and Pipelines.
1.1. The Architecture
Clarify is not a “real-time” service in the same way an endpoint is. It functions primarily as a standardized Processing Job.
- Input:
- Dataset (S3): Your training or inference data.
- Model (SageMaker Model): Ephemeral shadow endpoint.
- Config (Analysis Config): What to calculate.
- Process:
- Clarify spins up the requested instances (e.g., `ml.c5.xlarge`).
- It spins up a "Shadow Model" (a temporary endpoint) serving your model artifact.
- It iterates through your dataset, sending Explainability/Bias requests to the shadow model.
- It computes the statistics.
- Output:
- S3: Analysis results (JSON).
- S3: A generated PDF report.
- SageMaker Studio: Visualization of the report.
1.2. Pre-Training Bias Detection
Before you even train a model, Clarify can analyze your raw data for historical bias.
- Why? Garbage In, Garbage Out. If your hiring dataset is 90% Male, your model will likely learn that "Male" is a feature of "Hired".
Common Metrics (a minimal pandas sketch of two of these follows the list):
- Class Imbalance (CI): Measures if one group is underrepresented.
- Difference in Proportions of Labels (DPL): “Do Men get hired more often than Women in the training set?”
- Kullback-Leibler Divergence (KL): Difference in distribution of outcomes.
- Generalized Entropy (GE): An index of inequality (variant of Theil Index).
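To make these definitions concrete, here is a minimal pandas sketch that computes CI and DPL on a toy hiring dataset. The column names `gender` and `hired` and all numbers are illustrative assumptions, not Clarify API calls:

```python
import pandas as pd

# Toy hiring dataset: 'gender' is the facet, 'hired' is the label (1 = favorable outcome)
df = pd.DataFrame({
    "gender": ["M"] * 90 + ["F"] * 10,
    "hired":  [1] * 60 + [0] * 30 + [1] * 2 + [0] * 8,
})

n_a = (df["gender"] == "M").sum()   # advantaged group count
n_d = (df["gender"] == "F").sum()   # disadvantaged group count
ci = (n_a - n_d) / (n_a + n_d)      # Class Imbalance -> 0.80 (heavy imbalance)

q_a = df.loc[df["gender"] == "M", "hired"].mean()   # positive-label rate for group a
q_d = df.loc[df["gender"] == "F", "hired"].mean()   # positive-label rate for group d
dpl = q_a - q_d                     # Difference in Proportions of Labels -> ~0.47

print(f"CI={ci:.2f}, DPL={dpl:.2f}")
```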
1.3. Post-Training Bias Detection
After training, you check if the model amplified the bias or introduced new ones.
- Disparate Impact (DI): Ratio of acceptance rates. (e.g., If 50% of Men are hired but only 10% of Women, DI = 0.2. A common legal threshold is 0.8).
- Difference in Positive Rates: Statistical difference in outcomes.
1.4. Implementation: Configuring a Clarify Job
Let’s walk through a complete Python SDK implementation for a Credit Risk analysis.
# 1. Setup
import sagemaker
from sagemaker import clarify
from sagemaker import Session

session = Session()
bucket = session.default_bucket()
role = sagemaker.get_execution_role()  # requires the top-level sagemaker import above
# Define where your data lives
train_uri = f"s3://{bucket}/data/train.csv"
model_name = "credit-risk-xgboost-v1"
# 2. Define the Processor
clarify_processor = clarify.SageMakerClarifyProcessor(
role=role,
instance_count=1,
instance_type='ml.c5.xlarge',
sagemaker_session=session
)
# 3. Configure Data Input
# Clarify needs to know which column is the target and where the data is.
data_config = clarify.DataConfig(
s3_data_input_path=train_uri,
s3_output_path=f"s3://{bucket}/clarify-output",
label='Default', # Target column
headers=['Income', 'Age', 'Debt', 'Default'],
dataset_type='text/csv'
)
# 4. Configure Model access
# Clarify will spin up this model to query it.
model_config = clarify.ModelConfig(
model_name=model_name,
instance_type='ml.m5.xlarge',
instance_count=1,
accept_type='text/csv',
content_type='text/csv'
)
# 5. Configure Bias Analysis
# We define the "Sensitive Group" (Facet) that we want to protect.
bias_config = clarify.BiasConfig(
label_values_or_threshold=[0], # 0 = No Default (Good Outcome)
facet_name='Age', # Protected Feature
facet_values_or_threshold=[40], # Group defined as Age < 40 (Young)
group_name='Age_Group' # Optional grouping
)
# 6. Configure Explainability (SHAP)
# We use KernelSHAP (approximation) because it works on any model.
shap_config = clarify.SHAPConfig(
baseline=[[50000, 35, 10000]], # Reference customer (Average)
num_samples=100, # Number of perturbations (higher = slower/more accurate)
agg_method='mean_abs', # How to aggregate global importance
save_local_shap_values=True # Save SHAP for every single row (Heavy!)
)
# 7. Run the Job
clarify_processor.run_bias_and_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
    bias_config=bias_config,
    pre_training_methods="all",    # compute every pre-training bias metric
    post_training_methods="all"    # compute every post-training bias metric
)
# A report ("Credit Risk Fairness Audit" PDF/HTML) is generated automatically alongside the raw JSON output.
1.5. Interpreting the Results
Once the job completes (can take 20-40 minutes), you check S3.
The PDF Report: Clarify generates a surprisingly high-quality PDF. It includes:
- Histograms of label distributions.
- Tables of Bias Metrics (DI, DPL) with Green/Red indicators based on best practices.
- Global SHAP Bar Charts.
SageMaker Studio Integration: If you open the “Experiments” tab in Studio, you can see these charts interactively. You can click on a specific bias metric to see its definition and history over time.
2. GCP Vertex AI Explainable AI
Google Cloud takes a slightly different architectural approach. While AWS emphasizes “Offline Usage” (Batch Processing Jobs), Google emphasizes “Online Usage” (Real-time Explanations).
2.1. Feature Attribution Methods
Vertex AI supports three main algorithms, optimized for their infrastructure (the corresponding parameter blocks are sketched after this list):
- Sampled Shapley: An approximation of SHAP for tabular data.
- Integrated Gradients (IG): For Differentiable models (TensorFlow/PyTorch/Keras).
- XRAI (eXplanation with Ranked Area Integrals): Specifically for Computer Vision. XRAI is better than Grad-CAM or vanilla IG for images because it segments the image into “regions” (superpixels) and attributes importance to regions, not just pixels. This produces much cleaner heatmaps.
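Each method corresponds to a different block in the explanation parameters passed at model upload time. The field names below follow the Vertex AI `ExplanationParameters` schema; the specific counts are illustrative starting points, not recommendations:

```python
# Tabular model: Sampled Shapley (path_count = number of feature permutations sampled)
sampled_shapley_params = {"sampled_shapley_attribution": {"path_count": 10}}

# Differentiable TF/PyTorch model: Integrated Gradients
integrated_gradients_params = {"integrated_gradients_attribution": {"step_count": 50}}

# Image model: XRAI (region-based attribution built on top of IG)
xrai_params = {"xrai_attribution": {"step_count": 50}}
```

Any one of these dicts can be wrapped in `aiplatform.explain.ExplanationParameters(...)`, as shown in the deployment example below.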
2.2. Configuration: The explanation_metadata
Vertex AI requires you to describe your model’s inputs/outputs explicitly in a JSON structure. This is often the hardest part for beginners.
Why?
A TensorFlow model accepts Tensors (e.g., shape [1, 224, 224, 3]).
Humans understand “Image”.
The metadata maps “Tensor Input ‘input_1’” to “Modality: Image”.
/* explanation_metadata.json */
{
"inputs": {
"pixels": {
"input_tensor_name": "input_1:0",
"modality": "image"
}
},
"outputs": {
"probabilities": {
"output_tensor_name": "dense_2/Softmax:0"
}
}
}
2.3. Deployment with Explanations
When you deploy a model to an Endpoint, you enable explanations.
from google.cloud import aiplatform
# 1. Configure Explanation Parameters
# We choose XRAI for an image model
explanation_parameters = aiplatform.explain.ExplanationParameters(
{"xrai_attribution": {"step_count": 50}} # 50 integration steps
)
# 2. Upload Model with Explanation Config
model = aiplatform.Model.upload(
display_name="resnet-classifier",
artifact_uri="gs://my-bucket/model-artifacts",
serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest",
explanation_metadata=aiplatform.explain.ExplanationMetadata(
inputs={"image": {"input_tensor_name": "input_layer"}},
outputs={"scores": {"output_tensor_name": "output_layer"}}
),
explanation_parameters=explanation_parameters
)
# 3. Deploy to Endpoint
endpoint = model.deploy(
machine_type="n1-standard-4"
)
2.4. Asking for an Explanation
Now, instead of just endpoint.predict(), you can call endpoint.explain().
# Client-side code
import base64
with open("test_image.jpg", "rb") as f:
img_bytes = f.read()
b64_img = base64.b64encode(img_bytes).decode("utf-8")
# Request Explanation
response = endpoint.explain(
instances=[{"image": b64_img}]
)
# Parse visual explanation
for explanation in response.explanations:
# Attribution for the predicted class
attributions = explanation.attributions[0]
# The visualization is returned as a base64 encoded image overlay!
b64_visualization = attributions.feature_attributions['image']['b64_jpeg']
print("Baseline Score:", attributions.baseline_output_value)
print("Instance Score:", attributions.instance_output_value)
print("Approximation Error:", attributions.approximation_error)
Key Difference: Google does the visualization server-side for methods like XRAI and returns a usable image overlay. AWS typically gives you raw numbers and expects you to plot them.
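To illustrate the AWS side of that difference, here is a minimal matplotlib sketch that plots raw per-feature SHAP values. The attribution dict is a made-up example; in practice you would load it from the local SHAP values Clarify writes to S3:

```python
import matplotlib.pyplot as plt

# Hypothetical local SHAP values for one prediction (feature -> attribution)
attributions = {"Income": -0.42, "Age": 0.13, "Debt": -0.27}

features = list(attributions)
values = [attributions[f] for f in features]

plt.barh(features, values)
plt.axvline(0, color="black", linewidth=0.8)  # features left of zero pushed the score down
plt.xlabel("SHAP value (impact on model output)")
plt.title("Local explanation for a single prediction")
plt.tight_layout()
plt.savefig("local_explanation.png")
```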
3. Comparison and Architectures
3.1. AWS vs. GCP
| Feature | AWS SageMaker Clarify | GCP Vertex AI Explainable AI |
|---|---|---|
| Primary Mode | Batch (Analysis Jobs) | Online (Real-time API) |
| Setup Difficulty | Medium (Python SDK) | High (Metadata JSON mapping) |
| Methods | SHAP (Kernel), Partial Dependence | Sampled Shapley, IG, XRAI |
| Visualization | Studio (Interactive), PDF Reports | Console (Basic), Client-side needed |
| Bias Detection | Excellent (Many built-in metrics) | Basic |
| Cost | You pay for Processing Instances | You pay for Inference Node utilization |
3.2. Cost Management Strategies
XAI is computationally expensive.
- KernelSHAP: Complexity is $O(\text{Samples} \times \text{Features})$.
- Integrated Gradients: Complexity is $O(\text{Steps} \times \text{Layers})$.
Setting num_samples=1000 instead of 100 makes the job roughly 10x more expensive.
Optimization Tips:
- Downsample Background Data: For KernelSHAP, do not use your full training set as the baseline. Use K-Means to find 20-50 representative cluster centroids.
- Use TreeSHAP: If on AWS, check if `TreeSHAP` is supported for your XGBoost model version. It is 1000x faster than KernelSHAP.
- Lazy Evaluation: Do not explain every prediction in production.
  - Microservice Pattern: Log predictions to S3. Run a nightly Batch Clarify job to explain the "Top 1% Anomalous Predictions" or a random 5% sample.
  - On-Demand: Only call `endpoint.explain()` when a Customer Support agent presses the "Why?" button.
4. Integration with MLOps Pipelines
XAI should not be a manual notebook exercise. It must be a step in your CI/CD pipeline.
4.1. SageMaker Pipelines Integration
You can add a `ClarifyCheckStep` to your training pipeline. If bias exceeds a threshold, the pipeline fails and the model never reaches the Model Registry.
from sagemaker.workflow.clarify_check_step import (
ClarifyCheckStep,
ModelBiasCheckConfig,
ModelPredictedLabelConfig
)
from sagemaker.workflow.check_job_config import CheckJobConfig
from sagemaker.workflow.pipeline import Pipeline
# Define Check Config
check_job_config = CheckJobConfig(
role=role,
instance_count=1,
instance_type='ml.c5.xlarge',
sagemaker_session=session
)
bias_check_step = ClarifyCheckStep(
name="CheckBias",
clarify_check_config=ModelBiasCheckConfig(
data_config=data_config,
data_bias_config=bias_config, # Defined previously
model_config=model_config,
model_predicted_label_config=ModelPredictedLabelConfig(label='Default')
),
check_job_config=check_job_config,
skip_check=False,
register_new_baseline=True # Save this run as the new standard
)
# Add to pipeline
pipeline = Pipeline(
name="FairnessAwarePipeline",
steps=[step_train, step_create_model, bias_check_step, step_register]
)
The Gatekeeper Pattern:
By placing the CheckBias step before RegisterModel, you automagically enforce governance. No biased model can ever reach the Model Registry, and thus no biased model can ever reach Production.
4.2. Vertex AI Pipelines Integration
Vertex Pipelines (based on Kubeflow) treats evaluation similarly.
from google_cloud_pipeline_components.v1.model_evaluation import (
ModelEvaluationClassificationOp
)
# Within a pipeline() definition
eval_task = ModelEvaluationClassificationOp(
project=project_id,
location=region,
target_field_name="Default",
model=training_op.outputs["model"],
batch_predict_gcs_source_uris=[test_data_uri]
)
# Evaluation with XAI is a custom component wrapper around the Batch Explain API
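That wrapper is typically a small custom component that reads the evaluation output and fails the run when a fairness threshold is violated. A minimal KFP v2 sketch under that assumption (the component name, inputs, and 0.8 threshold are illustrative, not a Google-provided component):

```python
from kfp import dsl

@dsl.component(base_image="python:3.10")
def fairness_gate(disparate_impact: float, threshold: float = 0.8):
    """Fail the pipeline run if Disparate Impact violates the four-fifths rule."""
    if disparate_impact < threshold:
        raise RuntimeError(
            f"Fairness gate failed: DI {disparate_impact:.2f} < {threshold}"
        )

# Inside the @dsl.pipeline definition, after an upstream step extracts DI:
# gate = fairness_gate(disparate_impact=extract_di_task.output)
```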
5. Security and IAM
XAI services need deep access. They need to:
- Read raw training data (PII risk).
- Invoke the model (IP risk).
- Write explanations (Business Logic risk).
5.1. AWS IAM Policies
To run Clarify, the Execution Role needs sagemaker:CreateProcessingJob and s3:GetObject.
Least Privilege Example:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:CreateProcessingJob",
"sagemaker:CreateEndpoint",
"sagemaker:DeleteEndpoint",
"sagemaker:InvokeEndpoint"
],
"Resource": [
"arn:aws:sagemaker:us-east-1:1234567890:model/credit-risk-*"
]
},
{
"Effect": "Allow",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::my-secure-bucket/training-data/*"
}
]
}
Note: Clarify creates a shadow endpoint. This means it needs CreateEndpoint permissions. This often surprises security teams (“Why is the analysis job creating endpoints?”). You must explain that this is how Clarify queries the model artifact.
6. Dashboards and Reporting
6.1. SageMaker Model Monitor with Explainability
You can schedule Clarify to run hourly on your inference data (Model Monitor). This creates a longitudinal view of "Feature Attribution Drift" (a minimal drift-check sketch follows the scenario below).
- Scenario:
- Day 1: “Income” is the top driver.
- Day 30: “Zip Code” becomes the top driver.
- Alert: This is usually a sign of Concept Drift or a change in the upstream data pipeline (e.g., Income field is broken/null, so model relies on Zip Code proxy).
- Action: CloudWatch Alarm -> Retrain.
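A minimal sketch of such a drift check, comparing two global attribution vectors. The feature names and values are invented; in practice you would parse them from two Clarify analysis outputs:

```python
import numpy as np

day_1  = {"Income": 0.45, "Debt": 0.30, "Age": 0.15, "ZipCode": 0.10}
day_30 = {"Income": 0.05, "Debt": 0.25, "Age": 0.15, "ZipCode": 0.55}

features = sorted(day_1)                       # fixed feature order for both vectors
v1  = np.array([day_1[f] for f in features])
v30 = np.array([day_30[f] for f in features])

# Cosine similarity between the two global attribution vectors (1.0 = identical mix)
cos_sim = float(v1 @ v30 / (np.linalg.norm(v1) * np.linalg.norm(v30)))
top_feature_changed = max(day_1, key=day_1.get) != max(day_30, key=day_30.get)

if cos_sim < 0.8 or top_feature_changed:
    print(f"Attribution drift detected (cosine={cos_sim:.2f}) -> raise CloudWatch alarm")
```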
6.2. Custom Dashboards (Streamlit)
While Cloud Consoles are nice, stakeholders often need simplified views. You can parse the Clarify JSON output to build a custom Streamlit app.
import streamlit as st
import pandas as pd
import json
import matplotlib.pyplot as plt
st.title("Fairness Audit Dashboard")
# Load Clarify JSON
with open('analysis.json') as f:
audit = json.load(f)
# Display Bias Metrics
st.header("Bias Metrics")
metrics = audit['pre_training_bias_metrics']['facets']['Age'][0]['metrics']
df_metrics = pd.DataFrame(metrics)
st.table(df_metrics)
# Display SHAP
st.header("Global Feature Importance")
shap_data = audit['explanations']['kernel_shap']['global_shap_values']
# Plotting logic...
st.bar_chart(shap_data)
7. Hands-on Lab: Configuring a SHAP Analysis in AWS
Let’s walk through the “Gold Standard” configuration for a regulated industry setup.
7.1. Step 1: The Baseline
We need a reference dataset. We cannot use zero-imputation (Income=0, Age=0 is not a real person). We use K-Means.
import pandas as pd
from sklearn.cluster import KMeans
# Summarize training data
kmeans = KMeans(n_clusters=10, random_state=0).fit(X_train)
baseline_centers = kmeans.cluster_centers_
# Save to CSV for the config
pd.DataFrame(baseline_centers).to_csv("baseline.csv", header=False, index=False)
7.2. Step 2: The Analysis Configuration in JSON
While Python SDK is great, in production (Terraform/CloudFormation), you often pass a JSON config.
{
"dataset_type": "text/csv",
"headers": ["Income", "Age", "Debt"],
"label": "Default",
"methods": {
"shap": {
"baseline": "s3://bucket/baseline.csv",
"num_samples": 500,
"agg_method": "mean_abs",
"use_logit": true,
"save_local_shap_values": true
},
"post_training_bias": {
"bias_metrics": {
"facets": [
{
"name_or_index": "Age",
"value_or_threshold": [40]
}
],
"label_values_or_threshold": [0]
}
}
},
"predictor": {
"model_name": "production-model-v2",
"instance_type": "ml.m5.large",
"initial_instance_count": 1
}
}
7.3. Step 3: Automation via Step Functions
You define an AWS Step Functions state machine with the following flow:
- Train Model (SageMaker Training Job).
- Create Model (Register Artifact).
- Run Clarify (Processing Job).
- Check Metrics (Lambda Function to parse JSON).
  - If `DI < 0.8`: Fail the pipeline.
  - If `DI >= 0.8`: Deploy to Staging.
This “Governance as Code” approach is the ultimate maturity level for MLOps.
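A minimal sketch of the "Check Metrics" Lambda from step 4. The event fields and the JSON path to the DI value are assumptions about your Step Functions state and Clarify output layout; verify them against a real analysis.json before relying on this gate:

```python
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # The Step Functions state passes in where the Clarify job wrote its output
    bucket, key = event["bucket"], event["analysis_key"]   # e.g. ".../analysis.json"
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    analysis = json.loads(body)

    # NOTE: verify this path against your own analysis.json structure
    metrics = analysis["post_training_bias_metrics"]["facets"]["Age"][0]["metrics"]
    di = next(m["value"] for m in metrics if m["name"] == "DI")

    if di < 0.8:
        raise Exception(f"Fairness gate failed: Disparate Impact {di:.2f} < 0.8")
    return {"disparate_impact": di, "status": "PASS"}
```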
8. Summary
- AWS SageMaker Clarify: Best for batched, comprehensive reporting (Fairness + SHAP). Integrates tightly with Pipelines for “Quality Gates.”
- GCP Vertex AI Explainable AI: Best for real-time, on-demand explanations (especially for images/deep learning) via `endpoint.explain()`.
- Cost: These services spin up real compute resources. Use Sampling and Lazy Evaluation to manage budgets.
- Governance: Use these tools to automate the generation of compliance artifacts. Do not rely on data scientists saving PNGs to their laptops.
In the next chapter, we will see how to fix the bugs revealed by these explanations using systematic Debugging techniques.
9. Advanced Configuration & Security
Running XAI on sensitive data (PII/PHI) requires strict security controls. Both AWS and GCP allow you to run these jobs inside secure network perimeters.
9.1. VPC & Network Isolation
By default, Clarify jobs run in a service-managed VPC. For regulated workloads, you must run them in your VPC to ensure data never traverses the public internet.
AWS Configuration:
from sagemaker.network import NetworkConfig

# Clarify accepts the network settings on the processor itself (not on the run call)
network_config = NetworkConfig(
    enable_network_isolation=True,      # No internet access from the job container
    security_group_ids=['sg-12345'],    # Your security group
    subnets=['subnet-abcde']            # Your private subnet
)

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type='ml.c5.xlarge',
    sagemaker_session=session,
    network_config=network_config
)
clarify_processor.run_bias_and_explainability(...)  # data/model/bias/shap configs as before
GCP Configuration: When creating the Endpoint, you peer the Vertex AI network with your VPC.
endpoint = aiplatform.Endpoint.create(
display_name="secure-endpoint",
network="projects/123/global/networks/my-vpc" # VPC Peering
)
9.2. Data Encryption (KMS)
You should never store explanations (which reveal model behavior) in plain text.
AWS KMS Integration:
# Encrypt the analysis output written to S3 with a customer-managed KMS key.
# The key is attached to the processor (output_kms_key), not to the DataConfig.
clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type='ml.c5.xlarge',
    sagemaker_session=session,
    output_kms_key="arn:aws:kms:..."    # KMS key ARN used to encrypt job output
)
data_config = clarify.DataConfig(
    ...,
    s3_output_path="s3://secure-bucket/output"
)
Warning: If you lose the KMS key, you lose the "Why" of your decisions. Ensure your Key Policy allows the `sagemaker.amazonaws.com` principal to `kms:GenerateDataKey` and `kms:Decrypt`.
10. Deep Dive: Bias Metrics Dictionary
Clarify produces an alphabet soup of acronyms. Here is the Rosetta Stone for the most critical ones.
10.1. Pre-Training Metrics (Data Bias)
- Class Imbalance (CI)
  - Question: "Do I have enough samples of the minority group?"
  - Formula: $CI = \frac{n_a - n_d}{n_a + n_d}$, where $n_a$ is the advantaged (favored) group count and $n_d$ the disadvantaged group count.
  - Range: $[-1, 1]$. 0 is perfect balance. Positive values mean the disadvantaged group is underrepresented.
- Difference in Proportions of Labels (DPL)
  - Question: "Does the Training Data simply give more positive labels to Men than Women?"
  - Formula: $DPL = q_a - q_d$, where $q$ is the proportion of positive labels (e.g., "Hired").
  - Range: $[-1, 1]$. 0 is equality. If DPL is high (>0.1), your labels themselves are biased.
- Kullback-Leibler Divergence (KL)
  - Question: "How different are the outcome distributions?"
  - Math: $KL(P_a \| P_d) = \sum_y P_a(y) \log \frac{P_a(y)}{P_d(y)}$.
  - Usage: Good for multi-class problems where simple proportions fail.
10.2. Post-Training Metrics (Model Bias)
- Disparate Impact (DI)
  - Question: "Is the acceptance rate for Women at least 80% of the rate for Men?" (The 'Four-Fifths Rule').
  - Formula: $DI = \frac{q'_d}{q'_a}$ (ratio of predicted positive rates); see the sketch after this list.
  - Range: $[0, \infty)$. 1.0 is equality. $< 0.8$ is often considered disparate impact under US law.
- Difference in Conditional Acceptance (DCA)
  - Question: "Among those who should have been hired (True Positives + False Negatives), did we accept fewer Women?"
  - Nuance: This checks if the model is inaccurate specifically for qualified candidates of the minority group.
- Generalized Entropy (GE)
  - Usage: Measures inequality in the distribution of errors. If the model is 90% accurate for everyone, GE is low. If it is 99% accurate for Men and 81% for Women, GE is high.
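To tie the acronyms back to code, here is a small numpy sketch computing DI and a per-group FNR difference on toy predictions. Clarify computes these for you; this is only to make the formulas concrete (all arrays are invented):

```python
import numpy as np

# 1 = positive outcome (e.g. "hired"); group a = advantaged, group d = disadvantaged
y_true_a = np.array([1, 1, 1, 0, 0, 1, 0, 1]); y_pred_a = np.array([1, 1, 0, 0, 0, 1, 0, 1])
y_true_d = np.array([1, 1, 0, 0, 1, 0, 1, 0]); y_pred_d = np.array([1, 0, 0, 0, 0, 0, 1, 0])

# Disparate Impact: ratio of predicted positive rates (four-fifths rule: flag if < 0.8)
di = y_pred_d.mean() / y_pred_a.mean()

def fnr(y_true, y_pred):
    """False Negative Rate: FN / (FN + TP)."""
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tp = np.sum((y_true == 1) & (y_pred == 1))
    return fn / (fn + tp)

fnr_difference = fnr(y_true_d, y_pred_d) - fnr(y_true_a, y_pred_a)
print(f"DI={di:.2f}, FNR difference={fnr_difference:.2f}")   # DI=0.40, FNR difference=0.30
```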
11. Infrastructure as Code: Terraform
Managing XAI via Python scripts is fine for discovery, but Production means Terraform.
11.1. AWS Step Functions approach
We don't define the "Job" itself in Terraform (it's ephemeral); we define the pipeline and supporting resources that launch it.
# IAM Role for Clarify
resource "aws_iam_role" "clarify_role" {
name = "mlops-clarify-execution-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = { Service = "sagemaker.amazonaws.com" }
}]
})
}
# S3 Bucket for Reports
resource "aws_s3_bucket" "clarify_reports" {
bucket = "company-ml-clarify-reports"
acl = "private"
server_side_encryption_configuration {
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
}
11.2. GCP Vertex AI Metadata Shielding
For GCP, we ensure the Metadata store (where artifacts are tracked) is established.
resource "google_vertex_ai_metadata_store" "main" {
provider = google-beta
name = "default"
description = "Metadata Store for XAI Artifacts"
region = "us-central1"
}
12. Troubleshooting Guide
When your Clarify job fails (and it will), here are the usual suspects.
12.1. “ClientError: DataFrame is empty”
- Symptom: Job dies immediately.
- Cause: The filter you applied in `bias_config` (e.g., `Age < 18`) resulted in zero rows.
- Fix: Check your dataset distribution. Ensure your label/facet values match the data types (Integers vs Strings). A common error is passing `label_values=[0]` (int) when the CSV contains `"0"` (string).
12.2. “Ping Timeout / Model Latency”
- Symptom: Job runs for 10 minutes then fails with a timeout.
- Cause: Calculating SHAP requires thousands of requests. The Shadow Endpoint is overwhelmed.
- Fix:
  - Increase `instance_count` in `model_config` (scale out the shadow model).
  - Decrease `num_samples` in `shap_config` (reduce precision).
  - Check if your model container has a `gunicorn` timeout. Increase it to 60s.
12.3. “Memory Error (OOM)”
- Symptom: Processing container dies with Exit Code 137.
- Cause: `save_local_shap_values=True` on a large dataset tries to hold the entire interaction matrix (N x M) in RAM before writing.
- Fix:
  - Switch to `ml.m5.12xlarge` or memory-optimized instances (`ml.r5`).
  - Shard your input dataset and run multiple Clarify jobs in parallel, then aggregate.
12.4. “Headers mismatch”
- Symptom: “Number of columns in data does not match headers.”
- Cause: SageMaker Clarify expects headless CSVs by default if you provide a headers list, OR it expects the headers to match exactly if `dataset_type` is configured differently.
- Fix: Be explicit. Use `dataset_type='text/csv'` and ensure your S3 file has NO header row if you are passing `headers=[...]` in the config.
13. Future Proofing: Foundation Model Evaluation
As of late 2024, AWS and GCP have extended these tools for LLMs.
13.1. AWS FMEval
AWS introduced the fmeval library (open source, integrated with Clarify) to measure LLM-specific biases:
- Stereotyping: analyzing prompt continuations.
- Toxicity: measuring hate speech generation.
from fmeval.eval_algorithms.toxicity import Toxicity
from fmeval.data_loaders.data_config import DataConfig
config = DataConfig(
dataset_name="my_prompts",
dataset_uri="s3://...",
dataset_mime_type="application/jsonlines",
model_input_location="prompt"
)
eval_algo = Toxicity()
# model_runner is an fmeval ModelRunner wrapping your endpoint (constructed elsewhere)
results = eval_algo.evaluate(model=model_runner, dataset_config=config)
This represents the next frontier: operationalizing the ethics of generating text, rather than just classifying numbers.
14. Case Study: Healthcare Fairness
Let’s look at a real-world application of these tools in a life-critical domain.
The Scenario: A large hospital network builds a model to predict "Patient Readmission Risk" within 30 days. High-risk patients get a follow-up call from a nurse.
The Model: An XGBoost Classifier trained on EMR (Electronic Medical Record) data.
The Concern: Does the model under-prioritize patients from certain zip codes or demographics due to historical inequities in healthcare access?
14.1. The Audit Strategy
The MLOps team sets up a SageMaker Clarify pipeline.
- Facet: `Race/Ethnicity` (derived from EMR).
- Label: `Readmitted` (1) vs `Healthy` (0).
- Metric: False Negative Rate (FNR) Difference.
- Why FNR? A False Negative is the worst case: The patient was high risk, but model said “Healthy”, so they didn’t get a call, and they ended up back in the ER.
- If FNR is higher for Group A than Group B, the model is “failing” Group A more often.
14.2. Implementation
bias_config = clarify.BiasConfig(
label_values_or_threshold=[0], # 0 = "Healthy" (the favorable outcome label)
facet_name='Race',
facet_values_or_threshold=['MinorityGroup'],
group_name='demographics'
)
# Run Analysis focusing on Post-training Bias
clarify_processor.run_bias(
    ...,
    post_training_methods=["DPL", "DI", "FT", "FNR"]
)
14.3. The Findings
The report comes back.
- Disparate Impact (DI): 0.95 (Green). The selection rate is equal. Both groups get calls at the same rate.
- FNR Difference: 8% (Red).
- Interpretation: Even though the model suggests calls at the same rate (DI is fine), it is less accurate for the Minority Group. It misses high-risk patients in that group more often than in the baseline group.
- Root Cause Analysis: Global SHAP shows that `Number_of_Prior_Visits` is the #1 feature.
- Societal Context: The Minority Group historically has less access to primary care, so they have fewer "Prior Visits" recorded in the system. The model interprets "Low Visits" as "Healthy", when it actually means "Underserved".
- Fix: The team switches to Grouped Permutation Importance and engineers a new feature, `Visits_Per_Year_Since_Diagnosis` (sketched below), then retrains the model.
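A hypothetical pandas sketch of that replacement feature, assuming `df` is a pandas DataFrame whose date columns are already parsed as datetimes; the EMR column names (`index_date`, `first_diagnosis_date`, `Number_of_Prior_Visits`) are assumptions about the extract, not a prescription:

```python
# Normalize visit counts by time in the system so that "few visits"
# no longer silently encodes "underserved".
df["Years_Since_Diagnosis"] = (
    (df["index_date"] - df["first_diagnosis_date"]).dt.days / 365.25
).clip(lower=0.5)   # floor to avoid divide-by-near-zero for newly diagnosed patients

df["Visits_Per_Year_Since_Diagnosis"] = (
    df["Number_of_Prior_Visits"] / df["Years_Since_Diagnosis"]
)
```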
15. Case Study: Fintech Reg-Tech
The Scenario: A Neo-bank offers instant micro-loans. Users apply via app.
The Model: A deep learning regression model (TabNet) that predicts `Max_Loan_Amount`.
The Law: The Equal Credit Opportunity Act (ECOA) requires that if you take adverse action (deny or lower limits), you must provide “specific reasons.”
15.1. The Engineering Challenge
The app needs to show the user: “We couldn’t give you $500 because [Reason 1] and [Reason 2].” and this must happen in < 200ms. Batch Clarify is too slow. They move to GCP Vertex AI Online Explanation.
15.2. Architecture
- Model: Hosted on a Vertex AI Endpoint with `machine_type="n1-standard-4"`.
- Explanation: Configured with `SampledShapley` (path count = 10 for speed).
- Client: The mobile app backend calls `endpoint.explain()`.
15.3. Mapping SHAP to “Reg Speak”
The raw API returns detailed feature attributions:
- `income_last_month`: -0.45
- `avg_balance`: +0.12
- `nsf_count` (Non-Sufficient Funds): -0.85
You cannot show “nsf_count: -0.85” to a user. The team builds a Reason Code Mapper:
REASON_CODES = {
"nsf_count": "Recent overdraft activity on your account.",
"income_last_month": "Monthly income level.",
"credit_utilization": "Ratio of credit used across accounts."
}
def generate_rejection_letter(attributions):
# Sort negative features by magnitude
negatives = {k:v for k,v in attributions.items() if v < 0}
top_3 = sorted(negatives, key=negatives.get)[:3]
reasons = [REASON_CODES[f] for f in top_3]
return f"We could not approve your loan due to: {'; '.join(reasons)}"
This maps the mathematical “Why” (XAI) to the regulatory “Why” (Adverse Action Notice).
16. The TCO of Explainability
Explainability is expensive. Let’s break down the Total Cost of Ownership (TCO).
16.1. The KernelSHAP Tax
KernelSHAP complexity is $O(N_{samples} \times K_{features} \times M_{background})$.
- Shadow Mode: Clarify spins up a shadow endpoint. You pay for the underlying instance.
- Inference Volume: For 1 million rows, with `num_samples=100`, you are performing 100 million inferences.
- Cost:
  - Instance: `ml.m5.2xlarge` ($0.46/hr).
  - Throughput: 100 predictions/sec.
  - Time: $100{,}000{,}000 / 100 / 3600 \approx 277$ hours.
  - Job Cost: $277 \times 0.46 \approx \$127$.
Comparison: Finding the bias in your dataset is cheap (< $5). Calculating SHAP for every single row is expensive (> $100).
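A back-of-the-envelope calculator for the numbers above; the throughput and hourly price are assumptions you should replace with your own benchmarks, and it ignores the separate processing-instance cost:

```python
def estimate_kernel_shap_cost(rows, num_samples, preds_per_sec=100, price_per_hour=0.46):
    """Rough cost of a Clarify SHAP run: every row triggers num_samples shadow-endpoint calls."""
    total_predictions = rows * num_samples
    hours = total_predictions / preds_per_sec / 3600
    return hours, hours * price_per_hour

hours, dollars = estimate_kernel_shap_cost(rows=1_000_000, num_samples=100)
print(f"~{hours:.0f} hours, ~${dollars:.0f}")   # ~278 hours, ~$128
```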
16.2. Cost Optimization Calculator Table
| Strategy | Accuracy | Speed | Cost (Relative) | Use Case |
|---|---|---|---|---|
| Full KernelSHAP | High | Slow | $$$$$ | Regulatory Audits (Annual) |
| Sampled KernelSHAP | Med | Med | $$ | Monthly Monitoring |
| TreeSHAP | High | Fast | $ | Interactive Dashboards |
| Partial Dependence | Low | Fast | $ | Global Trend Analysis |
16.3. The “Lazy Evaluation” Pattern
The most cost-effective architecture is Sampling. Instead of explaining 100% of traffic:
- Explain all Errors (False Positives/Negatives).
- Explain all Outliers (High Anomaly Score).
- Explain a random 1% Sample of the rest.
This reduces compute cost by 95% while catching the most important drift signals.
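A minimal sketch of that routing logic as it might run in the prediction-logging path (the field names and thresholds are illustrative):

```python
import random

def should_explain(record, anomaly_threshold=0.99, sample_rate=0.01):
    """Decide whether a logged prediction gets queued for the nightly explanation job."""
    if record.get("ground_truth") is not None and record["prediction"] != record["ground_truth"]:
        return True                                            # known error once labels arrive
    if record.get("anomaly_score", 0.0) >= anomaly_threshold:
        return True                                            # outlier
    return random.random() < sample_rate                       # random background sample

# Records flagged True are written to the S3 prefix that the nightly Clarify job reads.
```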
17. Architecture Cheat Sheet
AWS SageMaker Clarify Reference
- Job Type: Processing Job (Containerized).
- Input: S3 (CSV/JSON/Parquet).
- Compute: Ephemeral cluster (managed).
- Artifacts: `analysis_config.json`, `report.pdf`.
- Key SDK: `sagemaker.clarify`.
- Key IAM: `sagemaker:CreateProcessingJob`, `sagemaker:CreateEndpoint`.
GCP Vertex AI XAI Reference
- Job Type: Online (API) or Batch Prediction.
- Input: Tensor (Online) or BigQuery/GCS (Batch).
- Compute: Attached to Endpoint nodes.
- Artifacts: `explanation_metadata.json`.
- Key SDK: `google.cloud.aiplatform`.
- Key IAM: `aiplatform.endpoints.explain`.
18. Final Summary
Cloud XAI tools remove the “infrastructure heavy lifting” of explainability.
- Use Bias Detection (Clarify) to protect your company from reputational risk before you ship.
- Use Online Explanations (Vertex AI) to build trust features into your user-facing apps.
- Use Governance workflows (Pipelines) to ensure no model reaches production without a signed Fairness Audit.
The era of “The algorithm did it” is over. With these tools, you are now accountable for exactly what the algorithm did, and why.
19. Frequently Asked Questions (FAQ)
Q: Can I use Clarify for Computer Vision?
A: Yes, SageMaker Clarify recently added support for Computer Vision. It can explain Object Detection and Image Classification models by aggregating pixel Shapley values into superpixels (similar to XRAI). You must provide the data in application/x-image format.
Q: Does Vertex AI support custom containers?
A: Yes. As long as your container exposes a Health route and a Predict route, you can wrap it. However, for Explainability, you must adhere to the explanation_metadata.json contract strictly so Vertex knows which tensors are inputs and outputs.
Q: Is “Bias” the same as “Fairness”? A: No. Bias is a statistical property (e.g., “The training set has 90% Men”). Fairness is a social/ethical definition (e.g., “Hiring decisions should not depend on Gender”). Clarify measures Bias; humans decide if the result is Unfair.
Q: Can I run this locally?
A: You can run shap locally. You cannot run “Clarify” locally (it’s a managed container service). You can, however, pull the generic Clarify docker image from ECR to your laptop for testing, but you lose the managed IAM/S3 integration.
Q: Does this work for LLMs?
A: Yes, keeping in mind the tokens vs words distinction. AWS fmeval is the preferred tool for LLMs over standard Clarify.
20. Migration Guide: From Laptop to Cloud
How do you take the shap code from your notebook (Chapter 19.1) and deploy it to a Clarify job (Chapter 19.2)?
Step 1: Externalize the Baseline
- Laptop: `explainer = shap.Explainer(model, X_train)` (data in RAM).
- Cloud: Save `X_train` (or a K-Means summary) to `s3://bucket/baseline.csv`.
Step 2: Formalize the Config
- Laptop: You tweak parameters in the cell.
- Cloud: You define a static JSON/Python dict config. This forces you to decide on `num_samples` and `agg_method` explicitly and commit them to git.
Step 3: Decouple the Model
- Laptop: Model object is in memory.
- Cloud: Model must be a serialized artifact (`model.tar.gz`) stored in S3 and registered in the SageMaker Model Registry. This ensures reproducibility.
Step 4: Automate the Trigger
- Laptop: You run the cell manually.
- Cloud: You add a `ClarifyCheckStep` to your Pipeline. Now the analysis runs automatically every time the model is retrained.
21. Glossary of Cloud XAI Terms
- Analysis Config: The JSON definition telling SageMaker Clarify what to compute (Bias method, SHAP config).
- Facet: In AWS terminology, the Protected Attribute (e.g., `Age`, `Gender`).
- Shadow Endpoint: An ephemeral inference server spun up by Clarify solely for the purpose of being queried by the explainer perturbation engine. It is deleted immediately after the job.
- Explanation Metadata: In GCP, the JSON file that maps the raw tensors of a TensorFlow/PyTorch model to human-readable concepts like “Image” or “Text”.
- Instance Output Value: The raw prediction score returned by the model for a specific instance, which SHAP decomposes.
- Baseline (Reference): The “background” dataset against which the current instance is compared. For images, often a black image. For tabular, the average customer.
22. Further Reading & Whitepapers
1. “Amazon SageMaker Clarify: Model Explainability and Bias Detection”
- AWS Technical Paper: Deep dive into the container architecture and the specific implementation of KernelSHAP used by AWS.
2. “AI Explanations (AIX) Whitepaper”
- Google Cloud: Explains the math behind Sampled Shapley and Integrated Gradients as implemented in Vertex AI.
3. “Model Cards for Model Reporting”
- Mitchell et al. (2019): The paper that inspired the “Model Card” feature in both clouds—a documentation standard for transparent reporting of model limitations.
4. “NIST AI Risk Management Framework (AI RMF 1.0)”
- NIST (2023): The US government standard for AI safety. Clarify and Vertex AI are designed to help organizations meet the “Map”, “Measure”, and “Manage” functions of this framework.
Final Checklist for Production
- Baseline Defined: Do not use zero-imputation. Use K-Means.
- IAM Secured: Least privilege access to raw training data.
- Costs Estimated: Calculate estimated compute hours before running on 10M rows.
- Pipeline Integrated: Make it a blocking gate in CI/CD.
- Legal Reviewed: Have legal counsel review the definition of “Bias” (e.g., 80% rule) for your specific jurisdiction.
23. Complete Terraform Module Reference
For the DevOps engineers, here is a reusable Terraform module structure for deploying a standard Explainability stack on AWS.
# modules/clarify_stack/main.tf
variable "project_name" {
type = string
}
variable "vpc_id" {
type = string
}
# 1. The Secure Bucket
resource "aws_s3_bucket" "clarify_bucket" {
bucket = "${var.project_name}-clarify-artifacts"
acl = "private"
versioning {
enabled = true
}
server_side_encryption_configuration {
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
}
# 2. The Execution Role
resource "aws_iam_role" "clarify_exec" {
name = "${var.project_name}-clarify-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = { Service = "sagemaker.amazonaws.com" }
}]
})
}
# 3. Security Group for Network Isolation
resource "aws_security_group" "clarify_sg" {
name = "${var.project_name}-clarify-sg"
description = "Security group for Clarify processing jobs"
vpc_id = var.vpc_id
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["10.0.0.0/8"] # Allow internal traffic only (No Internet)
}
}
# 4. Outputs
output "role_arn" {
value = aws_iam_role.clarify_exec.arn
}
output "bucket_name" {
value = aws_s3_bucket.clarify_bucket.id
}
output "security_group_id" {
value = aws_security_group.clarify_sg.id
}
Usage:
module "risk_model_xai" {
source = "./modules/clarify_stack"
project_name = "credit-risk-v1"
vpc_id = "vpc-123456"
}
This ensures that every new model project gets a standardized, secure foundation for its explainability work.