32.2. Governance Tools: SageMaker vs. Vertex AI

Note

Executive Summary: Governance is not achieved by spreadsheets; it is achieved by platform-native tooling. This section provides a deep comparative analysis of AWS SageMaker Governance tools and GCP Vertex AI Metadata. We explore how to automate the collection of governance artifacts using these managed services.

Governance tools in the cloud have evolved from basic logging to sophisticated “Control Planes” that track the entire lifecycle of a model. These tools answer the three critical questions of MLOps Governance:

  1. Who did it? (Identity & Access)
  2. What did they do? (Lineage & Metadata)
  3. How does it behave? (Performance & Quality)

32.2.1. AWS SageMaker Governance Ecosystem

AWS has introduced a suite of specific tools under the “SageMaker Governance” umbrella.

1. SageMaker Role Manager

Standard IAM roles for Data Science are notoriously difficult to scope correctly. AdministratorAccess is too broad; specific S3 bucket policies are too tedious to maintain manually for every project.

SageMaker Role Manager creates distinct personas:

  • Data Scientist: Can access Studio, run experiments, but cannot deploy to production.
  • MLOps Engineer: Can build pipelines, manage registries, and deploy endpoints.
  • Compute Worker: The machine execution role (assumed by EC2 instances and Training Jobs).

Terraform Implementation: Instead of crafting JSON policies, use the Governance constructs.

# Example: Creating a Data Scientist Persona Role
resource "aws_sagemaker_servicecatalog_portfolio_status" "governance_portfolio" {
  status = "Enabled"
}

# Note: Role Manager is often configured via Console/API initially or custom IAM modules
# A typical sophisticated IAM policy for a Data Scientist restricts them to specific VPCs
resource "aws_iam_role" "data_scientist_role" {
  name               = "SageMakerDataScientistRole"
  assume_role_policy = data.aws_iam_policy_document.sagemaker_assume_role.json
}

resource "aws_iam_policy" "strict_s3_access" {
  name = "StrictS3AccessForDS"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = ["s3:GetObject", "s3:PutObject"]
        Effect = "Allow"
        Resource = [
          "arn:aws:s3:::my-datalake-clean/*",
          "arn:aws:s3:::my-artifact-bucket/*"
        ]
      },
      {
        # Governance: explicitly deny access to raw PII buckets
        Action = "s3:*"
        Effect = "Deny"
        Resource = "arn:aws:s3:::my-datalake-pii/*"
      }
    ]
  })
}

2. SageMaker Model Cards

Model Cards are “nutrition labels” for models. In AWS, these are structured JSON objects that can be versioned and PDF-exported.

  • Intended Use: What is this model for?
  • Risk Rating: High/Medium/Low.
  • Training Details: Hyperparameters, datasets, training job ARNs.
  • Evaluation Observations: Accuracy metrics, bias reports from Clarify.

Automation via Python SDK: Do not ask Data Scientists to fill these out manually. Auto-populate them from the training pipeline.

from sagemaker.model_card import (
    ModelCard,
    ModelOverview,
    IntendedUses,
    TrainingDetails,
    ModelPackage,
    EvaluationDetails,
)

def create_automated_card(model_name, s3_output_path, metrics_dict):
    # s3_output_path and metrics_dict would populate TrainingDetails and
    # EvaluationDetails in a complete implementation (omitted here for brevity).
    # 1. Define Overview
    overview = ModelOverview(
        model_name=model_name,
        model_description="Credit Risk XGBoost Model V2",
        problem_type="Binary Classification",
        algorithm_type="XGBoost",
        model_creator="Risk Team",
        model_owner="Chief Risk Officer"
    )

    # 2. Define Intended Use (Critical for EU AI Act)
    intended_uses = IntendedUses(
        purpose_of_model="Assess loan applicant default probability.",
        intended_uses="Automated approval for loans < $50k. Human review for > $50k.",
        factored_into_decision="Yes, combined with FICO score.",
        risk_rating="High"
    )

    # 3. Create Card
    card = ModelCard(
        name=f"{model_name}-card",
        status="PendingReview",
        model_overview=overview,
        intended_uses=intended_uses,
        # Link to the actual Model Registry Package
        model_package_details=ModelPackage(
             model_package_arn="arn:aws:sagemaker:us-east-1:123456789012:model-package/credit-risk/1"
        )
    )
    
    # 4. Save
    card.create()
    print(f"Model Card {card.name} created. Status: PendingReview")

# This script runs inside the SageMaker Pipeline "RegisterModel" step.

3. SageMaker Model Dashboard

This gives a “Single Pane of Glass” view.

  • Drift Status: Is data drift detected? (Integrated with Model Monitor).
  • Quality Status: Is accuracy degrading?
  • Compliance Status: Does it have a Model Card?

Operational Usage: The Model Dashboard is the primary screen for the Model Risk Officer. They can verify that every model currently serving traffic in prod has green checks for “Card” and “Monitor”.
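The “green checks” logic above can also run as a CI gate rather than only as a console view. A minimal sketch of that check, assuming illustrative record fields (`has_model_card`, `monitor_enabled`, `drift_detected`) rather than the exact Boto3 response shape:

```python
# Sketch: the Model Dashboard's traffic-light logic as a CI gate.
# The record fields below are assumptions standing in for data you would
# pull via boto3 (e.g. list_model_cards, list_monitoring_schedules).
from typing import Dict, List


def dashboard_status(models: List[Dict]) -> Dict[str, str]:
    """Return a per-model status: 'green' only if a Model Card exists and
    monitoring is enabled with no active drift alert."""
    status = {}
    for m in models:
        if not m.get("has_model_card"):
            status[m["name"]] = "red: missing model card"
        elif not m.get("monitor_enabled"):
            status[m["name"]] = "red: no model monitor schedule"
        elif m.get("drift_detected"):
            status[m["name"]] = "yellow: drift alert active"
        else:
            status[m["name"]] = "green"
    return status


if __name__ == "__main__":
    prod_models = [
        {"name": "credit-risk-v4", "has_model_card": True,
         "monitor_enabled": True, "drift_detected": False},
        {"name": "churn-v2", "has_model_card": False,
         "monitor_enabled": True, "drift_detected": False},
    ]
    print(dashboard_status(prod_models))
```

A CI job can fail the build whenever any production model is not “green”, turning the dashboard from a passive view into an enforcement point.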

32.2.2. GCP Vertex AI Metadata & Governance

Google Cloud takes a lineage-first approach built on ML Metadata (MLMD), the open-source metadata library that also powers TensorFlow Extended (TFX).

1. Vertex AI Metadata (The Graph)

Everything in Vertex AI is a node in a directed graph.

  • Artifacts: Datasets, Models, Metrics (Files).
  • Executions: Training Jobs, Preprocessing Steps (Runs).
  • Contexts: Experiments, Pipelines (Groupings).

This graph is automatically built if you use Vertex AI Pipelines. You can query it to answer: “Which dataset version trained Model X?”

Querying Lineage Programmatically:

from google.cloud import aiplatform

def trace_model_lineage(model_resource_name):
    aiplatform.init(project="my-project", location="us-central1")
    
    # Get the Artifact representing the model
    # Note: You usually look this up by URI or tag
    model_artifact = aiplatform.Artifact.get(resource_name=model_resource_name)
    
    print(f"Tracing lineage for: {model_artifact.display_name}")
    
    # Get the Execution that produced this artifact
    executions = model_artifact.get_executions(direction="upstream")
    
    for exc in executions:
        print(f"Produced by Execution: {exc.display_name} (Type: {exc.schema_title})")
        
        # Who fed into this execution?
        inputs = exc.get_artifacts(direction="upstream")
        for inp in inputs:
            print(f"  <- Input Artifact: {inp.display_name} (Type: {inp.schema_title})")
            if "Dataset" in inp.schema_title:
                print(f"     [DATA FOUND]: {inp.uri}")

# Output:
# Tracing lineage for: fraud-model-v5
# Produced by Execution: training-job-xgboost-83jd9 (Type: system.Run)
#   <- Input Artifact: cleansed-data-v5 (Type: system.Dataset)
#      [DATA FOUND]: gs://my-bucket/processed/2023-10/train.csv

2. Vertex AI Model Registry

Similar to AWS, but tightly integrated with the Metadata store.

  • Versioning: v1, v2, v3…
  • Aliasing: default, challenger, production.
  • Evaluation Notes: You can attach arbitrary functional performance metrics to the registry entry.
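The alias semantics above (aliases like `default` or `production` point at exactly one version, and promotion is an atomic re-point) can be made concrete with a small local mock. This is not the Vertex SDK, just an illustrative model of the behavior:

```python
# Illustrative mock of Vertex-style registry aliasing (not the Vertex SDK).
# An alias maps to exactly one version; promoting a challenger is simply
# re-pointing the alias, so callers that resolve "production" never see
# a half-promoted state.
class ModelRegistryMock:
    def __init__(self):
        self.versions = []   # e.g. ["v1", "v2"]
        self.aliases = {}    # alias -> version

    def register(self, version: str):
        self.versions.append(version)

    def set_alias(self, alias: str, version: str):
        if version not in self.versions:
            raise ValueError(f"unknown version {version}")
        self.aliases[alias] = version  # re-pointing IS the promotion step

    def resolve(self, ref: str) -> str:
        # A raw version string resolves to itself.
        return self.aliases.get(ref, ref)


reg = ModelRegistryMock()
reg.register("v1")
reg.register("v2")
reg.set_alias("production", "v1")
reg.set_alias("challenger", "v2")
reg.set_alias("production", "v2")   # promote the challenger
print(reg.resolve("production"))    # -> v2
```

The governance benefit: deployment configs reference the stable alias, while the registry records which concrete version the alias pointed to at any moment.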

3. Governance Policy Enforcement (Org Policy)

GCP allows you to set Organization Policies that restrict AI usage at the resource level.

  • Constraint: constraints/aiplatform.restrictVpcPeering (Ensure models only deploy to private VPCs).
  • Constraint: constraints/gcp.resourceLocations (Ensure data/models stay in europe-west3 for GDPR).

32.2.3. Comparing the Approaches

| Feature | AWS SageMaker Governance | GCP Vertex AI Governance |
| --- | --- | --- |
| Philosophy | Document-Centric: Focus on Model Cards, PDF exports, and Review Workflows. | Graph-Centric: Focus on immutable lineage, metadata tracking, and graph queries. |
| Model Cards | First-class citizen. Structured schema. Good UI support. | Supported via Model Registry metadata, but less “form-based” out of the box. |
| Lineage | Provenance provided via SageMaker Experiments and Pipelines. | Deep integration via ML Metadata (MLMD). Standardized TFX schemas. |
| Access Control | Role Manager simplifies IAM. Granular Service Control Policies (SCPs). | IAM + VPC Service Controls. Org Policies for location/resource constraints. |
| Best For… | Highly regulated industries (Finance/Health) needing formal “documents” for auditors. | Engineering-heavy teams needing deep automated traceability and debugging. |

32.2.4. Governance Dashboard Architecture

You ultimately need a custom dashboard that aggregates data from these cloud tools for your C-suite. Do not force the CEO to log into the AWS Console.

The “Unified Governance” Dashboard: Build a lightweight internal web app (Streamlit/Backstage) that pulls data from the AWS/GCP APIs.

Key Metrics to Display:

  1. Deployment Velocity: Deployments per week.
  2. Governance Debt: % of Production Models missing a Model Card.
  3. Risk Exposure: breakdown of models by Risk Level (High/Med/Low).
  4. Incident Rate: % of inference requests resulting in 5xx errors or fallback.

# Streamlit Dashboard Snippet (Hypothetical)
import streamlit as st
import pandas as pd

st.title("Enterprise AI Governance Portal")

# Mock data - in reality, query Boto3/Vertex SDK
data = [
    {"Model": "CreditScore", "Version": "v4", "Risk": "High", "Card": "✅", "Bias_Check": "✅", "Status": "Prod"},
    {"Model": "ChatBot", "Version": "v12", "Risk": "Low", "Card": "✅", "Bias_Check": "N/A", "Status": "Prod"},
    {"Model": "FraudDetect", "Version": "v2", "Risk": "High", "Card": "❌", "Bias_Check": "❌", "Status": "Staging"}
]
df = pd.DataFrame(data)

st.dataframe(df.style.applymap(lambda v: 'color: red;' if v == '❌' else None))

st.metric("Governance Score", "85%", "-5%")
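Two of the key metrics above, Governance Debt and Risk Exposure, reduce to simple aggregations over the model inventory. A hedged sketch, assuming illustrative record fields (`status`, `card`, `risk`) rather than a real API response:

```python
# Sketch: computing "Governance Debt" and "Risk Exposure" from an inventory
# of model records. Field names are assumptions; in practice you would build
# this list from Boto3 / Vertex SDK calls.
def governance_metrics(models):
    prod = [m for m in models if m["status"] == "Prod"]
    missing_card = [m for m in prod if not m["card"]]
    by_risk = {}
    for m in models:
        by_risk[m["risk"]] = by_risk.get(m["risk"], 0) + 1
    return {
        # % of production models with no Model Card
        "governance_debt_pct": round(100 * len(missing_card) / max(len(prod), 1), 1),
        # model count per risk level, across all environments
        "risk_breakdown": by_risk,
    }


inventory = [
    {"model": "CreditScore", "risk": "High", "card": True,  "status": "Prod"},
    {"model": "ChatBot",     "risk": "Low",  "card": True,  "status": "Prod"},
    {"model": "FraudDetect", "risk": "High", "card": False, "status": "Staging"},
]
print(governance_metrics(inventory))
```

This function is the natural seam between the cloud SDK calls (which fetch the raw inventory) and the Streamlit frontend (which only renders the summary).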

32.2.5. Conclusion

Tooling is the enforcement arm of policy.

  • Use SageMaker Model Cards to satisfy the documentation requirements of the EU AI Act.
  • Use Vertex AI Metadata to satisfy the data lineage requirements of NIST RMF.
  • Automate the creation of these artifacts in your CI/CD pipeline; relying on humans to fill out forms is a governance failure mode.

32.2.6. Deep Dive: Global Tagging Taxonomy for Governance

Governance starts with metadata. If your resources are not tagged, you cannot govern them. You must enforce a Standard Tagging Policy via AWS Organizations (SCPs and Tag Policies) or the GCP Organization Policy Service.

The Foundation Tags

Every cloud resource (S3 Bucket, SageMaker Endpoint, ECR Repo) MUST have these tags:

| Tag Key | Example Values | Purpose |
| --- | --- | --- |
| gov:data_classification | public, internal, confidential, restricted | Determines security controls (e.g., encryption, public access). |
| gov:owner | team-risk, team-marketing | Who to page when it breaks. |
| gov:environment | dev, staging, prod | Controls release promotion gates. |
| gov:cost_center | cc-12345 | Chargeback. |
| gov:compliance_scope | pci, hipaa, sox, none | Triggers specific audit logging rules. |

Terraform Implementation of Tag Enforcement:

# Standardize tags in a local variable
locals {
  common_tags = {
    "gov:owner" = "team-mlops"
    "gov:environment" = var.environment
    "gov:iac_repo" = "github.com/org/infra-ml"
  }
}

resource "aws_sagemaker_model" "example" {
  name = "my-model"
  execution_role_arn = aws_iam_role.example.arn
  
  tags = merge(local.common_tags, {
    "gov:data_classification" = "confidential"
    "gov:model_version" = "v1.2"
  })
}
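Beyond Terraform, the taxonomy can be enforced at review time. A minimal sketch of a tag-compliance check a CI pipeline could run against a resource's tag dictionary (the required keys mirror the table above; the function itself is illustrative, not an AWS API):

```python
# Sketch: CI-time validation of the governance tag taxonomy.
# Required keys and allowed values mirror the taxonomy table;
# a value set of None means "any non-empty value is acceptable".
REQUIRED_TAGS = {
    "gov:data_classification": {"public", "internal", "confidential", "restricted"},
    "gov:owner": None,
    "gov:environment": {"dev", "staging", "prod"},
    "gov:cost_center": None,
    "gov:compliance_scope": {"pci", "hipaa", "sox", "none"},
}


def tag_violations(tags: dict) -> list:
    """Return a list of human-readable violations; empty list == compliant."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key)
        if not value:
            problems.append(f"missing {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value '{value}' for {key}")
    return problems
```

Running this in CI (e.g. against `terraform plan` output) catches taxonomy drift before the resource ever exists, which is cheaper than remediating it after an audit.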

32.2.7. Building a Custom Governance Dashboard (The Frontend)

While cloud consoles are great for engineers, the risk committee needs a simplified view. Here is a blueprint for a React/TypeScript dashboard that consumes your Metadata Store.

GovernanceCard Component:

import React from 'react';
import { Card, Badge, Table } from 'antd';

interface ModelGovernanceProps {
  modelName: string;
  riskLevel: 'High' | 'Medium' | 'Low';
  complianceChecks: {
    biasParams: boolean;
    loadTest: boolean;
    humanReview: boolean;
  };
}

export const GovernanceCard: React.FC<ModelGovernanceProps> = ({ modelName, riskLevel, complianceChecks }) => {
  const isCompliant = Object.values(complianceChecks).every(v => v);

  return (
    <Card 
      title={modelName} 
      extra={isCompliant ? <Badge status="success" text="Compliant" /> : <Badge status="error" text="Violation" />}
      style={{ width: 400, margin: 20 }}
    >
      <p>Risk Level: <b style={{ color: riskLevel === 'High' ? 'red' : 'green' }}>{riskLevel}</b></p>
      
      <Table 
        dataSource={[
          { key: '1', check: 'Bias Parameters Validated', status: complianceChecks.biasParams },
          { key: '2', check: 'Load Test Passed', status: complianceChecks.loadTest },
          { key: '3', check: 'Human Review Sign-off', status: complianceChecks.humanReview },
        ]}
        columns={[
          { title: 'Control', dataIndex: 'check', key: 'check' },
          { 
            title: 'Status', 
            dataIndex: 'status', 
            key: 'status',
            render: (passed: boolean) => passed ? '✅' : '❌' 
          }
        ]}
        pagination={false}
        size="small"
      />
    </Card>
  );
};

Backend API (FastAPI) to feed the Dashboard: You need an API that queries AWS/GCP and aggregates the status.

# main.py
from fastapi import FastAPI
import boto3

app = FastAPI()
sm_client = boto3.client('sagemaker')

@app.get("/api/governance/models")
def get_governance_data():
    # 1. List all models
    models = sm_client.list_models()['Models']
    results = []
    
    for m in models:
        name = m['ModelName']
        tags = sm_client.list_tags(ResourceArn=m['ModelArn'])['Tags']
        
        # Parse tags into dict
        tag_dict = {t['Key']: t['Value'] for t in tags}
        
        # Check compliance logic
        compliance = {
            "biasParams": "bias-check-complete" in tag_dict,
            "loadTest": "load-test-passed" in tag_dict,
            "humanReview": "approved-by" in tag_dict
        }
        
        results.append({
            "modelName": name,
            "riskLevel": tag_dict.get("gov:risk_level", "Unknown"),
            "complianceChecks": compliance
        })
        
    return results

32.2.8. Advanced Vertex AI Metadata: The gRPC Store

Under the hood, Vertex AI Metadata uses ML Metadata (MLMD), which is a gRPC service sitting on top of a SQL database (Cloud SQL). For advanced users, interacting directly with the MLMD store allows for complex graph queries.

The Context-Execution-Artifact Triad:

  1. Context: A grouping (e.g., “Experiment 42”).
  2. Execution: An action (e.g., “Train XGBoost”).
  3. Artifact: A file (e.g., model.bst).

Querying the Graph: “Find all Models trained using Data that originated from S3 Bucket ‘raw-pii’.”

This is a recursive graph traversal problem.

  1. Find Artifact (Bucket) matching metadata uri LIKE 's3://raw-pii%'.
  2. Find downstream Executions.
  3. Find output Artifacts of those Executions.
  4. Repeat until you hit an Artifact of type Model.

This traversal allows you to perform Impact Analysis: “If I find a bug in the raw data ingestion code (v1.1), which 50 models currently in production need to be retrained?”
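The four steps above are a breadth-first search over the lineage graph. A sketch over a mock graph (the node names and edge structure are assumptions standing in for MLMD query results):

```python
# Sketch: impact analysis as BFS over a lineage graph whose edges run
# downstream (artifact -> execution -> artifact). The graph below is a
# mock stand-in for what MLMD queries would return.
from collections import deque


def impacted_models(graph: dict, start: str, model_nodes: set) -> set:
    """graph maps each node to its list of downstream nodes."""
    hit, queue, seen = set(), deque([start]), {start}
    while queue:
        node = queue.popleft()
        if node in model_nodes:
            hit.add(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return hit


lineage = {
    "s3://raw-pii/users.csv": ["clean-job-1"],
    "clean-job-1": ["clean-data-v1"],
    "clean-data-v1": ["train-job-7", "train-job-8"],
    "train-job-7": ["model-credit-v4"],
    "train-job-8": ["model-fraud-v2"],
}
print(impacted_models(lineage, "s3://raw-pii/users.csv",
                      {"model-credit-v4", "model-fraud-v2"}))
```

Flipping the edge direction and starting from a model node gives the inverse query ("which raw data fed Model X?"), which is the same traversal the earlier Vertex lineage snippet performs upstream.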

32.2.9. Databricks Unity Catalog vs. Cloud Native

If you use Databricks, Unity Catalog provides a unified governance layer across Data and AI.

  • Unified Namespace: catalog.schema.table works for tables and models (catalog.schema.model_v1).
  • Lineage: Automatically captures table-to-model lineage.
  • Grants: Uses standard SQL GRANT syntax for models. GRANT EXECUTE ON MODEL my_model TO group_data_scientists.

Comparison:

  • AWS/GCP: Infrastructure-centric. Robust IAM. Great for Ops.
  • Databricks: Data-centric. Great for Analytics/SQL users.

32.2.10. Case Study: Implementing specific Service Control Policies (SCPs)

To govern effectively, you must prevent “Shadow Ops.” Here is an AWS Service Control Policy (SCP), applied at the organization root, that blocks public-access and bucket-policy changes on restricted S3 buckets and denies creating unencrypted SageMaker notebook instances.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyPublicS3Buckets",
      "Effect": "Deny",
      "Action": [
        "s3:PutBucketPublicAccessBlock",
        "s3:PutBucketPolicy"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/gov:data_classification": "restricted" }
      }
    },
    {
      "Sid": "RequireKMSForNotebooks",
      "Effect": "Deny",
      "Action": "sagemaker:CreateNotebookInstance",
      "Resource": "*",
      "Condition": {
        "Null": { "sagemaker:KmsKeyId": "true" }
      }
    }
  ]
}

This ensures that even if a Data Scientist has “Admin” rights in their account, they simply cannot create an unencrypted notebook. This is Guardrails > Guidelines.

32.2.11. Summary

Governance Tools are the nervous system of your MLOps body.

  1. Tag everything: Use a rigid taxonomy.
  2. Visualize: Build dashboards for non-technical stakeholders.
  3. Enforce: Use SCPs and OPA to block non-compliant actions at the API level.
  4. Trace: Use Metadata stores to perform impact analysis.

[End of Section 32.2]