Chapter 19.2: Behavioral Testing: Invariance, Directionality, and Minimum Functionality Tests (MFTs)
In the previous section, we discussed the pyramid of ML testing, which provides a structural overview of what to test. In this section, we dive deep into the how of model quality testing, moving beyond aggregate performance metrics to a more nuanced and granular evaluation of model behavior.
A model that achieves 99% accuracy can still harbor critical, systematic failures on specific subpopulations of data, or fail in predictable and embarrassing ways. A high F1-score tells you that the model performs well on average over your test set; it tells you nothing about how that performance is achieved or whether the model has actually learned the concepts underlying the task.
This is the domain of Behavioral Testing. Inspired by the principles of unit testing in traditional software engineering, behavioral testing evaluates a model’s capabilities on specific, well-defined phenomena.
Important
Instead of asking, “How accurate is the model?”, we ask, “Can the model handle negation?”, “Is the model invariant to changes in gendered pronouns?”, or “Does the loan approval score monotonically increase with income?”
19.2.1. The Behavioral Testing Framework
graph TB
subgraph "Traditional Testing"
A[Test Dataset] --> B[Model]
B --> C[Aggregate Metrics]
C --> D[Accuracy: 94%]
end
subgraph "Behavioral Testing"
E[Capability Tests] --> F[Model]
F --> G{Pass/Fail per Test}
G --> H[Invariance: 98%]
G --> I[Directionality: 87%]
G --> J[MFT Negation: 62%]
end
Why Behavioral Testing Matters
| Approach | What It Measures | Blind Spots |
|---|---|---|
| Traditional Metrics | Average performance on held-out data | Failure modes on edge cases |
| Slice Analysis | Performance on subgroups | Doesn’t test causal understanding |
| Behavioral Testing | Specific capability adherence | Requires human-defined test cases |
The CheckList Framework
The seminal paper “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList” (Ribeiro et al., 2020) introduced this framework; a minimal code sketch of all three test types follows the list:
- Minimum Functionality Tests (MFTs): Simple sanity checks that any model should pass
- Invariance Tests (INV): Model output shouldn’t change for semantically equivalent inputs
- Directional Expectation Tests (DIR): Model output should change predictably for certain input changes
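The sketch below illustrates the three test types as plain pytest-style functions. The predict_sentiment scorer is a deliberately naive keyword heuristic standing in for a real model; only the shape of the assertions matters here.

# Illustrative sketch of the three CheckList test types. predict_sentiment is a
# toy keyword heuristic standing in for a real model; replace it with your own
# scoring call (higher = more positive, range [0, 1]).
NEGATIVE_CUES = {"not", "bad", "terrible"}

def predict_sentiment(text: str) -> float:
    """Toy stand-in scorer used only to make the tests below executable."""
    words = text.lower().replace(".", "").split()
    score = 0.7 + 0.1 * words.count("very")
    if NEGATIVE_CUES & set(words):
        score = 0.2
    return min(score, 1.0)

def test_mft_negation():
    # MFT: a trivially negated sentence must come out negative.
    assert predict_sentiment("The food is not good.") < 0.5

def test_inv_entity_swap():
    # INV: swapping a person's name should leave the score (almost) unchanged.
    a = predict_sentiment("Alice loved the movie.")
    b = predict_sentiment("Bob loved the movie.")
    assert abs(a - b) <= 0.05

def test_dir_intensifier():
    # DIR: adding an intensifier should move the score in the expected direction.
    assert predict_sentiment("The movie was very good.") >= predict_sentiment("The movie was good.")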
19.2.2. Invariance Tests: Testing for What Shouldn’t Matter
The core principle of an invariance test is simple: a model’s prediction should be robust to changes in the input that do not alter the fundamental meaning or context of the data point.
If a sentiment classifier labels “The food was fantastic” as positive, it should also label “The food was fantastic, by the way” as positive.
Common Invariance Categories
graph LR
subgraph "NLP Invariances"
A[Entity Swapping]
B[Typo Introduction]
C[Case Changes]
D[Neutral Fillers]
E[Paraphrasing]
end
subgraph "CV Invariances"
F[Brightness/Contrast]
G[Small Rotation]
H[Minor Cropping]
I[Background Changes]
end
subgraph "Tabular Invariances"
J[Feature Order]
K[Irrelevant Perturbation]
L[Unit Conversions]
end
NLP Invariance Examples
| Invariance Type | Original | Perturbed | Expected |
|---|---|---|---|
| Entity Swap | “Alice loved the movie” | “Bob loved the movie” | Same prediction |
| Typo | “The service was excellent” | “The servise was excelent” | Same prediction |
| Case | “GREAT PRODUCT” | “great product” | Same prediction |
| Filler | “Good food” | “Good food, you know” | Same prediction |
| Synonym | “The film was boring” | “The movie was boring” | Same prediction |
Computer Vision Invariance Examples
| Invariance Type | Transformation | Bounds |
|---|---|---|
| Rotation | Random rotation | ±5 degrees |
| Brightness | Brightness adjustment | ±10% |
| Crop | Edge cropping | ≤5% of image |
| Blur | Gaussian blur | σ ≤ 0.5 |
| Noise | Salt-and-pepper | ≤1% of pixels |
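Computer-vision invariance checks follow the same compare-before-and-after pattern as the NLP framework below. The following sketch is illustrative only: it assumes an H×W×3 float image with values in [0, 1] and a hypothetical predict_proba function returning the top-class probability, and implements three of the bounded perturbations from the table using NumPy and SciPy.

# cv_invariance_sketch.py - illustrative only. predict_proba is a stand-in for
# your vision model; assumes an HxWx3 float image with values in [0, 1].
import numpy as np
from scipy.ndimage import rotate

def predict_proba(image: np.ndarray) -> float:
    # Placeholder: call your model and return the top-class probability.
    return 0.9

def perturb(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply one bounded perturbation from the table above."""
    choice = rng.integers(3)
    if choice == 0:  # rotation within ±5 degrees
        return rotate(image, angle=float(rng.uniform(-5, 5)), reshape=False, mode="nearest")
    if choice == 1:  # brightness within ±10%
        return np.clip(image * rng.uniform(0.9, 1.1), 0.0, 1.0)
    noisy = image.copy()  # salt-and-pepper on at most 1% of values
    mask = rng.random(image.shape) < 0.01
    noisy[mask] = rng.integers(0, 2, size=int(mask.sum()))
    return noisy

def cv_invariance_pass_rate(image: np.ndarray, tolerance: float = 0.05, n_trials: int = 20) -> float:
    """Fraction of bounded perturbations under which the prediction stays within tolerance."""
    rng = np.random.default_rng(0)
    base = predict_proba(image)
    passed = sum(
        abs(predict_proba(perturb(image, rng)) - base) <= tolerance
        for _ in range(n_trials)
    )
    return passed / n_trials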
Implementation: Invariance Testing Framework
# invariance_testing.py - Production-ready invariance testing
from dataclasses import dataclass, field
from typing import Callable, List, Dict, Any, Tuple
from abc import ABC, abstractmethod
import numpy as np
from enum import Enum
import random
import re
class InvarianceType(Enum):
ENTITY_SWAP = "entity_swap"
TYPO = "typo_introduction"
CASE_CHANGE = "case_change"
FILLER_WORDS = "filler_words"
SYNONYM = "synonym_replacement"
PARAPHRASE = "paraphrase"
@dataclass
class InvarianceTestResult:
"""Result of a single invariance test."""
original_input: Any
perturbed_input: Any
original_prediction: Any
perturbed_prediction: Any
invariance_type: InvarianceType
passed: bool
delta: float = 0.0
def __str__(self):
status = "✅ PASS" if self.passed else "❌ FAIL"
return f"{status} | {self.invariance_type.value} | Δ={self.delta:.4f}"
@dataclass
class InvarianceTestSuite:
"""A suite of invariance tests for a specific capability."""
name: str
description: str
invariance_type: InvarianceType
perturbation_fn: Callable[[str], str]
test_cases: List[str] = field(default_factory=list)
tolerance: float = 0.01 # Maximum allowed change in prediction
class TextPerturbations:
"""Library of text perturbation functions."""
# Common first names for entity swapping
FIRST_NAMES = [
"Alice", "Bob", "Charlie", "Diana", "Eve", "Frank",
"Grace", "Henry", "Iris", "Jack", "Kate", "Liam",
"Maria", "Noah", "Olivia", "Peter", "Quinn", "Rose"
]
# Common typo patterns
TYPO_CHARS = {
'a': ['s', 'q', 'z'],
'e': ['w', 'r', 'd'],
'i': ['u', 'o', 'k'],
'o': ['i', 'p', 'l'],
'u': ['y', 'i', 'j']
}
# Neutral filler phrases
FILLERS = [
", you know",
", honestly",
", to be fair",
" basically",
" actually",
", in my opinion"
]
@classmethod
def swap_entity(cls, text: str) -> str:
"""Swap named entities with random alternatives."""
result = text
for name in cls.FIRST_NAMES:
if name in result:
replacement = random.choice([n for n in cls.FIRST_NAMES if n != name])
result = result.replace(name, replacement)
break
return result
@classmethod
def swap_entity_by_gender(cls, text: str) -> Tuple[str, str]:
"""Swap gendered names and return both versions."""
male_names = ["James", "John", "Michael", "David"]
female_names = ["Mary", "Sarah", "Jennifer", "Emily"]
male_version = text
female_version = text
        for m, f in zip(male_names, female_names):
            if m in text:
                female_version = text.replace(m, f)
                break
            if f in text:
                male_version = text.replace(f, m)
                break
return male_version, female_version
@classmethod
def introduce_typo(cls, text: str, probability: float = 0.1) -> str:
"""Introduce realistic typos."""
words = text.split()
result = []
for word in words:
if len(word) > 3 and random.random() < probability:
# Pick a random character to modify
idx = random.randint(1, len(word) - 2)
char = word[idx].lower()
if char in cls.TYPO_CHARS:
replacement = random.choice(cls.TYPO_CHARS[char])
word = word[:idx] + replacement + word[idx+1:]
result.append(word)
return " ".join(result)
@classmethod
def change_case(cls, text: str) -> str:
"""Change case in various ways."""
strategies = [
str.lower,
str.upper,
str.title,
lambda x: x.swapcase()
]
return random.choice(strategies)(text)
@classmethod
def add_filler(cls, text: str) -> str:
"""Add neutral filler words/phrases."""
filler = random.choice(cls.FILLERS)
# Insert at sentence boundary or end
if "." in text:
parts = text.rsplit(".", 1)
return f"{parts[0]}{filler}.{parts[1] if len(parts) > 1 else ''}"
else:
return text + filler
class InvarianceTester:
"""
Run invariance tests against an ML model.
Ensures model predictions are stable under semantically-equivalent transformations.
"""
def __init__(
self,
model: Any,
predict_fn: Callable[[Any, Any], float],
tolerance: float = 0.01
):
"""
Args:
model: The ML model to test
predict_fn: Function that takes (model, input) and returns prediction score
tolerance: Maximum allowed change in prediction for invariance to pass
"""
self.model = model
self.predict_fn = predict_fn
self.tolerance = tolerance
self.results: List[InvarianceTestResult] = []
def test_invariance(
self,
original: str,
perturbed: str,
invariance_type: InvarianceType
) -> InvarianceTestResult:
"""Test invariance for a single input pair."""
original_pred = self.predict_fn(self.model, original)
perturbed_pred = self.predict_fn(self.model, perturbed)
delta = abs(original_pred - perturbed_pred)
passed = delta <= self.tolerance
result = InvarianceTestResult(
original_input=original,
perturbed_input=perturbed,
original_prediction=original_pred,
perturbed_prediction=perturbed_pred,
invariance_type=invariance_type,
passed=passed,
delta=delta
)
self.results.append(result)
return result
def run_suite(
self,
test_cases: List[str],
perturbation_fn: Callable[[str], str],
invariance_type: InvarianceType
) -> Dict[str, Any]:
"""Run a full invariance test suite."""
suite_results = []
for original in test_cases:
perturbed = perturbation_fn(original)
result = self.test_invariance(original, perturbed, invariance_type)
suite_results.append(result)
# Calculate aggregate metrics
passed = sum(1 for r in suite_results if r.passed)
total = len(suite_results)
pass_rate = passed / total if total > 0 else 0
return {
"invariance_type": invariance_type.value,
"total_tests": total,
"passed": passed,
"failed": total - passed,
"pass_rate": pass_rate,
"mean_delta": np.mean([r.delta for r in suite_results]),
"max_delta": max(r.delta for r in suite_results) if suite_results else 0,
"results": suite_results
}
def run_all_invariances(
self,
test_cases: List[str]
) -> Dict[str, Dict]:
"""Run all standard invariance tests."""
suites = {
InvarianceType.ENTITY_SWAP: TextPerturbations.swap_entity,
InvarianceType.TYPO: TextPerturbations.introduce_typo,
InvarianceType.CASE_CHANGE: TextPerturbations.change_case,
InvarianceType.FILLER_WORDS: TextPerturbations.add_filler,
}
results = {}
for inv_type, perturbation_fn in suites.items():
results[inv_type.value] = self.run_suite(
test_cases, perturbation_fn, inv_type
)
return results
def generate_report(self, results: Dict[str, Dict]) -> str:
"""Generate markdown report."""
report = "# Invariance Test Report\n\n"
report += "| Test Type | Pass Rate | Mean Δ | Max Δ | Status |\n"
report += "|:----------|:----------|:-------|:------|:-------|\n"
all_passed = True
for test_type, data in results.items():
pass_rate = data["pass_rate"]
status = "✅" if pass_rate >= 0.95 else ("⚠️" if pass_rate >= 0.80 else "❌")
if pass_rate < 0.95:
all_passed = False
report += (
f"| {test_type} | {pass_rate:.1%} | "
f"{data['mean_delta']:.4f} | {data['max_delta']:.4f} | {status} |\n"
)
overall = "✅ PASSED" if all_passed else "❌ FAILED"
report += f"\n**Overall Status**: {overall}\n"
return report
# Example usage
def example_predict(model, text: str) -> float:
"""Example prediction function."""
# In production, this would call your actual model
return 0.85
# Test execution
tester = InvarianceTester(
model=None, # Your model
predict_fn=example_predict,
tolerance=0.05
)
test_cases = [
"The customer service was excellent today.",
"Alice really enjoyed the new restaurant.",
"The product quality exceeded my expectations.",
]
results = tester.run_all_invariances(test_cases)
print(tester.generate_report(results))
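The same tester pattern carries over to tabular models (the “Tabular Invariances” branch of the earlier diagram). The sketch below is an assumption-laden illustration: predict_fn is assumed to take a dict of raw features, and the feature names such as row_id are hypothetical.

# tabular_invariance_sketch.py - illustrative only. predict_fn(model, row) is
# assumed to accept a dict of raw features; the feature names are hypothetical.
import random
from typing import Callable, Dict, List, Tuple

def reorder_features(row: Dict) -> Dict:
    """Feature-order invariance: identical values, different column order."""
    keys = list(row)
    random.shuffle(keys)
    return {k: row[k] for k in keys}

def bump_irrelevant(row: Dict) -> Dict:
    """Irrelevant perturbation: change a field the model should ignore."""
    perturbed = dict(row)
    perturbed["row_id"] = perturbed.get("row_id", 0) + 1
    return perturbed

def tabular_invariance_failures(
    model: object,
    predict_fn: Callable[[object, Dict], float],
    rows: List[Dict],
    perturbation: Callable[[Dict], Dict],
    tolerance: float = 0.01,
) -> List[Tuple[Dict, float]]:
    """Same pattern as InvarianceTester.test_invariance, applied to feature dicts."""
    failures = []
    for row in rows:
        delta = abs(predict_fn(model, row) - predict_fn(model, perturbation(row)))
        if delta > tolerance:
            failures.append((row, delta))
    return failures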
Fairness-Critical Invariance Tests
# fairness_invariance.py - Testing for demographic fairness
class FairnessInvarianceTester:
"""
Test model invariance to protected attributes.
These tests are CRITICAL for compliance with ECOA (lending),
Title VII (employment), and general fairness principles.
"""
PROTECTED_SWAPS = {
"gender": {
"male": ["James", "John", "Michael", "Robert", "William"],
"female": ["Mary", "Jennifer", "Linda", "Patricia", "Elizabeth"],
"pronouns_male": ["he", "him", "his"],
"pronouns_female": ["she", "her", "hers"]
},
"race": {
            # Names drawn from resume-audit studies (Bertrand & Mullainathan, 2004)
            # and reused in the Caliskan et al. (2017) word-embedding association tests
"group_a": ["Emily", "Greg", "Meredith", "Brad"],
"group_b": ["Lakisha", "Jamal", "Tamika", "Darnell"]
},
"age": {
"young": ["young professional", "recent graduate", "millennial"],
"old": ["senior professional", "experienced veteran", "seasoned"]
}
}
def __init__(self, model, predict_fn, tolerance: float = 0.01):
self.model = model
self.predict_fn = predict_fn
self.tolerance = tolerance
def test_gender_invariance(
self,
templates: List[str]
) -> Dict:
"""
Test invariance to gender-coded names.
Example template: "The candidate {name} applied for the position."
"""
results = []
male_names = self.PROTECTED_SWAPS["gender"]["male"]
female_names = self.PROTECTED_SWAPS["gender"]["female"]
for template in templates:
if "{name}" not in template:
continue
# Test with male names
male_scores = [
self.predict_fn(self.model, template.format(name=name))
for name in male_names[:3]
]
# Test with female names
female_scores = [
self.predict_fn(self.model, template.format(name=name))
for name in female_names[:3]
]
# Statistical comparison
male_mean = np.mean(male_scores)
female_mean = np.mean(female_scores)
delta = abs(male_mean - female_mean)
results.append({
"template": template,
"male_mean": male_mean,
"female_mean": female_mean,
"delta": delta,
"passed": delta <= self.tolerance
})
# Aggregate
passed = sum(1 for r in results if r["passed"])
return {
"test_type": "gender_invariance",
"total": len(results),
"passed": passed,
"failed": len(results) - passed,
"pass_rate": passed / len(results) if results else 0,
"details": results
}
def test_racial_invariance(
self,
templates: List[str]
) -> Dict:
"""Test invariance to racially-associated names."""
# Similar implementation to gender_invariance
pass
def generate_fairness_report(self, results: Dict) -> str:
"""Generate compliance-ready fairness report."""
report = """
# Model Fairness Invariance Report
## Summary
This report evaluates model predictions for invariance to protected attributes
as required by fair lending (ECOA), fair employment (Title VII), and
AI governance best practices.
## Results
| Protected Attribute | Pass Rate | Max Gap | Status |
|:--------------------|:----------|:--------|:-------|
"""
for attr, data in results.items():
pass_rate = data["pass_rate"]
max_gap = max(r["delta"] for r in data["details"]) if data["details"] else 0
status = "✅ COMPLIANT" if pass_rate == 1.0 else "⚠️ REVIEW REQUIRED"
report += f"| {attr} | {pass_rate:.1%} | {max_gap:.4f} | {status} |\n"
report += """
## Methodology
Testing follows the methodology established in:
- Ribeiro et al. (2020) "Beyond Accuracy: Behavioral Testing of NLP Models"
- Caliskan et al. (2017) "Semantics derived automatically from language corpora"
## Compliance Notes
Failure of these tests may indicate discriminatory behavior and should be
investigated before production deployment.
"""
return report
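A short usage sketch for the fairness tester, following the {name} template convention described in test_gender_invariance. The score_text stand-in and the templates themselves are illustrative, not part of any real API.

# Usage sketch for FairnessInvarianceTester. score_text is a placeholder for
# your model's scoring call; the templates follow the "{name}" convention.

def score_text(model, text: str) -> float:
    # Placeholder: in production, return e.g. a hiring or approval score.
    return 0.72

fairness_tester = FairnessInvarianceTester(
    model=None,            # your model object
    predict_fn=score_text,
    tolerance=0.01,
)

templates = [
    "The candidate {name} applied for the senior engineer position.",
    "{name} requested an increase to their credit limit.",
    "{name} submitted a claim following the recent incident.",
]

gender_results = fairness_tester.test_gender_invariance(templates)
print(fairness_tester.generate_fairness_report({"gender": gender_results}))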
19.2.3. Directionality Tests: Testing for What Should Matter
While invariance tests check for robustness to irrelevant changes, directionality tests verify that the model’s output changes in an expected direction when a meaningful change is made to the input.
Common Directionality Patterns
| Domain | Input Change | Expected Output Change |
|---|---|---|
| Sentiment | Add intensifier (“good” → “very good”) | Score increases |
| Sentiment | Add negation (“good” → “not good”) | Score decreases |
| Credit | Increase income | Approval probability increases |
| Churn | Increase support tickets | Churn probability increases |
| Object Detection | Increase object size | Confidence increases |
Implementation: Directionality Testing
# directionality_testing.py
from dataclasses import dataclass
from typing import Callable, List, Dict, Any, Tuple
from enum import Enum
class DirectionType(Enum):
INCREASE = "should_increase"
DECREASE = "should_decrease"
FLIP = "should_flip_class"
@dataclass
class DirectionalityTest:
"""A single directionality test case."""
original: Any
modified: Any
expected_direction: DirectionType
description: str
@dataclass
class DirectionalityResult:
"""Result of a directionality test."""
test_case: DirectionalityTest
original_score: float
modified_score: float
passed: bool
actual_delta: float
class SentimentDirectionalityTests:
"""Pre-built directionality tests for sentiment analysis."""
@staticmethod
def intensifier_tests() -> List[DirectionalityTest]:
"""Adding intensifiers should increase sentiment magnitude."""
return [
DirectionalityTest(
original="The movie was good.",
modified="The movie was very good.",
expected_direction=DirectionType.INCREASE,
description="Intensifier 'very' should increase positive sentiment"
),
DirectionalityTest(
original="The service was helpful.",
modified="The service was extremely helpful.",
expected_direction=DirectionType.INCREASE,
description="Intensifier 'extremely' should increase positive sentiment"
),
DirectionalityTest(
original="The food was bad.",
modified="The food was terrible.",
expected_direction=DirectionType.DECREASE,
description="Stronger negative word should decrease sentiment"
),
]
@staticmethod
def negation_tests() -> List[DirectionalityTest]:
"""Adding negation should reverse sentiment direction."""
return [
DirectionalityTest(
original="I love this product.",
modified="I do not love this product.",
expected_direction=DirectionType.DECREASE,
description="Negation should reverse positive sentiment"
),
DirectionalityTest(
original="The weather is beautiful.",
modified="The weather is not beautiful.",
expected_direction=DirectionType.DECREASE,
description="Negation should reverse positive sentiment"
),
DirectionalityTest(
original="I hate waiting in line.",
modified="I don't hate waiting in line.",
expected_direction=DirectionType.INCREASE,
description="Negation should reverse negative sentiment"
),
]
@staticmethod
def comparative_tests() -> List[DirectionalityTest]:
"""Comparatives should show relative sentiment."""
return [
DirectionalityTest(
original="This restaurant is good.",
modified="This restaurant is better than average.",
expected_direction=DirectionType.INCREASE,
description="Positive comparative should increase sentiment"
),
DirectionalityTest(
original="The quality is acceptable.",
modified="The quality is worse than expected.",
expected_direction=DirectionType.DECREASE,
description="Negative comparative should decrease sentiment"
),
]
class TabularDirectionalityTests:
"""Directionality tests for tabular ML models."""
@staticmethod
def credit_approval_tests(base_applicant: Dict) -> List[DirectionalityTest]:
"""
Credit approval should have monotonic relationships with key features.
"""
tests = []
# Income increase
higher_income = base_applicant.copy()
higher_income["annual_income"] = base_applicant["annual_income"] * 1.5
tests.append(DirectionalityTest(
original=base_applicant,
modified=higher_income,
expected_direction=DirectionType.INCREASE,
description="Higher income should increase approval probability"
))
# Credit score increase
higher_credit = base_applicant.copy()
higher_credit["credit_score"] = min(850, base_applicant["credit_score"] + 50)
tests.append(DirectionalityTest(
original=base_applicant,
modified=higher_credit,
expected_direction=DirectionType.INCREASE,
description="Higher credit score should increase approval probability"
))
# Debt-to-income decrease (improvement)
lower_dti = base_applicant.copy()
lower_dti["debt_to_income"] = base_applicant["debt_to_income"] * 0.7
tests.append(DirectionalityTest(
original=base_applicant,
modified=lower_dti,
expected_direction=DirectionType.INCREASE,
description="Lower DTI should increase approval probability"
))
return tests
class DirectionalityTester:
"""
Run directionality tests to verify model behaves as expected.
"""
def __init__(
self,
model: Any,
predict_fn: Callable[[Any, Any], float],
min_delta: float = 0.05 # Minimum expected change
):
self.model = model
self.predict_fn = predict_fn
self.min_delta = min_delta
def run_test(
self,
test: DirectionalityTest
) -> DirectionalityResult:
"""Run a single directionality test."""
original_score = self.predict_fn(self.model, test.original)
modified_score = self.predict_fn(self.model, test.modified)
delta = modified_score - original_score
# Check if direction matches expectation
if test.expected_direction == DirectionType.INCREASE:
passed = delta >= self.min_delta
elif test.expected_direction == DirectionType.DECREASE:
passed = delta <= -self.min_delta
else: # FLIP
passed = abs(delta) >= 0.5 # Significant change
return DirectionalityResult(
test_case=test,
original_score=original_score,
modified_score=modified_score,
passed=passed,
actual_delta=delta
)
def run_suite(
self,
tests: List[DirectionalityTest]
) -> Dict[str, Any]:
"""Run a full suite of directionality tests."""
results = [self.run_test(t) for t in tests]
passed = sum(1 for r in results if r.passed)
total = len(results)
return {
"total": total,
"passed": passed,
"failed": total - passed,
"pass_rate": passed / total if total > 0 else 0,
"results": results
}
def generate_report(self, results: Dict) -> str:
"""Generate directionality test report."""
report = "# Directionality Test Report\n\n"
report += f"**Pass Rate**: {results['pass_rate']:.1%} ({results['passed']}/{results['total']})\n\n"
report += "## Detailed Results\n\n"
report += "| Description | Expected | Actual Δ | Status |\n"
report += "|:------------|:---------|:---------|:-------|\n"
for r in results["results"]:
expected = r.test_case.expected_direction.value
status = "✅" if r.passed else "❌"
report += f"| {r.test_case.description[:50]}... | {expected} | {r.actual_delta:+.4f} | {status} |\n"
return report
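As with the invariance tester, a short usage sketch ties the pieces together. The scoring functions below are placeholders for real models, and the base_applicant carries the fields that credit_approval_tests perturbs. With constant placeholder scores every delta is zero, so plug in a real model to get meaningful pass rates.

# Usage sketch, mirroring the invariance example earlier. The scoring functions
# are placeholders for your real models; swap in actual prediction calls.

def score_text(model, text: str) -> float:
    return 0.8   # placeholder: probability of positive sentiment

def score_applicant(model, applicant: dict) -> float:
    return 0.6   # placeholder: probability of loan approval

sentiment_tester = DirectionalityTester(model=None, predict_fn=score_text, min_delta=0.05)
sentiment_tests = (
    SentimentDirectionalityTests.intensifier_tests()
    + SentimentDirectionalityTests.negation_tests()
    + SentimentDirectionalityTests.comparative_tests()
)
print(sentiment_tester.generate_report(sentiment_tester.run_suite(sentiment_tests)))

# The base applicant supplies the features credit_approval_tests modifies.
base_applicant = {"annual_income": 55_000, "credit_score": 680, "debt_to_income": 0.35}
credit_tester = DirectionalityTester(model=None, predict_fn=score_applicant, min_delta=0.01)
credit_tests = TabularDirectionalityTests.credit_approval_tests(base_applicant)
print(credit_tester.generate_report(credit_tester.run_suite(credit_tests)))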
19.2.4. Minimum Functionality Tests (MFTs): The Unit Tests of ML
Minimum Functionality Tests are the closest ML equivalent to traditional software unit tests. An MFT is a simple, targeted test case designed to check a very specific, atomic capability of the model.
The MFT Philosophy
graph TB
A[Define Model Capabilities] --> B[Create Simple Test Cases]
B --> C[Each Capability: 10-50 tests]
C --> D[100% Expected Pass Rate]
D --> E{All Pass?}
E -->|Yes| F[Deploy]
E -->|No| G[Debug Specific Capability]
Building an MFT Suite
# mft_suite.py - Minimum Functionality Test Framework
from dataclasses import dataclass, field
from typing import List, Dict, Any, Callable
import pytest
@dataclass
class MFTCapability:
"""A capability that the model should possess."""
name: str
description: str
test_cases: List[Dict[str, Any]] = field(default_factory=list)
expected_pass_rate: float = 1.0 # MFTs should have 100% pass rate
class SentimentMFTSuite:
"""Complete MFT suite for sentiment analysis."""
@staticmethod
def basic_positive() -> MFTCapability:
"""Model should correctly identify obviously positive text."""
return MFTCapability(
name="basic_positive",
description="Identify clearly positive sentiment",
test_cases=[
{"input": "I love this!", "expected": "POSITIVE"},
{"input": "This is amazing.", "expected": "POSITIVE"},
{"input": "Absolutely wonderful experience.", "expected": "POSITIVE"},
{"input": "Best purchase ever.", "expected": "POSITIVE"},
{"input": "Highly recommend!", "expected": "POSITIVE"},
{"input": "5 stars, perfect.", "expected": "POSITIVE"},
{"input": "Exceeded all expectations.", "expected": "POSITIVE"},
{"input": "Couldn't be happier.", "expected": "POSITIVE"},
{"input": "Made my day!", "expected": "POSITIVE"},
{"input": "A true masterpiece.", "expected": "POSITIVE"},
]
)
@staticmethod
def basic_negative() -> MFTCapability:
"""Model should correctly identify obviously negative text."""
return MFTCapability(
name="basic_negative",
description="Identify clearly negative sentiment",
test_cases=[
{"input": "I hate this!", "expected": "NEGATIVE"},
{"input": "This is terrible.", "expected": "NEGATIVE"},
{"input": "Absolutely awful experience.", "expected": "NEGATIVE"},
{"input": "Worst purchase ever.", "expected": "NEGATIVE"},
{"input": "Do not recommend.", "expected": "NEGATIVE"},
{"input": "0 stars, horrible.", "expected": "NEGATIVE"},
{"input": "Complete disappointment.", "expected": "NEGATIVE"},
{"input": "Total waste of money.", "expected": "NEGATIVE"},
{"input": "Ruined my day.", "expected": "NEGATIVE"},
{"input": "An utter failure.", "expected": "NEGATIVE"},
]
)
@staticmethod
def negation_handling() -> MFTCapability:
"""Model should correctly handle negation."""
return MFTCapability(
name="negation_handling",
description="Correctly interpret negated sentiment",
test_cases=[
{"input": "This is not good.", "expected": "NEGATIVE"},
{"input": "Not a bad product.", "expected": "POSITIVE"},
{"input": "I don't like this.", "expected": "NEGATIVE"},
{"input": "Not recommended at all.", "expected": "NEGATIVE"},
{"input": "Nothing special about it.", "expected": "NEGATIVE"},
{"input": "Can't complain.", "expected": "POSITIVE"},
{"input": "Not happy with the service.", "expected": "NEGATIVE"},
{"input": "This isn't what I expected.", "expected": "NEGATIVE"},
]
)
@staticmethod
def neutral_detection() -> MFTCapability:
"""Model should correctly identify neutral/factual text."""
return MFTCapability(
name="neutral_detection",
description="Identify neutral or factual statements",
test_cases=[
{"input": "The product is blue.", "expected": "NEUTRAL"},
{"input": "It weighs 5 pounds.", "expected": "NEUTRAL"},
{"input": "Ships from California.", "expected": "NEUTRAL"},
{"input": "Made of plastic.", "expected": "NEUTRAL"},
{"input": "Available in three sizes.", "expected": "NEUTRAL"},
{"input": "Contains 50 pieces.", "expected": "NEUTRAL"},
]
)
@staticmethod
def sarcasm_detection() -> MFTCapability:
"""Model should handle common sarcastic patterns."""
return MFTCapability(
name="sarcasm_detection",
description="Correctly interpret sarcastic text",
test_cases=[
{"input": "Oh great, another delay.", "expected": "NEGATIVE"},
{"input": "Yeah, because that's exactly what I needed.", "expected": "NEGATIVE"},
{"input": "Just what I always wanted, more problems.", "expected": "NEGATIVE"},
{"input": "Wow, thanks for nothing.", "expected": "NEGATIVE"},
],
expected_pass_rate=0.75 # Sarcasm is hard
)
class MFTRunner:
"""Run MFT suites and generate reports."""
def __init__(
self,
model: Any,
predict_label_fn: Callable[[Any, str], str]
):
self.model = model
self.predict_label_fn = predict_label_fn
self.results = {}
def run_capability(
self,
capability: MFTCapability
) -> Dict:
"""Run tests for a single capability."""
results = []
for test_case in capability.test_cases:
predicted = self.predict_label_fn(self.model, test_case["input"])
passed = predicted == test_case["expected"]
results.append({
"input": test_case["input"],
"expected": test_case["expected"],
"predicted": predicted,
"passed": passed
})
num_passed = sum(1 for r in results if r["passed"])
pass_rate = num_passed / len(results) if results else 0
return {
"capability": capability.name,
"description": capability.description,
"total": len(results),
"passed": num_passed,
"pass_rate": pass_rate,
"meets_threshold": pass_rate >= capability.expected_pass_rate,
"required_pass_rate": capability.expected_pass_rate,
"details": results
}
def run_suite(
self,
capabilities: List[MFTCapability]
) -> Dict:
"""Run complete MFT suite."""
capability_results = {}
for cap in capabilities:
capability_results[cap.name] = self.run_capability(cap)
# Aggregate
all_meet_threshold = all(
r["meets_threshold"] for r in capability_results.values()
)
return {
"overall_pass": all_meet_threshold,
"capabilities": capability_results
}
def generate_pytest_file(
self,
capabilities: List[MFTCapability],
output_path: str
):
"""Generate pytest test file from MFT suite."""
code = '''
import pytest
# Auto-generated MFT test file
@pytest.fixture(scope="session")
def model():
# Load your model here
from my_model import load_model
return load_model()
'''
for cap in capabilities:
test_cases = [
(tc["input"], tc["expected"])
for tc in cap.test_cases
]
code += f'''
# {cap.description}
@pytest.mark.parametrize("text,expected", {test_cases})
def test_{cap.name}(model, text, expected):
predicted = model.predict_label(text)
assert predicted == expected, f"Input: {{text}}, Expected: {{expected}}, Got: {{predicted}}"
'''
with open(output_path, 'w') as f:
f.write(code)
return output_path
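A usage sketch for the MFT runner. toy_label is a stand-in for your label-predicting function; the generated pytest path matches the one the CI workflow in the next subsection expects (the directory must already exist).

# Usage sketch for MFTRunner. toy_label is a placeholder; in production,
# predict_label_fn should wrap your real classifier.

def toy_label(model, text: str) -> str:
    return "POSITIVE"   # placeholder: always predicts POSITIVE

capabilities = [
    SentimentMFTSuite.basic_positive(),
    SentimentMFTSuite.basic_negative(),
    SentimentMFTSuite.negation_handling(),
    SentimentMFTSuite.neutral_detection(),
    SentimentMFTSuite.sarcasm_detection(),
]

runner = MFTRunner(model=None, predict_label_fn=toy_label)
report = runner.run_suite(capabilities)
print("Overall pass:", report["overall_pass"])
for name, cap in report["capabilities"].items():
    print(f"{name}: {cap['passed']}/{cap['total']} ({cap['pass_rate']:.0%})")

# Optionally emit a pytest file that CI can execute directly.
runner.generate_pytest_file(capabilities, "tests/behavioral/test_mft.py")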
19.2.5. Integrating Behavioral Tests into CI/CD
Pipeline Integration
# .github/workflows/ml-behavioral-tests.yml
name: ML Behavioral Tests
on:
push:
paths:
- 'models/**'
- 'tests/behavioral/**'
pull_request:
paths:
- 'models/**'
jobs:
behavioral-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install Dependencies
run: |
pip install -r requirements-test.txt
- name: Download Model Artifact
run: |
# Download from model registry
mlflow artifacts download -r ${{ secrets.MLFLOW_RUN_ID }} -d ./model
- name: Run Invariance Tests
run: |
pytest tests/behavioral/test_invariance.py \
--junitxml=reports/invariance.xml \
--html=reports/invariance.html
- name: Run Directionality Tests
run: |
pytest tests/behavioral/test_directionality.py \
--junitxml=reports/directionality.xml
- name: Run MFT Suite
run: |
pytest tests/behavioral/test_mft.py \
--junitxml=reports/mft.xml
- name: Generate Combined Report
run: |
python scripts/combine_behavioral_reports.py \
--output reports/behavioral_summary.json
- name: Check Quality Gates
run: |
python scripts/check_quality_gates.py \
--config quality_gates.yaml \
--results reports/behavioral_summary.json
- name: Upload Reports
uses: actions/upload-artifact@v3
with:
name: behavioral-test-reports
path: reports/
Quality Gate Configuration
# quality_gates.yaml
behavioral_tests:
invariance:
entity_swap:
min_pass_rate: 0.98
blocking: true
typo_introduction:
min_pass_rate: 0.90
blocking: false
case_change:
min_pass_rate: 0.98
blocking: true
fairness_gender:
min_pass_rate: 1.0
blocking: true # Zero tolerance
fairness_race:
min_pass_rate: 1.0
blocking: true # Zero tolerance
directionality:
negation:
min_pass_rate: 0.85
blocking: true
intensifier:
min_pass_rate: 0.90
blocking: true
mft:
basic_positive:
min_pass_rate: 1.0
blocking: true
basic_negative:
min_pass_rate: 1.0
blocking: true
negation_handling:
min_pass_rate: 0.85
blocking: true
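The workflow above delegates gate enforcement to scripts/check_quality_gates.py, which is not shown. The sketch below assumes behavioral_summary.json is a nested mapping from category to test name to a pass_rate field; that is an assumption about the output of combine_behavioral_reports.py rather than a documented format, so adapt the lookup to your own report schema.

# scripts/check_quality_gates.py - minimal sketch. Assumes behavioral_summary.json
# looks like {"invariance": {"entity_swap": {"pass_rate": 0.99}, ...}, ...}.
import argparse
import json
import sys
import yaml  # provided by PyYAML

def main() -> int:
    parser = argparse.ArgumentParser(description="Enforce behavioral-test quality gates")
    parser.add_argument("--config", required=True)
    parser.add_argument("--results", required=True)
    args = parser.parse_args()

    with open(args.config) as f:
        gates = yaml.safe_load(f)["behavioral_tests"]
    with open(args.results) as f:
        results = json.load(f)

    blocking_failures = []
    for category, tests in gates.items():
        for test_name, gate in tests.items():
            pass_rate = results.get(category, {}).get(test_name, {}).get("pass_rate")
            if pass_rate is None:
                print(f"WARNING: no result found for {category}.{test_name}")
                continue
            ok = pass_rate >= gate["min_pass_rate"]
            print(f"{category}.{test_name}: {pass_rate:.1%} "
                  f"(required {gate['min_pass_rate']:.1%}) -> {'PASS' if ok else 'FAIL'}")
            if not ok and gate.get("blocking", False):
                blocking_failures.append(f"{category}.{test_name}")

    if blocking_failures:
        print("Blocking quality gates failed:", ", ".join(blocking_failures))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())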
19.2.6. Summary Checklist
| Test Type | Purpose | Expected Pass Rate | Blocking? |
|---|---|---|---|
| MFT - Basic | Sanity checks | 100% | ✅ Yes |
| MFT - Complex | Advanced capabilities | 85%+ | ⚠️ Depends |
| Invariance - Neutral | Filler words, typos | 95%+ | ✅ Yes |
| Invariance - Fairness | Protected attributes | 100% | ✅ Yes |
| Directionality - Core | Negation, intensifiers | 85%+ | ✅ Yes |
Behavioral testing fundamentally changes how we think about model evaluation. It forces us to move beyond a myopic focus on a single accuracy number and instead adopt a more holistic, capability-oriented view of model quality.
[End of Section 19.2]