Chapter 19.2: Behavioral Testing: Invariance, Directionality, and Minimum Functionality Tests (MFTs)

In the previous section, we discussed the pyramid of ML testing, which provides a structural overview of what to test. In this section, we dive deep into the how of model quality testing, moving beyond aggregate performance metrics to a more nuanced and granular evaluation of model behavior.

A model that achieves 99% accuracy can still harbor critical, systematic failures for specific subpopulations of data or fail in predictable and embarrassing ways. A high F1-score tells you that your model is performing well on average on your test set, but it doesn’t tell you how it’s achieving that performance or what its conceptual understanding of the problem is.

This is the domain of Behavioral Testing. Inspired by the principles of unit testing in traditional software engineering, behavioral testing evaluates a model’s capabilities on specific, well-defined phenomena.

Important

Instead of asking, “How accurate is the model?”, we ask, “Can the model handle negation?”, “Is the model invariant to changes in gendered pronouns?”, or “Does the loan approval score monotonically increase with income?”


19.2.1. The Behavioral Testing Framework

graph TB
    subgraph "Traditional Testing"
        A[Test Dataset] --> B[Model]
        B --> C[Aggregate Metrics]
        C --> D[Accuracy: 94%]
    end
    
    subgraph "Behavioral Testing"
        E[Capability Tests] --> F[Model]
        F --> G{Pass/Fail per Test}
        G --> H[Invariance: 98%]
        G --> I[Directionality: 87%]
        G --> J[MFT Negation: 62%]
    end

Why Behavioral Testing Matters

| Approach | What It Measures | Blind Spots |
|:---------|:-----------------|:------------|
| Traditional Metrics | Average performance on held-out data | Failure modes on edge cases |
| Slice Analysis | Performance on subgroups | Doesn't test causal understanding |
| Behavioral Testing | Specific capability adherence | Requires human-defined test cases |

The CheckList Framework

The seminal paper “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList” (Ribeiro et al., 2020) introduced this framework:

  1. Minimum Functionality Tests (MFTs): Simple sanity checks that any model should pass
  2. Invariance Tests (INV): Model output shouldn’t change for semantically-equivalent inputs
  3. Directional Expectation Tests (DIR): Model output should change predictably for certain input changes
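
Concretely, each of these reduces to a small, unit-test-style assertion against the model's scoring function. The sketch below is illustrative only: `predict_sentiment` is a hypothetical stand-in for your model's positive-class score, and the 0.05 thresholds are arbitrary placeholders.

# checklist_sketch.py - one miniature example of each CheckList test type (illustrative)

def predict_sentiment(text: str) -> float:
    """Hypothetical stand-in: return the model's positive-class probability."""
    return 0.5  # placeholder; call your actual model here


def test_mft_negation():
    # MFT: a simple case the model must get right
    assert predict_sentiment("The food was not good.") < 0.5


def test_inv_entity_swap():
    # INV: swapping a person's name should not move the score
    delta = abs(predict_sentiment("Alice loved the movie.")
                - predict_sentiment("Bob loved the movie."))
    assert delta <= 0.05


def test_dir_intensifier():
    # DIR: adding an intensifier should push the score in a known direction
    assert (predict_sentiment("The movie was very good.")
            >= predict_sentiment("The movie was good.") + 0.05)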

19.2.2. Invariance Tests: Testing for What Shouldn’t Matter

The core principle of an invariance test is simple: a model’s prediction should be robust to changes in the input that do not alter the fundamental meaning or context of the data point.

If a sentiment classifier labels “The food was fantastic” as positive, it should also label “The food was fantastic, by the way” as positive.

Common Invariance Categories

graph LR
    subgraph "NLP Invariances"
        A[Entity Swapping]
        B[Typo Introduction]
        C[Case Changes]
        D[Neutral Fillers]
        E[Paraphrasing]
    end
    
    subgraph "CV Invariances"
        F[Brightness/Contrast]
        G[Small Rotation]
        H[Minor Cropping]
        I[Background Changes]
    end
    
    subgraph "Tabular Invariances"
        J[Feature Order]
        K[Irrelevant Perturbation]
        L[Unit Conversions]
    end

NLP Invariance Examples

| Invariance Type | Original | Perturbed | Expected |
|:----------------|:---------|:----------|:---------|
| Entity Swap | “Alice loved the movie” | “Bob loved the movie” | Same prediction |
| Typo | “The service was excellent” | “The servise was excelent” | Same prediction |
| Case | “GREAT PRODUCT” | “great product” | Same prediction |
| Filler | “Good food” | “Good food, you know” | Same prediction |
| Synonym | “The film was boring” | “The movie was boring” | Same prediction |

Computer Vision Invariance Examples

| Invariance Type | Transformation | Bounds |
|:----------------|:---------------|:-------|
| Rotation | Random rotation | ±5 degrees |
| Brightness | Brightness adjustment | ±10% |
| Crop | Edge cropping | ≤5% of image |
| Blur | Gaussian blur | σ ≤ 0.5 |
| Noise | Salt-and-pepper | ≤1% of pixels |
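
The framework below focuses on text, but the same bounded, label-preserving perturbations can be generated for images. A minimal sketch, assuming Pillow and NumPy are available (the chapter does not prescribe an image library), applying one transform per call within the bounds above:

# cv_invariance_sketch.py - illustrative only; assumes Pillow and NumPy

import random

import numpy as np
from PIL import Image, ImageEnhance, ImageFilter


def perturb_within_bounds(img: Image.Image) -> Image.Image:
    """Apply one small, label-preserving transform from the bounds table above."""
    choice = random.choice(["rotate", "brightness", "blur", "noise"])

    if choice == "rotate":
        return img.rotate(random.uniform(-5, 5))  # ±5 degrees
    if choice == "brightness":
        factor = 1.0 + random.uniform(-0.10, 0.10)  # ±10%
        return ImageEnhance.Brightness(img).enhance(factor)
    if choice == "blur":
        return img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 0.5)))  # σ ≤ 0.5

    # Salt-and-pepper noise on ≤1% of pixels
    arr = np.array(img.convert("RGB"))
    h, w, _ = arr.shape
    n = int(0.01 * h * w)
    ys = np.random.randint(0, h, n)
    xs = np.random.randint(0, w, n)
    arr[ys, xs] = np.where(np.random.rand(n) < 0.5, 0, 255)[:, None]
    return Image.fromarray(arr)

The comparison pattern is the same as for text: score the original and the perturbed image and check the delta against a tolerance.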

Implementation: Invariance Testing Framework

# invariance_testing.py - Production-ready invariance testing

from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple
from enum import Enum
import random

import numpy as np


class InvarianceType(Enum):
    ENTITY_SWAP = "entity_swap"
    TYPO = "typo_introduction"
    CASE_CHANGE = "case_change"
    FILLER_WORDS = "filler_words"
    SYNONYM = "synonym_replacement"
    PARAPHRASE = "paraphrase"


@dataclass
class InvarianceTestResult:
    """Result of a single invariance test."""
    original_input: Any
    perturbed_input: Any
    original_prediction: Any
    perturbed_prediction: Any
    invariance_type: InvarianceType
    passed: bool
    delta: float = 0.0
    
    def __str__(self):
        status = "✅ PASS" if self.passed else "❌ FAIL"
        return f"{status} | {self.invariance_type.value} | Δ={self.delta:.4f}"


@dataclass
class InvarianceTestSuite:
    """A suite of invariance tests for a specific capability."""
    name: str
    description: str
    invariance_type: InvarianceType
    perturbation_fn: Callable[[str], str]
    test_cases: List[str] = field(default_factory=list)
    tolerance: float = 0.01  # Maximum allowed change in prediction
    

class TextPerturbations:
    """Library of text perturbation functions."""
    
    # Common first names for entity swapping
    FIRST_NAMES = [
        "Alice", "Bob", "Charlie", "Diana", "Eve", "Frank",
        "Grace", "Henry", "Iris", "Jack", "Kate", "Liam",
        "Maria", "Noah", "Olivia", "Peter", "Quinn", "Rose"
    ]
    
    # Common typo patterns
    TYPO_CHARS = {
        'a': ['s', 'q', 'z'],
        'e': ['w', 'r', 'd'],
        'i': ['u', 'o', 'k'],
        'o': ['i', 'p', 'l'],
        'u': ['y', 'i', 'j']
    }
    
    # Neutral filler phrases
    FILLERS = [
        ", you know",
        ", honestly",
        ", to be fair",
        " basically",
        " actually",
        ", in my opinion"
    ]
    
    @classmethod
    def swap_entity(cls, text: str) -> str:
        """Swap named entities with random alternatives."""
        result = text
        for name in cls.FIRST_NAMES:
            if name in result:
                replacement = random.choice([n for n in cls.FIRST_NAMES if n != name])
                result = result.replace(name, replacement)
                break
        return result
    
    @classmethod
    def swap_entity_by_gender(cls, text: str) -> Tuple[str, str]:
        """Swap gendered names and return both versions."""
        male_names = ["James", "John", "Michael", "David"]
        female_names = ["Mary", "Sarah", "Jennifer", "Emily"]
        
        male_version = text
        female_version = text
        
        for m, f in zip(male_names, female_names):
            if m in text:
                # male_version already equals text; swap to get the female version
                female_version = text.replace(m, f)
                break
            if f in text:
                # female_version already equals text; swap to get the male version
                male_version = text.replace(f, m)
                break
        
        return male_version, female_version
    
    @classmethod
    def introduce_typo(cls, text: str, probability: float = 0.1) -> str:
        """Introduce realistic typos."""
        words = text.split()
        result = []
        
        for word in words:
            if len(word) > 3 and random.random() < probability:
                # Pick a random character to modify
                idx = random.randint(1, len(word) - 2)
                char = word[idx].lower()
                
                if char in cls.TYPO_CHARS:
                    replacement = random.choice(cls.TYPO_CHARS[char])
                    word = word[:idx] + replacement + word[idx+1:]
            
            result.append(word)
        
        return " ".join(result)
    
    @classmethod
    def change_case(cls, text: str) -> str:
        """Change case in various ways."""
        strategies = [
            str.lower,
            str.upper,
            str.title,
            lambda x: x.swapcase()
        ]
        return random.choice(strategies)(text)
    
    @classmethod
    def add_filler(cls, text: str) -> str:
        """Add neutral filler words/phrases."""
        filler = random.choice(cls.FILLERS)
        
        # Insert at sentence boundary or end
        if "." in text:
            parts = text.rsplit(".", 1)
            return f"{parts[0]}{filler}.{parts[1] if len(parts) > 1 else ''}"
        else:
            return text + filler


class InvarianceTester:
    """
    Run invariance tests against an ML model.
    
    Ensures model predictions are stable under semantically-equivalent transformations.
    """
    
    def __init__(
        self,
        model: Any,
        predict_fn: Callable[[Any, Any], float],
        tolerance: float = 0.01
    ):
        """
        Args:
            model: The ML model to test
            predict_fn: Function that takes (model, input) and returns prediction score
            tolerance: Maximum allowed change in prediction for invariance to pass
        """
        self.model = model
        self.predict_fn = predict_fn
        self.tolerance = tolerance
        self.results: List[InvarianceTestResult] = []
    
    def test_invariance(
        self,
        original: str,
        perturbed: str,
        invariance_type: InvarianceType
    ) -> InvarianceTestResult:
        """Test invariance for a single input pair."""
        
        original_pred = self.predict_fn(self.model, original)
        perturbed_pred = self.predict_fn(self.model, perturbed)
        
        delta = abs(original_pred - perturbed_pred)
        passed = delta <= self.tolerance
        
        result = InvarianceTestResult(
            original_input=original,
            perturbed_input=perturbed,
            original_prediction=original_pred,
            perturbed_prediction=perturbed_pred,
            invariance_type=invariance_type,
            passed=passed,
            delta=delta
        )
        
        self.results.append(result)
        return result
    
    def run_suite(
        self,
        test_cases: List[str],
        perturbation_fn: Callable[[str], str],
        invariance_type: InvarianceType
    ) -> Dict[str, Any]:
        """Run a full invariance test suite."""
        
        suite_results = []
        
        for original in test_cases:
            perturbed = perturbation_fn(original)
            result = self.test_invariance(original, perturbed, invariance_type)
            suite_results.append(result)
        
        # Calculate aggregate metrics
        passed = sum(1 for r in suite_results if r.passed)
        total = len(suite_results)
        pass_rate = passed / total if total > 0 else 0
        
        return {
            "invariance_type": invariance_type.value,
            "total_tests": total,
            "passed": passed,
            "failed": total - passed,
            "pass_rate": pass_rate,
            "mean_delta": np.mean([r.delta for r in suite_results]),
            "max_delta": max(r.delta for r in suite_results) if suite_results else 0,
            "results": suite_results
        }
    
    def run_all_invariances(
        self,
        test_cases: List[str]
    ) -> Dict[str, Dict]:
        """Run all standard invariance tests."""
        
        suites = {
            InvarianceType.ENTITY_SWAP: TextPerturbations.swap_entity,
            InvarianceType.TYPO: TextPerturbations.introduce_typo,
            InvarianceType.CASE_CHANGE: TextPerturbations.change_case,
            InvarianceType.FILLER_WORDS: TextPerturbations.add_filler,
        }
        
        results = {}
        
        for inv_type, perturbation_fn in suites.items():
            results[inv_type.value] = self.run_suite(
                test_cases, perturbation_fn, inv_type
            )
        
        return results
    
    def generate_report(self, results: Dict[str, Dict]) -> str:
        """Generate markdown report."""
        
        report = "# Invariance Test Report\n\n"
        report += "| Test Type | Pass Rate | Mean Δ | Max Δ | Status |\n"
        report += "|:----------|:----------|:-------|:------|:-------|\n"
        
        all_passed = True
        
        for test_type, data in results.items():
            pass_rate = data["pass_rate"]
            status = "✅" if pass_rate >= 0.95 else ("⚠️" if pass_rate >= 0.80 else "❌")
            if pass_rate < 0.95:
                all_passed = False
            
            report += (
                f"| {test_type} | {pass_rate:.1%} | "
                f"{data['mean_delta']:.4f} | {data['max_delta']:.4f} | {status} |\n"
            )
        
        overall = "✅ PASSED" if all_passed else "❌ FAILED"
        report += f"\n**Overall Status**: {overall}\n"
        
        return report


# Example usage
def example_predict(model, text: str) -> float:
    """Example prediction function."""
    # In production, this would call your actual model
    return 0.85


# Test execution
tester = InvarianceTester(
    model=None,  # Your model
    predict_fn=example_predict,
    tolerance=0.05
)

test_cases = [
    "The customer service was excellent today.",
    "Alice really enjoyed the new restaurant.",
    "The product quality exceeded my expectations.",
]

results = tester.run_all_invariances(test_cases)
print(tester.generate_report(results))

Fairness-Critical Invariance Tests

# fairness_invariance.py - Testing for demographic fairness

from typing import Dict, List

import numpy as np

class FairnessInvarianceTester:
    """
    Test model invariance to protected attributes.
    
    These tests are CRITICAL for compliance with ECOA (lending),
    Title VII (employment), and general fairness principles.
    """
    
    PROTECTED_SWAPS = {
        "gender": {
            "male": ["James", "John", "Michael", "Robert", "William"],
            "female": ["Mary", "Jennifer", "Linda", "Patricia", "Elizabeth"],
            "pronouns_male": ["he", "him", "his"],
            "pronouns_female": ["she", "her", "hers"]
        },
        "race": {
            # Names associated with different demographic groups
            # Based on Caliskan et al. audit methodology
            "group_a": ["Emily", "Greg", "Meredith", "Brad"],
            "group_b": ["Lakisha", "Jamal", "Tamika", "Darnell"]
        },
        "age": {
            "young": ["young professional", "recent graduate", "millennial"],
            "old": ["senior professional", "experienced veteran", "seasoned"]
        }
    }
    
    def __init__(self, model, predict_fn, tolerance: float = 0.01):
        self.model = model
        self.predict_fn = predict_fn
        self.tolerance = tolerance
    
    def test_gender_invariance(
        self,
        templates: List[str]
    ) -> Dict:
        """
        Test invariance to gender-coded names.
        
        Example template: "The candidate {name} applied for the position."
        """
        results = []
        
        male_names = self.PROTECTED_SWAPS["gender"]["male"]
        female_names = self.PROTECTED_SWAPS["gender"]["female"]
        
        for template in templates:
            if "{name}" not in template:
                continue
            
            # Test with male names
            male_scores = [
                self.predict_fn(self.model, template.format(name=name))
                for name in male_names[:3]
            ]
            
            # Test with female names
            female_scores = [
                self.predict_fn(self.model, template.format(name=name))
                for name in female_names[:3]
            ]
            
            # Statistical comparison
            male_mean = np.mean(male_scores)
            female_mean = np.mean(female_scores)
            delta = abs(male_mean - female_mean)
            
            results.append({
                "template": template,
                "male_mean": male_mean,
                "female_mean": female_mean,
                "delta": delta,
                "passed": delta <= self.tolerance
            })
        
        # Aggregate
        passed = sum(1 for r in results if r["passed"])
        
        return {
            "test_type": "gender_invariance",
            "total": len(results),
            "passed": passed,
            "failed": len(results) - passed,
            "pass_rate": passed / len(results) if results else 0,
            "details": results
        }
    
    def test_racial_invariance(
        self,
        templates: List[str]
    ) -> Dict:
        """Test invariance to racially-associated names (paired-name audit)."""
        group_a = self.PROTECTED_SWAPS["race"]["group_a"]
        group_b = self.PROTECTED_SWAPS["race"]["group_b"]
        
        results = []
        
        for template in templates:
            if "{name}" not in template:
                continue
            
            a_scores = [
                self.predict_fn(self.model, template.format(name=name))
                for name in group_a
            ]
            b_scores = [
                self.predict_fn(self.model, template.format(name=name))
                for name in group_b
            ]
            
            delta = abs(np.mean(a_scores) - np.mean(b_scores))
            
            results.append({
                "template": template,
                "group_a_mean": np.mean(a_scores),
                "group_b_mean": np.mean(b_scores),
                "delta": delta,
                "passed": delta <= self.tolerance
            })
        
        passed = sum(1 for r in results if r["passed"])
        
        return {
            "test_type": "racial_invariance",
            "total": len(results),
            "passed": passed,
            "failed": len(results) - passed,
            "pass_rate": passed / len(results) if results else 0,
            "details": results
        }
    
    def generate_fairness_report(self, results: Dict) -> str:
        """Generate compliance-ready fairness report."""
        
        report = """
# Model Fairness Invariance Report

## Summary

This report evaluates model predictions for invariance to protected attributes
as required by fair lending (ECOA), fair employment (Title VII), and
AI governance best practices.

## Results

| Protected Attribute | Pass Rate | Max Gap | Status |
|:--------------------|:----------|:--------|:-------|
"""
        
        for attr, data in results.items():
            pass_rate = data["pass_rate"]
            max_gap = max(r["delta"] for r in data["details"]) if data["details"] else 0
            status = "✅ COMPLIANT" if pass_rate == 1.0 else "⚠️ REVIEW REQUIRED"
            
            report += f"| {attr} | {pass_rate:.1%} | {max_gap:.4f} | {status} |\n"
        
        report += """
## Methodology

Testing follows the methodology established in:
- Ribeiro et al. (2020) "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList"
- Caliskan et al. (2017) "Semantics derived automatically from language corpora contain human-like biases"

## Compliance Notes

Failure of these tests may indicate discriminatory behavior and should be
investigated before production deployment.
"""
        
        return report
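
A usage sketch mirroring the earlier example; the scoring stub and the templates are hypothetical placeholders.

# Example usage (illustrative)
def example_score(model, text: str) -> float:
    return 0.85  # placeholder; call your actual model here


fairness_tester = FairnessInvarianceTester(
    model=None,  # your model
    predict_fn=example_score,
    tolerance=0.01
)

templates = [
    "The candidate {name} applied for the senior engineer position.",
    "{name} has five years of relevant experience in the field.",
]

gender_results = fairness_tester.test_gender_invariance(templates)
print(fairness_tester.generate_fairness_report({"gender": gender_results}))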

19.2.3. Directionality Tests: Testing for What Should Matter

While invariance tests check for robustness to irrelevant changes, directionality tests verify that the model’s output changes in an expected direction when a meaningful change is made to the input.

Common Directionality Patterns

| Domain | Input Change | Expected Output Change |
|:-------|:-------------|:-----------------------|
| Sentiment | Add intensifier (“good” → “very good”) | Score increases |
| Sentiment | Add negation (“good” → “not good”) | Score decreases |
| Credit | Increase income | Approval probability increases |
| Churn | Increase support tickets | Churn probability increases |
| Object Detection | Increase object size | Confidence increases |

Implementation: Directionality Testing

# directionality_testing.py

from dataclasses import dataclass
from typing import Any, Callable, Dict, List
from enum import Enum


class DirectionType(Enum):
    INCREASE = "should_increase"
    DECREASE = "should_decrease"
    FLIP = "should_flip_class"


@dataclass
class DirectionalityTest:
    """A single directionality test case."""
    original: Any
    modified: Any
    expected_direction: DirectionType
    description: str


@dataclass
class DirectionalityResult:
    """Result of a directionality test."""
    test_case: DirectionalityTest
    original_score: float
    modified_score: float
    passed: bool
    actual_delta: float


class SentimentDirectionalityTests:
    """Pre-built directionality tests for sentiment analysis."""
    
    @staticmethod
    def intensifier_tests() -> List[DirectionalityTest]:
        """Adding intensifiers should increase sentiment magnitude."""
        return [
            DirectionalityTest(
                original="The movie was good.",
                modified="The movie was very good.",
                expected_direction=DirectionType.INCREASE,
                description="Intensifier 'very' should increase positive sentiment"
            ),
            DirectionalityTest(
                original="The service was helpful.",
                modified="The service was extremely helpful.",
                expected_direction=DirectionType.INCREASE,
                description="Intensifier 'extremely' should increase positive sentiment"
            ),
            DirectionalityTest(
                original="The food was bad.",
                modified="The food was terrible.",
                expected_direction=DirectionType.DECREASE,
                description="Stronger negative word should decrease sentiment"
            ),
        ]
    
    @staticmethod
    def negation_tests() -> List[DirectionalityTest]:
        """Adding negation should reverse sentiment direction."""
        return [
            DirectionalityTest(
                original="I love this product.",
                modified="I do not love this product.",
                expected_direction=DirectionType.DECREASE,
                description="Negation should reverse positive sentiment"
            ),
            DirectionalityTest(
                original="The weather is beautiful.",
                modified="The weather is not beautiful.",
                expected_direction=DirectionType.DECREASE,
                description="Negation should reverse positive sentiment"
            ),
            DirectionalityTest(
                original="I hate waiting in line.",
                modified="I don't hate waiting in line.",
                expected_direction=DirectionType.INCREASE,
                description="Negation should reverse negative sentiment"
            ),
        ]
    
    @staticmethod
    def comparative_tests() -> List[DirectionalityTest]:
        """Comparatives should show relative sentiment."""
        return [
            DirectionalityTest(
                original="This restaurant is good.",
                modified="This restaurant is better than average.",
                expected_direction=DirectionType.INCREASE,
                description="Positive comparative should increase sentiment"
            ),
            DirectionalityTest(
                original="The quality is acceptable.",
                modified="The quality is worse than expected.",
                expected_direction=DirectionType.DECREASE,
                description="Negative comparative should decrease sentiment"
            ),
        ]


class TabularDirectionalityTests:
    """Directionality tests for tabular ML models."""
    
    @staticmethod
    def credit_approval_tests(base_applicant: Dict) -> List[DirectionalityTest]:
        """
        Credit approval should have monotonic relationships with key features.
        """
        tests = []
        
        # Income increase
        higher_income = base_applicant.copy()
        higher_income["annual_income"] = base_applicant["annual_income"] * 1.5
        
        tests.append(DirectionalityTest(
            original=base_applicant,
            modified=higher_income,
            expected_direction=DirectionType.INCREASE,
            description="Higher income should increase approval probability"
        ))
        
        # Credit score increase
        higher_credit = base_applicant.copy()
        higher_credit["credit_score"] = min(850, base_applicant["credit_score"] + 50)
        
        tests.append(DirectionalityTest(
            original=base_applicant,
            modified=higher_credit,
            expected_direction=DirectionType.INCREASE,
            description="Higher credit score should increase approval probability"
        ))
        
        # Debt-to-income decrease (improvement)
        lower_dti = base_applicant.copy()
        lower_dti["debt_to_income"] = base_applicant["debt_to_income"] * 0.7
        
        tests.append(DirectionalityTest(
            original=base_applicant,
            modified=lower_dti,
            expected_direction=DirectionType.INCREASE,
            description="Lower DTI should increase approval probability"
        ))
        
        return tests


class DirectionalityTester:
    """
    Run directionality tests to verify model behaves as expected.
    """
    
    def __init__(
        self,
        model: Any,
        predict_fn: Callable[[Any, Any], float],
        min_delta: float = 0.05  # Minimum expected change
    ):
        self.model = model
        self.predict_fn = predict_fn
        self.min_delta = min_delta
    
    def run_test(
        self,
        test: DirectionalityTest
    ) -> DirectionalityResult:
        """Run a single directionality test."""
        
        original_score = self.predict_fn(self.model, test.original)
        modified_score = self.predict_fn(self.model, test.modified)
        
        delta = modified_score - original_score
        
        # Check if direction matches expectation
        if test.expected_direction == DirectionType.INCREASE:
            passed = delta >= self.min_delta
        elif test.expected_direction == DirectionType.DECREASE:
            passed = delta <= -self.min_delta
        else:  # FLIP
            passed = abs(delta) >= 0.5  # Significant change
        
        return DirectionalityResult(
            test_case=test,
            original_score=original_score,
            modified_score=modified_score,
            passed=passed,
            actual_delta=delta
        )
    
    def run_suite(
        self,
        tests: List[DirectionalityTest]
    ) -> Dict[str, Any]:
        """Run a full suite of directionality tests."""
        
        results = [self.run_test(t) for t in tests]
        
        passed = sum(1 for r in results if r.passed)
        total = len(results)
        
        return {
            "total": total,
            "passed": passed,
            "failed": total - passed,
            "pass_rate": passed / total if total > 0 else 0,
            "results": results
        }
    
    def generate_report(self, results: Dict) -> str:
        """Generate directionality test report."""
        
        report = "# Directionality Test Report\n\n"
        report += f"**Pass Rate**: {results['pass_rate']:.1%} ({results['passed']}/{results['total']})\n\n"
        
        report += "## Detailed Results\n\n"
        report += "| Description | Expected | Actual Δ | Status |\n"
        report += "|:------------|:---------|:---------|:-------|\n"
        
        for r in results["results"]:
            expected = r.test_case.expected_direction.value
            status = "✅" if r.passed else "❌"
            report += f"| {r.test_case.description[:50]}... | {expected} | {r.actual_delta:+.4f} | {status} |\n"
        
        return report
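
A usage sketch mirroring the invariance example; the scoring stub is a hypothetical placeholder.

# Example usage (illustrative)
def example_sentiment_score(model, text: str) -> float:
    return 0.5  # placeholder; return your model's positive-class probability


dir_tester = DirectionalityTester(
    model=None,  # your model
    predict_fn=example_sentiment_score,
    min_delta=0.05
)

all_tests = (
    SentimentDirectionalityTests.intensifier_tests()
    + SentimentDirectionalityTests.negation_tests()
    + SentimentDirectionalityTests.comparative_tests()
)

dir_results = dir_tester.run_suite(all_tests)
print(dir_tester.generate_report(dir_results))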

19.2.4. Minimum Functionality Tests (MFTs): The Unit Tests of ML

Minimum Functionality Tests are the closest ML equivalent to traditional software unit tests. An MFT is a simple, targeted test case designed to check a very specific, atomic capability of the model.

The MFT Philosophy

graph TB
    A[Define Model Capabilities] --> B[Create Simple Test Cases]
    B --> C[Each Capability: 10-50 tests]
    C --> D[100% Expected Pass Rate]
    D --> E{All Pass?}
    E -->|Yes| F[Deploy]
    E -->|No| G[Debug Specific Capability]

Building an MFT Suite

# mft_suite.py - Minimum Functionality Test Framework

from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class MFTCapability:
    """A capability that the model should possess."""
    name: str
    description: str
    test_cases: List[Dict[str, Any]] = field(default_factory=list)
    expected_pass_rate: float = 1.0  # MFTs should have 100% pass rate


class SentimentMFTSuite:
    """Complete MFT suite for sentiment analysis."""
    
    @staticmethod
    def basic_positive() -> MFTCapability:
        """Model should correctly identify obviously positive text."""
        return MFTCapability(
            name="basic_positive",
            description="Identify clearly positive sentiment",
            test_cases=[
                {"input": "I love this!", "expected": "POSITIVE"},
                {"input": "This is amazing.", "expected": "POSITIVE"},
                {"input": "Absolutely wonderful experience.", "expected": "POSITIVE"},
                {"input": "Best purchase ever.", "expected": "POSITIVE"},
                {"input": "Highly recommend!", "expected": "POSITIVE"},
                {"input": "5 stars, perfect.", "expected": "POSITIVE"},
                {"input": "Exceeded all expectations.", "expected": "POSITIVE"},
                {"input": "Couldn't be happier.", "expected": "POSITIVE"},
                {"input": "Made my day!", "expected": "POSITIVE"},
                {"input": "A true masterpiece.", "expected": "POSITIVE"},
            ]
        )
    
    @staticmethod
    def basic_negative() -> MFTCapability:
        """Model should correctly identify obviously negative text."""
        return MFTCapability(
            name="basic_negative",
            description="Identify clearly negative sentiment",
            test_cases=[
                {"input": "I hate this!", "expected": "NEGATIVE"},
                {"input": "This is terrible.", "expected": "NEGATIVE"},
                {"input": "Absolutely awful experience.", "expected": "NEGATIVE"},
                {"input": "Worst purchase ever.", "expected": "NEGATIVE"},
                {"input": "Do not recommend.", "expected": "NEGATIVE"},
                {"input": "0 stars, horrible.", "expected": "NEGATIVE"},
                {"input": "Complete disappointment.", "expected": "NEGATIVE"},
                {"input": "Total waste of money.", "expected": "NEGATIVE"},
                {"input": "Ruined my day.", "expected": "NEGATIVE"},
                {"input": "An utter failure.", "expected": "NEGATIVE"},
            ]
        )
    
    @staticmethod
    def negation_handling() -> MFTCapability:
        """Model should correctly handle negation."""
        return MFTCapability(
            name="negation_handling",
            description="Correctly interpret negated sentiment",
            test_cases=[
                {"input": "This is not good.", "expected": "NEGATIVE"},
                {"input": "Not a bad product.", "expected": "POSITIVE"},
                {"input": "I don't like this.", "expected": "NEGATIVE"},
                {"input": "Not recommended at all.", "expected": "NEGATIVE"},
                {"input": "Nothing special about it.", "expected": "NEGATIVE"},
                {"input": "Can't complain.", "expected": "POSITIVE"},
                {"input": "Not happy with the service.", "expected": "NEGATIVE"},
                {"input": "This isn't what I expected.", "expected": "NEGATIVE"},
            ]
        )
    
    @staticmethod
    def neutral_detection() -> MFTCapability:
        """Model should correctly identify neutral/factual text."""
        return MFTCapability(
            name="neutral_detection",
            description="Identify neutral or factual statements",
            test_cases=[
                {"input": "The product is blue.", "expected": "NEUTRAL"},
                {"input": "It weighs 5 pounds.", "expected": "NEUTRAL"},
                {"input": "Ships from California.", "expected": "NEUTRAL"},
                {"input": "Made of plastic.", "expected": "NEUTRAL"},
                {"input": "Available in three sizes.", "expected": "NEUTRAL"},
                {"input": "Contains 50 pieces.", "expected": "NEUTRAL"},
            ]
        )
    
    @staticmethod
    def sarcasm_detection() -> MFTCapability:
        """Model should handle common sarcastic patterns."""
        return MFTCapability(
            name="sarcasm_detection",
            description="Correctly interpret sarcastic text",
            test_cases=[
                {"input": "Oh great, another delay.", "expected": "NEGATIVE"},
                {"input": "Yeah, because that's exactly what I needed.", "expected": "NEGATIVE"},
                {"input": "Just what I always wanted, more problems.", "expected": "NEGATIVE"},
                {"input": "Wow, thanks for nothing.", "expected": "NEGATIVE"},
            ],
            expected_pass_rate=0.75  # Sarcasm is hard
        )


class MFTRunner:
    """Run MFT suites and generate reports."""
    
    def __init__(
        self,
        model: Any,
        predict_label_fn: Callable[[Any, str], str]
    ):
        self.model = model
        self.predict_label_fn = predict_label_fn
        self.results = {}
    
    def run_capability(
        self,
        capability: MFTCapability
    ) -> Dict:
        """Run tests for a single capability."""
        
        results = []
        
        for test_case in capability.test_cases:
            predicted = self.predict_label_fn(self.model, test_case["input"])
            passed = predicted == test_case["expected"]
            
            results.append({
                "input": test_case["input"],
                "expected": test_case["expected"],
                "predicted": predicted,
                "passed": passed
            })
        
        num_passed = sum(1 for r in results if r["passed"])
        pass_rate = num_passed / len(results) if results else 0
        
        return {
            "capability": capability.name,
            "description": capability.description,
            "total": len(results),
            "passed": num_passed,
            "pass_rate": pass_rate,
            "meets_threshold": pass_rate >= capability.expected_pass_rate,
            "required_pass_rate": capability.expected_pass_rate,
            "details": results
        }
    
    def run_suite(
        self,
        capabilities: List[MFTCapability]
    ) -> Dict:
        """Run complete MFT suite."""
        
        capability_results = {}
        
        for cap in capabilities:
            capability_results[cap.name] = self.run_capability(cap)
        
        # Aggregate
        all_meet_threshold = all(
            r["meets_threshold"] for r in capability_results.values()
        )
        
        return {
            "overall_pass": all_meet_threshold,
            "capabilities": capability_results
        }
    
    def generate_pytest_file(
        self,
        capabilities: List[MFTCapability],
        output_path: str
    ):
        """Generate pytest test file from MFT suite."""
        
        code = '''
import pytest

# Auto-generated MFT test file

@pytest.fixture(scope="session")
def model():
    # Load your model here
    from my_model import load_model
    return load_model()

'''
        
        for cap in capabilities:
            test_cases = [
                (tc["input"], tc["expected"]) 
                for tc in cap.test_cases
            ]
            
            code += f'''
# {cap.description}
@pytest.mark.parametrize("text,expected", {test_cases})
def test_{cap.name}(model, text, expected):
    predicted = model.predict_label(text)
    assert predicted == expected, f"Input: {{text}}, Expected: {{expected}}, Got: {{predicted}}"

'''
        
        with open(output_path, 'w') as f:
            f.write(code)
        
        return output_path
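
A usage sketch; the label stub is a hypothetical placeholder and the output path is illustrative.

# Example usage (illustrative)
def example_label(model, text: str) -> str:
    return "POSITIVE"  # placeholder; return "POSITIVE", "NEGATIVE", or "NEUTRAL"


runner = MFTRunner(model=None, predict_label_fn=example_label)

suite = [
    SentimentMFTSuite.basic_positive(),
    SentimentMFTSuite.basic_negative(),
    SentimentMFTSuite.negation_handling(),
    SentimentMFTSuite.neutral_detection(),
    SentimentMFTSuite.sarcasm_detection(),
]

summary = runner.run_suite(suite)
print("Overall pass:", summary["overall_pass"])

# Or emit a pytest file for the CI pipeline in the next section:
runner.generate_pytest_file(suite, "tests/behavioral/test_mft.py")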

19.2.5. Integrating Behavioral Tests into CI/CD

Pipeline Integration

# .github/workflows/ml-behavioral-tests.yml

name: ML Behavioral Tests

on:
  push:
    paths:
      - 'models/**'
      - 'tests/behavioral/**'
  pull_request:
    paths:
      - 'models/**'

jobs:
  behavioral-tests:
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      
      - name: Install Dependencies
        run: |
          pip install -r requirements-test.txt
      
      - name: Download Model Artifact
        run: |
          # Download from model registry
          mlflow artifacts download -r ${{ secrets.MLFLOW_RUN_ID }} -d ./model
      
      - name: Run Invariance Tests
        run: |
          pytest tests/behavioral/test_invariance.py \
            --junitxml=reports/invariance.xml \
            --html=reports/invariance.html
      
      - name: Run Directionality Tests
        run: |
          pytest tests/behavioral/test_directionality.py \
            --junitxml=reports/directionality.xml
      
      - name: Run MFT Suite
        run: |
          pytest tests/behavioral/test_mft.py \
            --junitxml=reports/mft.xml
      
      - name: Generate Combined Report
        run: |
          python scripts/combine_behavioral_reports.py \
            --output reports/behavioral_summary.json
      
      - name: Check Quality Gates
        run: |
          python scripts/check_quality_gates.py \
            --config quality_gates.yaml \
            --results reports/behavioral_summary.json
      
      - name: Upload Reports
        uses: actions/upload-artifact@v4
        with:
          name: behavioral-test-reports
          path: reports/
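
The workflow calls `scripts/combine_behavioral_reports.py`, which is not shown in this chapter. A minimal sketch under stated assumptions: one JUnit XML file per test category under `reports/`, and an output layout of `"<category>.<test_name>" -> {"pass_rate": ...}` (both are assumptions, not the chapter's prescribed format).

# scripts/combine_behavioral_reports.py - minimal sketch (assumed layout, see above)

import argparse
import json
import re
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def summarize(report_dir: str) -> dict:
    """Aggregate per-test pass rates from JUnit XML files, keyed by category.name."""
    counts = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for xml_path in Path(report_dir).glob("*.xml"):
        category = xml_path.stem  # e.g. "invariance", "directionality", "mft"
        for case in ET.parse(str(xml_path)).getroot().iter("testcase"):
            # "test_basic_positive[...]" -> "basic_positive" (assumed naming convention)
            name = re.sub(r"^test_", "", case.attrib["name"].split("[")[0])
            failed = case.find("failure") is not None or case.find("error") is not None
            key = f"{category}.{name}"
            counts[key][1] += 1
            counts[key][0] += 0 if failed else 1
    return {
        key: {"passed": p, "total": t, "pass_rate": p / t if t else 0.0}
        for key, (p, t) in counts.items()
    }


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--reports-dir", default="reports")
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    summary = summarize(args.reports_dir)
    Path(args.output).write_text(json.dumps(summary, indent=2))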

Quality Gate Configuration

# quality_gates.yaml

behavioral_tests:
  invariance:
    entity_swap:
      min_pass_rate: 0.98
      blocking: true
    typo_introduction:
      min_pass_rate: 0.90
      blocking: false
    case_change:
      min_pass_rate: 0.98
      blocking: true
    fairness_gender:
      min_pass_rate: 1.0
      blocking: true  # Zero tolerance
    fairness_race:
      min_pass_rate: 1.0
      blocking: true  # Zero tolerance
  
  directionality:
    negation:
      min_pass_rate: 0.85
      blocking: true
    intensifier:
      min_pass_rate: 0.90
      blocking: true
  
  mft:
    basic_positive:
      min_pass_rate: 1.0
      blocking: true
    basic_negative:
      min_pass_rate: 1.0
      blocking: true
    negation_handling:
      min_pass_rate: 0.85
      blocking: true
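
The workflow's final gate step calls `scripts/check_quality_gates.py`, which is also not shown. A minimal sketch, assuming PyYAML is installed and the summary JSON layout produced by the combine sketch above; it exits non-zero only when a blocking gate fails.

# scripts/check_quality_gates.py - minimal sketch (assumes the summary layout above)

import argparse
import json
import sys

import yaml  # PyYAML, assumed to be in requirements-test.txt


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    parser.add_argument("--results", required=True)
    args = parser.parse_args()

    with open(args.config) as f:
        gates = yaml.safe_load(f)["behavioral_tests"]
    with open(args.results) as f:
        results = json.load(f)

    blocking_failures = []
    for category, tests in gates.items():
        for test_name, gate in tests.items():
            key = f"{category}.{test_name}"
            pass_rate = results.get(key, {}).get("pass_rate", 0.0)
            if pass_rate < gate["min_pass_rate"]:
                msg = f"{key}: {pass_rate:.1%} < {gate['min_pass_rate']:.1%}"
                if gate.get("blocking", False):
                    blocking_failures.append(msg)
                else:
                    print(f"WARNING (non-blocking): {msg}")

    for msg in blocking_failures:
        print(f"BLOCKING FAILURE: {msg}")
    return 1 if blocking_failures else 0


if __name__ == "__main__":
    sys.exit(main())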

19.2.6. Summary Checklist

| Test Type | Purpose | Expected Pass Rate | Blocking? |
|:----------|:--------|:-------------------|:----------|
| MFT - Basic | Sanity checks | 100% | ✅ Yes |
| MFT - Complex | Advanced capabilities | 85%+ | ⚠️ Depends |
| Invariance - Neutral | Filler words, typos | 95%+ | ✅ Yes |
| Invariance - Fairness | Protected attributes | 100% | ✅ Yes |
| Directionality - Core | Negation, intensifiers | 85%+ | ✅ Yes |

Behavioral testing fundamentally changes how we think about model evaluation. It forces us to move beyond a myopic focus on a single accuracy number and instead adopt a more holistic, capability-oriented view of model quality.

[End of Section 19.2]