32.3. PII Redaction: The First Line of Defense

Warning

Zero Trust Data: Assume all uncontrolled text data contains Personally Identifiable Information (PII) until proven otherwise. Training a Large Language Model (LLM) on unredacted customer support logs is the fastest way to leak private data and incur GDPR/CCPA fines.

Machine Learning models, especially LLMs, have a nasty habit of memorizing their training data. If you train on a dataset containing "My name is Alice and my SSN is 123-45...", the model might faithfully autocomplete that sequence for a stranger.

PII Redaction is not just compliance; it is a security control. We must sanitize data before it enters the training environment (the Feature Store or Data Lake).

32.3.1. The Taxonomy of De-Identification

Privacy is not binary. There are levels of sanitization, each with a utility trade-off.

| Technique | Method | Example Input | Example Output | Pros | Cons |
| --- | --- | --- | --- | --- | --- |
| Redaction | Masking | "Call Alice at 555-0199" | "Call [NAME] at [PHONE]" | 100% secure. | Destroys semantic context for the model. |
| Anonymization | Generalization | "Age: 24, Zip: 90210" | "Age: 20-30, Zip: 902xx" | Statistically useful (k-anonymity). | Can be prone to re-identification attacks. |
| Pseudonymization | Tokenization | "User: Alice" | "User: user_8f9a2b" | Preserves relationships (Alice is always user_8f9a2b). | Requires a secure lookup table (the "linkability" risk). |
| Synthetic Replacement | Faking | "Alice lives in NY" | "Jane lives in Seattle" | Preserves full semantic structure. | Difficult to do consistently without breaking context. |
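To make pseudonymization concrete, here is a minimal sketch that derives a stable surrogate token from a name with a keyed hash (HMAC); the key handling and token format are illustrative assumptions, not a standard. Unlike a lookup table, a keyed hash cannot be reversed directly, but the linkability risk simply moves to the key.

import hmac
import hashlib

# Illustrative only: in production the key would live in a KMS and be rotated
PSEUDONYM_KEY = b"rotate-me-regularly"

def pseudonymize(value: str) -> str:
    """Map a PII value to a stable surrogate (same input always yields the same token)."""
    digest = hmac.new(PSEUDONYM_KEY, value.strip().lower().encode(), hashlib.sha256).hexdigest()
    return f"user_{digest[:8]}"

print(pseudonymize("Alice"))  # always the same user_xxxxxxxx token, so joins keep working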

32.3.2. Microsoft Presidio (Open Source)

Microsoft Presidio is the industry standard open-source library for PII detection and redaction. It uses a combination of Named Entity Recognition (NER) models and Regex logic.

Architecture

  1. Analyzer: Detects PII entities (CREDIT_CARD, PERSON, PHONE_NUMBER).
  2. Anonymizer: Replaces detected entities with desired operators (mask, replace, hash).

Implementation: The PIIStripper Class

Here is a production-hardened Python class for integrating Presidio into your ETL pipelines (e.g., PySpark or Ray).

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

class PIIStripper:
    def __init__(self):
        # Initialize engines once (expensive operation loading NLP models)
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()
        
    def sanitize_text(self, text: str, mode: str = "mask") -> str:
        """
        Sanitizes text by removing PII.
        modes:
          - mask: Replaces PII with <ENTITY_TYPE>
          - hash: Replaces PII with a hash (for consistent linkage)
        """
        if not text:
            return ""

        # 1. Analyze (Detect)
        results = self.analyzer.analyze(
            text=text,
            entities=["PHONE_NUMBER", "CREDIT_CARD", "EMAIL_ADDRESS", "PERSON", "US_SSN"],
            language='en'
        )

        # 2. Define Operators based on mode
        operators = {}
        if mode == "mask":
            # Replace with <ENTITY>
            for entity in ["PHONE_NUMBER", "CREDIT_CARD", "EMAIL_ADDRESS", "PERSON", "US_SSN"]:
                operators[entity] = OperatorConfig("replace", {"new_value": f"<{entity}>"})
        elif mode == "hash":
            # Hash implementation (custom lambda usually required or specialized operator)
            # Presidio supports custom operators, omitted for brevity
            pass

        # 3. Anonymize (Redact)
        anonymized_result = self.anonymizer.anonymize(
            text=text,
            analyzer_results=results,
            operators=operators
        )

        return anonymized_result.text

# Usage
stripper = PIIStripper()
raw_log = "Error: Payment failed for user John Doe (CC: 4532-xxxx-xxxx-1234) at 555-1234."
clean_log = stripper.sanitize_text(raw_log)
print(clean_log)
# Output: "Error: Payment failed for user <PERSON> (CC: <CREDIT_CARD>) at <PHONE_NUMBER>."

Scaling with Spark

Presidio is Python-based and can be slow. To run it at petabyte scale (a PySpark sketch follows this list):

  1. Broadcast the AnalyzerEngine model weights (~500MB) to all executors.
  2. Use mapPartitions to instantiate the engine once per partition, not per row.
  3. Use Pandas UDFs (Arrow) for vectorization where possible.
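A minimal PySpark sketch of point 2, assuming the PIIStripper class above is importable on the executors; the bucket paths and column names are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pii-redaction").getOrCreate()
df = spark.read.parquet("s3://landing-bucket/support-logs/")  # placeholder path

def sanitize_partition(rows):
    # Instantiate once per partition: loading the NLP models per row would dominate runtime
    stripper = PIIStripper()
    for row in rows:
        yield (row["id"], stripper.sanitize_text(row["text"]))

clean_df = df.rdd.mapPartitions(sanitize_partition).toDF(["id", "text"])
clean_df.write.mode("overwrite").parquet("s3://silver-bucket/support-logs/")  # placeholder path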

32.3.3. Cloud Native Solutions

If you don’t want to manage NLP models, use the cloud APIs. They are more accurate but cost money per character.

1. Google Cloud Data Loss Prevention (DLP)

Cloud DLP is extremely powerful because it integrates directly with BigQuery and Google Cloud Storage.

Inspection Job (Terraform): You can set up a “Trigger” that automatically scans new files in a bucket.

resource "google_data_loss_prevention_job_trigger" "scan_training_data" {
  parent = "projects/my-project"
  description = "Scan incoming CSVs for PII"
  
  triggers {
    schedule {
      recurrence_period_duration = "86400s" # Daily
    }
  }
  
  inspect_job {
    storage_config {
      cloud_storage_options {
        file_set {
          url = "gs://my-training-data-landing/"
        }
      }
    }
    
    inspect_config {
      info_types { name = "EMAIL_ADDRESS" }
      info_types { name = "CREDIT_CARD_NUMBER" }
      info_types { name = "US_SOCIAL_SECURITY_NUMBER" }
      min_likelihood = "LIKELY"
    }
    
    actions {
      save_findings {
        output_config {
          table {
            project_id = "my-project"
            dataset_id = "compliance_logs"
            table_id   = "dlp_findings"
          }
        }
      }
    }
  }
}

De-identification Template: GCP allows you to define a “Template” that transforms data. You can apply this when moving data from landing to clean.
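For reference, here is a hedged Python sketch of the same transformation using the DLP client library directly (replace-with-infoType, applied inline rather than through a stored template); the project ID and infoTypes are placeholders.

from google.cloud import dlp_v2

def deidentify_text(project_id: str, text: str) -> str:
    client = dlp_v2.DlpServiceClient()
    response = client.deidentify_content(
        request={
            "parent": f"projects/{project_id}",
            "inspect_config": {
                "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "US_SOCIAL_SECURITY_NUMBER"}],
            },
            # Replace every finding with its infoType name, e.g. [EMAIL_ADDRESS]
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {"primitive_transformation": {"replace_with_info_type_config": {}}}
                    ]
                }
            },
            "item": {"value": text},
        }
    )
    return response.item.value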

2. AWS Macie vs. Glue DataBrew

  • Amazon Macie: Primarily for S3 security (finding buckets that contain PII). It scans and alerts but doesn’t natively rewrite files to redact them on the fly.
  • AWS Glue DataBrew: A visual data-prep tool with built-in PII redaction transformations.
  • AWS Comprehend: Detects PII entities in text documents, which you can then redact yourself (see the boto3 sketch below).
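A minimal boto3 sketch of the Comprehend route; the region and the <TYPE> redaction format are assumptions, and long documents would need chunking to stay within the API's size limit.

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # placeholder region

def redact_with_comprehend(text: str) -> str:
    findings = comprehend.detect_pii_entities(Text=text, LanguageCode="en")
    # Rewrite from the end of the string so earlier offsets stay valid
    for entity in sorted(findings["Entities"], key=lambda e: e["BeginOffset"], reverse=True):
        text = text[:entity["BeginOffset"]] + f"<{entity['Type']}>" + text[entity["EndOffset"]:]
    return text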

32.3.4. Handling “Quasi-Identifiers” (The Linkage Attack)

Redacting obviously private fields (Name, SSN) is easy. The hard part is Quasi-Identifiers.

  • Example: {Zip Code, Gender, Date of Birth}.
  • Fact: 87% of the US population can be uniquely identified by just these three fields.

k-Anonymity: A dataset satisfies k-anonymity if every record is indistinguishable from at least $k-1$ other records on its quasi-identifiers. To achieve this in MLOps (a minimal check is sketched after this list):

  1. Generalize: Convert exact Age (34) to Age Range (30-40).
  2. Suppress: Drop the Zip Code entirely or keep only the first 3 digits.
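A minimal pandas sketch of that check, assuming the quasi-identifiers have already been generalized; the column names and data are toy values.

import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Smallest equivalence-class size over the quasi-identifier columns."""
    return int(df.groupby(quasi_identifiers).size().min())

df = pd.DataFrame({
    "age_range": ["30-40", "30-40", "20-30", "30-40"],
    "zip3":      ["902",   "902",   "941",   "902"],
    "label":     [1, 0, 1, 1],
})
print(k_anonymity(df, ["age_range", "zip3"]))  # 1: the lone 20-30/941 row is still unique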

32.3.5. LLM-Specific Challenges: The “Context” Problem

In RAG (Retrieval Augmented Generation), you have a new problem. You might retrieve a document that is safe in isolation, but when combined with the user’s prompt, reveals PII.

The “Canary” Token Strategy: Inject fake PII (Canary tokens) into your training data and vector database.

  • Store Alice's SSN is 000-00-0000 (Fake).
  • Monitor your LLM outputs. If it ever outputs 000-00-0000, you know your model is regurgitating training data verbatim and you have a leakage problem (a minimal output monitor is sketched below).
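A minimal sketch of that output monitor; the canary values and the alerting hook are assumptions.

import re

# Fake PII deliberately planted in the training corpus / vector store (toy values)
CANARY_PATTERNS = [
    re.compile(r"\b000-00-0000\b"),             # planted SSN
    re.compile(r"canary\.alice@example\.com"),  # planted email
]

def leaked_canary(model_output: str) -> bool:
    """True if the model regurgitated a planted canary verbatim."""
    return any(p.search(model_output) for p in CANARY_PATTERNS)

# In production: run this over sampled responses and page the on-call on a hit
if leaked_canary("Sure! The SSN on file is 000-00-0000."):
    print("ALERT: training-data leakage detected")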

32.3.6. Summary for Engineers

  1. Automate detection: Use Presidio (Code) or Cloud DLP (Infra) to scan every dataset before it touches the Feature Store.
  2. Separate Bronze/Silver/Gold:
    • Bronze: Raw data (Locked down, strictly limited access).
    • Silver: Redacted data (Available to Data Scientists).
    • Gold: Aggregated features (High performance).
  3. Audit the Redactor: The redaction model itself is an ML model. It has False Negatives. You must periodically human-review a sample of “Redacted” data to ensure it isn’t leaking.


32.3.7. Deep Dive: Format Preserving Encryption (FPE)

Sometimes “masking” (<PHONE>) breaks your application validation logic. If your downstream system expects a 10-digit number and gets a string <PHONE>, it crashes. Format Preserving Encryption (FPE) encrypts data while keeping the original format (e.g., a credit card number is encrypted into another valid-looking credit card number).

Algorithm: FF3-1 (NIST recommended, specified in SP 800-38G Rev. 1).

Python illustration (using pyffx; note that pyffx implements the earlier generic FFX construction, so treat this as a sketch rather than an FF3-1-compliant implementation):

import pyffx

# The key must be kept in a secure KMS
secret_key = b'secret-key-12345' 

def encrypt_ssn(ssn: str) -> str:
    # SSN Format: 9 digits. 
    # We encrypt the digits only, preserving hyphens if needed by app logic
    digits = ssn.replace("-", "")
    
    e = pyffx.Integer(secret_key, length=9)
    encrypted_int = e.encrypt(int(digits))
    
    # Pad back to 9 chars
    encrypted_str = str(encrypted_int).zfill(9)
    
    # Re-assemble
    return f"{encrypted_str[:3]}-{encrypted_str[3:5]}-{encrypted_str[5:]}"

def decrypt_ssn(encrypted_ssn: str) -> str:
    digits = encrypted_ssn.replace("-", "")
    e = pyffx.Integer(secret_key, length=9)
    decrypted_int = e.decrypt(int(digits))
    decrypted_str = str(decrypted_int).zfill(9)
    return f"{decrypted_str[:3]}-{decrypted_str[3:5]}-{decrypted_str[5:]}"

# Usage
original = "123-45-6789"
masked = encrypt_ssn(original)
print(f"Original: {original} -> Masked: {masked}")
# Example output: Original: 123-45-6789 -> Masked: 982-11-4321 (exact digits depend on the key)
# The masked output looks like a real SSN but cannot be reversed without the key.

Use Case: This is perfect for “Silver” datasets used by Data Scientists who need to join tables on SSN but strictly do not need to know the real SSN.

32.3.8. Differential Privacy (DP) for Training

Redaction protects individual fields. Differential Privacy (DP) protects the statistical influence of an individual on the model weights: if Alice is in the training set, the trained model should be statistically almost indistinguishable (up to the privacy budget ε) from the model trained without her.

Technique: DP-SGD (Differentially Private Stochastic Gradient Descent). It adds noise to the gradients during backpropagation.

Implementation with Opacus (PyTorch):

import torch
from opacus import PrivacyEngine

# Standard PyTorch Model
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = torch.nn.CrossEntropyLoss()  # assuming a classification loss for this sketch
data_loader = ...  # Your sensitive data (a torch.utils.data.DataLoader)

# Wrap with PrivacyEngine
privacy_engine = PrivacyEngine()

model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.1, # The amount of noise (sigma)
    max_grad_norm=1.0,    # Clipping gradients
)

# Train loop (UNCHANGED)
for x, y in data_loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# Check privacy budget
epsilon = privacy_engine.get_epsilon(delta=1e-5)
print(f"Privacy Guarantee: (ε = {epsilon:.2f}, δ = 1e-5)")

  • Trade-off: DP models always have lower accuracy (utility) than non-DP models. The noise hurts convergence. You must graph the "Privacy-Utility Frontier" for your stakeholders.

32.3.9. The “Right to be Forgotten” Architecture (GDPR Article 17)

If a user says “Delete me,” you must delete them from:

  1. The Database (Easy).
  2. The Data Lake Backups (Hard).
  3. The Machine Learning Model (Impossible?).

The Machine Unlearning Problem: You cannot easily “delete” a user from a Neural Network’s weights. Current State of the Art Solution: SISA (Sharded, Isolated, Sliced, Aggregated) Training.

Process:

  1. Shard: Split your training data into 10 independent shards ($S_1 … S_{10}$).
  2. Train: Train 10 separate “Constituent Models” ($M_1 … M_{10}$).
  3. Serve: Aggregated prediction (Voting) of $M_1…M_{10}$.
  4. Delete: When Alice (who is in Shard $S_3$) requests deletion:
    • You remove Alice from $S_3$.
    • You Retrain only $M_3$ (1/10th of the cost).
    • $M_1, M_2, M_4…$ are untouched.

This reduces the retraining cost by 10x, making "compliance retraining" economically feasible. A minimal sketch of the pattern follows the diagram below.

graph TD
    Data[Full Dataset] --> S1[Shard 1]
    Data --> S2[Shard 2]
    Data --> S3[Shard 3]
    
    S1 --> M1[Model 1]
    S2 --> M2[Model 2]
    S3 --> M3[Model 3]
    
    M1 --> Vote{Voting Mechanism}
    M2 --> Vote
    M3 --> Vote
    Vote --> Ans[Prediction]
    
    User[Alice requests delete] -->|Located in Shard 3| S3
    S3 -->|Retrain| M3

32.3.10. Handling Unstructured Audio/Image PII

Redacting text is largely a solved problem; redacting audio is harder. If a user says "My name is Alice" in a customer-service call recording, you must beep out or silence that span.

Architecture: Use OpenAI Whisper (for transcription) + Presidio (for detection) + FFmpeg via pydub (for silencing).

import whisper
from presidio_analyzer import AnalyzerEngine
from pydub import AudioSegment

def redact_audio(audio_path: str, output_path: str = "redacted.wav"):
    # 1. Transcribe with timestamps (expensive: load models once in real pipelines)
    model = whisper.load_model("base")
    result = model.transcribe(audio_path, word_timestamps=True)

    analyzer = AnalyzerEngine()
    audio = AudioSegment.from_wav(audio_path)

    # 2. Silence every segment whose transcript contains a person name
    for segment in result['segments']:
        findings = analyzer.analyze(text=segment['text'], entities=["PERSON"], language='en')
        if findings:
            start_ms = int(segment['start'] * 1000)
            end_ms = int(segment['end'] * 1000)

            # Cut the span and splice in silence (overlaying silence would only mix, not mute)
            silence = AudioSegment.silent(duration=end_ms - start_ms)
            audio = audio[:start_ms] + silence + audio[end_ms:]

    audio.export(output_path, format="wav")

32.3.11. Secure Multi-Party Computation (SMPC)

What if two banks want to train a fraud model together, but cannot share customer data? SMPC allows computing a function $f(x, y)$ where Party A holds $x$, Party B holds $y$, and neither learns the other’s input.

PySyft: A Python library for SMPC and Federated Learning. It allows “Remote Data Science.” You send the code to the data owner. The code runs on their machine. Only the result comes back.
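PySyft's API changes frequently, so the library-agnostic sketch below illustrates only the core idea with additive secret sharing over a prime field: each party splits its value into random shares, the shares are combined locally, and only the final sum is ever reconstructed. The values and modulus are toy assumptions.

import random

Q = 2**61 - 1  # prime modulus for the toy field

def share(secret: int, n_parties: int = 2):
    """Split a secret into n additive shares that sum to it mod Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

# Bank A holds x, Bank B holds y; neither reveals its raw value
x_shares = share(secret=1_200)   # Bank A's fraud-loss figure (toy value)
y_shares = share(secret=3_400)   # Bank B's fraud-loss figure (toy value)

# Each party locally adds the shares of x and y it holds
partial_sums = [(xs + ys) % Q for xs, ys in zip(x_shares, y_shares)]

# Only the combined result is reconstructed
total = sum(partial_sums) % Q
print(total)  # 4600, computed without either bank seeing the other's input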

32.3.12. Summary Checklist for Privacy Engineering

  1. Inventory: Do you know where all PII is? (Use Macie/DLP).
  2. Sanitize: Do you strip PII before it hits the Lake? (Use Presidio/FPE).
  3. Minimize: Do you use DP-SGD for sensitive models? (Use Opacus).
  4. Forget: Do you have a SISA architecture or a “Retrain-from-scratch” SLA for GDPR deletion requests?

[End of Section 32.3]