20.2 Fine-Tuning & PEFT: Customizing the Brain
Prompt Engineering (Chapter 20.1) is like hiring a really smart generalist (GPT-4) and giving them a long checklist of instructions every single time you talk to them. Fine-Tuning is like sending that person to Medical School. After training, you don’t need the checklist. They just know how to write the prescription.
In simple terms: Prompt Engineering puts knowledge in the Context. Fine-Tuning puts knowledge in the Weights.
1. When to Fine-Tune?
Do not fine-tune for “Knowledge”. Fine-tune for Form, Style, and Behavior.
- Bad Candidate: “Teach the model who won the Super Bowl last week.” (Use RAG).
- Good Candidate: “Teach the model to speak like a 17th-century Pirate.” (Style).
- Good Candidate: “Teach the model to output valid FHIR JSON for Electronic Health Records.” (Format).
- Good Candidate: “Reduce latency by removing the 50-page Instruction Manual from the prompt.” (Optimization).
1.1. The Cost Argument
Imagine you have a prompt with 2,000 tokens of “Few-Shot Examples” and instructions.
- Prompt approach: You pay for 2,000 input tokens every request.
- Fine-Tuned approach: You bake those 2,000 tokens into the weights. Your prompt becomes 50 tokens.
- Result: 40x cheaper inference and 50% lower latency (less time to process input).
2. The Math of Memory: Why is Full Training Hard?
Why can’t I just run model.fit() on Llama-2-7B on my laptop?
2.1. VRAM Calculation
A 7 Billion parameter model. Each parameter is a float16 (2 bytes).
- Model Weights: $7B \times 2 = 14$ GB.
To serve it (Inference), you need 14 GB. To train it (Backprop), you need much more:
- Gradients: Same size as weights (14 GB).
- Optimizer States: Adam keeps two states per parameter (Momentum and Variance), usually in float32.
- $7B \times 2 \text{ states} \times 4 \text{ bytes} = 56$ GB.
- Activations: The intermediate outputs of every layer (depends on batch size/sequence length). Could be 20-50 GB.
Total: > 100 GB VRAM. Hardware: An A100 (80GB) isn’t enough. You need multi-gpu (H100 Cluster). This is inaccessible for most MLOps teams.
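As a sanity check, the arithmetic above can be captured in a few lines (a rough, illustrative estimate; real usage varies with the optimizer, precision policy, and activation checkpointing):
def full_finetune_vram_gb(params_billion, activations_gb=30.0):
    """Rough VRAM estimate for full fine-tuning with Adam in mixed precision."""
    weights = params_billion * 2          # float16 weights: 2 bytes per parameter
    gradients = params_billion * 2        # float16 gradients
    optimizer = params_billion * 2 * 4    # Adam momentum + variance in float32
    return weights + gradients + optimizer + activations_gb

print(full_finetune_vram_gb(7))  # ~114 GB for a 7B model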
3. PEFT: Parameter-Efficient Fine-Tuning
Enter PEFT. Instead of updating all 7B weights, we freeze them. We stick small “Adapter” modules in between the layers and only train those.
3.1. LoRA (Low-Rank Adaptation)
The Hypothesis (Hu et al., 2021) is that the “change” in weights ($\Delta W$) during adaptation has a Low Rank.
$$ W_{finetuned} = W_{frozen} + \Delta W $$ $$ \Delta W = B \times A $$
Where $W$ is $d \times d$ (Huge). $B$ is $d \times r$ and $A$ is $r \times d$ (Small). $r$ is the Rank (e.g., 8 or 16).
- If $d=4096$ and $r=8$:
- Full Matrix: $4096 \times 4096 \approx 16M$ params.
- LoRA Matrices: $4096 \times 8 + 8 \times 4096 \approx 65k$ params.
- Reduction: 99.6% fewer trainable parameters.
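The same parameter arithmetic as a throwaway helper, with $d$ and $r$ as defined above:
def lora_trainable_params(d=4096, r=8):
    full = d * d                 # full-rank update: d x d
    lora = d * r + r * d         # B (d x r) plus A (r x d)
    return full, lora

full, lora = lora_trainable_params()
print(full, lora, f"{100 * (1 - lora / full):.1f}% fewer")  # 16777216 65536 99.6% fewer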
3.2. QLoRA (Quantized LoRA)
Dettmers et al. (2023) took it further.
- Load the Base Model ($W_{frozen}$) in 4-bit (NF4 format). (14 GB -> 4 GB).
- Keep the LoRA adapters ($A, B$) in float16.
- Backpropagate gradients through the frozen 4-bit weights into the float16 adapters.
Result: You can fine-tune Llama-2-7B on a single 24GB consumer GPU (RTX 4090). This democratized LLMOps.
4. Implementation: The Hugging Face Stack
The stack involves four libraries:
- transformers: The model architecture.
- peft: The LoRA logic.
- bitsandbytes: The 4-bit quantization.
- trl (Transformer Reinforcement Learning): The training loop (SFTTrainer).
4.1. Setup Code
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments
)
from trl import SFTTrainer
# 1. Quantization Config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
)
# 2. Load Base Model (Frozen)
model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto"
)
model.config.use_cache = False # Silence warnings for training
# 3. LoRA Config
peft_config = LoraConfig(
lora_alpha=16,
lora_dropout=0.1,
r=64, # Rank (Higher = smarter but slower)
bias="none",
task_type="CAUSAL_LM",
target_modules=["q_proj", "v_proj"] # Apply to Query/Value layers
)
# 4. Load Dataset
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
# 5. Training Arguments
training_args = TrainingArguments(
output_dir="./results",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
logging_steps=10,
max_steps=500, # Quick demo run
fp16=True,
)
# 6. Tokenizer + Trainer (SFT - Supervised Fine Tuning)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 ships without a pad token (see Section 18.1)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
)
trainer.train()
4.2. Merging Weights
After training, you have adapter_model.bin (100MB).
To serve it efficiently, you merge it back into the base model.
$$ W_{final} = W_{base} + (B \times A) $$
from peft import PeftModel
# Load Base (in fp16, not 4-bit, so the adapter can be merged cleanly)
base_model = AutoModelForCausalLM.from_pretrained(model_name, ...)
# Load Adapter
model = PeftModel.from_pretrained(base_model, "./results/checkpoint-500")
# Merge
model = model.merge_and_unload()
# Save
model.save_pretrained("./final_model")
5. Beyond SFT: RLHF and DPO
Supervised Fine-Tuning (SFT) teaches the model how to talk. RLHF (Reinforcement Learning from Human Feedback) teaches the model what to want (Safety, Helpfulness).
5.1. The RLHF Pipeline (Hard Mode)
- SFT: Train basic model.
- Reward Model (RM): Train a second model to grade answers.
- PPO: Use Proximal Policy Optimization to update the SFT model to maximize the score from the RM.
- Difficulty: Unstable, complex, requires 3 models in memory.
5.2. DPO (Direct Preference Optimization) (Easy Mode)
Rafailov et al. (2023) proved you don’t need a Reward Model or PPO.
You just need a dataset of (chosen, rejected) pairs.
Mathematically, you can optimize the policy directly to increase the probability of chosen and decrease rejected.
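For reference, the loss from the paper is a simple logistic loss over the preference pair, with the frozen SFT model $\pi_{ref}$ acting as an implicit reward model: $$ L_{DPO} = -\log \sigma\left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) $$ Here $y_w$ is the chosen response, $y_l$ the rejected one, and $\beta$ is the same temperature exposed as beta in the trainer below.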
Code Implementation:
from trl import DPOTrainer
# Dataset: { "prompt": "...", "chosen": "Good answer", "rejected": "Bad answer" }
dpo_trainer = DPOTrainer(
model=model,
ref_model=None, # DPO creates a copy of the model implicitly
args=training_args,
beta=0.1, # Temperature for the loss
train_dataset=dpo_dataset,
tokenizer=tokenizer,
)
dpo_trainer.train()
DPO is stable, memory-efficient, and effective. It is the standard for MLOps today.
6. Serving: Multi-LoRA
In traditional MLOps, if you had 5 fine-tuned models, you deployed 5 docker containers. In LLMOps, the base model is 14GB. You cannot afford $14 \times 5 = 70$ GB.
6.1. The Architecture
Since LoRA adapters are tiny (100MB), we can load One Base Model and swap the adapters on the fly per request.
vLLM / LoRAX:
- Request A comes in with model=customer-service. The server computes $x \times W_{base} + x \times A_{cs} \times B_{cs}$.
- Request B comes in with model=sql-generator. The server computes $x \times W_{base} + x \times A_{sql} \times B_{sql}$.
The base weights are shared. This allows serving hundreds of customized models on a single GPU.
6.2. LoRAX Configuration
# lorax.yaml
model_id: meta-llama/Llama-2-7b-chat-hf
port: 8080
adapter_path: s3://my-adapters/
Client call:
client.generate(prompt="...", adapter_id="sql-v1")
7. Data Prep: The Unsung Hero
The quality of Fine-Tuning is 100% dependent on data.
7.1. Chat Template Formatting
LLMs expect a specific string format.
- Llama-2: <s>[INST] {user_msg} [/INST] {bot_msg} </s>
- ChatML: <|im_start|>user\n{msg}<|im_end|>\n<|im_start|>assistant
If you mess this up, the model will output gibberish labels like [/INST] in the final answer.
Use tokenizer.apply_chat_template() to handle this automatically.
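A minimal sketch of what that looks like in practice (the model name is reused from Section 4.1; the printed string is indicative, the exact tokens come from the model's own template):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
messages = [{"role": "user", "content": "Write a haiku about GPUs."}]

# Returns the raw prompt string with the model's own special tokens inserted.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # e.g. "<s>[INST] Write a haiku about GPUs. [/INST]"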
7.2. Cleaning
- Dedup: Remove duplicate rows.
- Filter: Remove short responses (“Yes”, “I don’t know”).
- PII Scrubbing: Remove emails/phones.
8. Summary
Fine-Tuning has graduated from research labs to cost-effective MLOps.
- Use PEFT (LoRA) to train.
- Use bitsandbytes (4-bit) to save memory.
- Use DPO to align (Safety/Preference).
- Use Multi-LoRA serving to deploy.
In the next section, we explore the middle ground between Prompting and Training: Retrieval Augmented Generation (RAG).
9. Data Engineering for Fine-Tuning
The difference between a mediocre model and a great model is almost always the dataset cleaning pipeline.
9.1. Format Standardization (ChatML)
Raw data comes in JSON, CSV, Parquet. You must standardize to a schema. Standard Schema:
{"messages": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."}
]}
Code: The Preprocessing Pipeline
from datasets import load_dataset
def format_sharegpt_to_messages(example):
    # Convert the 'conversations' list to the standard 'messages' schema
    convo = example['conversations']
    new_convo = []
    for turn in convo:
        role = "user" if turn['from'] == "human" else "assistant"
        new_convo.append({"role": role, "content": turn['value']})
    return {"messages": new_convo}
dataset = load_dataset("sharegpt_clean")
dataset = dataset.map(format_sharegpt_to_messages)
9.2. PII Scrubbing with Presidio
Your training data often contains private information, and LLMs memorize their training data. If you fine-tune on customer support logs, the model might later output "My phone number is 555-0199". Use Microsoft Presidio.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
def scrub_pii(text):
    results = analyzer.analyze(text=text, language='en', entities=["PHONE_NUMBER", "EMAIL_ADDRESS"])
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    return anonymized.text
# Apply to dataset
dataset = dataset.map(lambda x: {"content": scrub_pii(x["content"])})
9.3. MinHash Deduplication
Duplicate data causes the model to overfit (memorize) those specific examples. For 100k examples, use MinHash LSH (Locality Sensitive Hashing).
from text_dedup.minhash import MinHashDedup
# Pseudo-code for library usage
deduper = MinHashDedup(threshold=0.9)
dataset = deduper.deduplicate(dataset, column="content")
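If you prefer to wire the deduplication up yourself, a minimal near-duplicate filter can be built with the datasketch library (a sketch: the word-level shingling, the 0.9 threshold, and the assumption that the dataset is a single split exposing a "content" column are all illustrative choices):
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.9, num_perm=128)
kept = []
for i, row in enumerate(dataset):
    m = minhash(row["content"])
    if lsh.query(m):            # a near-duplicate is already indexed -> drop this row
        continue
    lsh.insert(f"doc-{i}", m)
    kept.append(i)

dataset = dataset.select(kept)  # keep only the surviving rows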
10. Scaling Training: When One GPU isn’t Enough
If you move from 7B to 70B models, a single GPU (even A100) will OOM. You need Distributed Training Strategies.
10.1. Distributed Data Parallel (DDP)
- Concept: Replicate the model on every GPU. Split the batch.
- Limit: Model must fit on one GPU (e.g., < 80GB). 70B parameters = 140GB. DDP fails.
10.2. FSDP (Fully Sharded Data Parallel)
- Concept: Shard the model and the optimizer state across GPUs.
- Math: If you have 8 GPUs, each GPU holds 1/8th of the weights. During the forward pass, they communicate (AllGather) to assemble the layer they need, compute, and discard.
- Result: You can train 70B models on 8x A100s.
10.3. DeepSpeed Zero-3
Microsoft’s implementation of sharding.
Usually configured via accelerate config.
ds_config.json:
{
"fp16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
}
},
"train_batch_size": "auto"
}
- ZeRO Stage 1: Shard Optimizer State (4x memory saving).
- ZeRO Stage 2: Shard Gradients (8x memory saving).
- ZeRO Stage 3: Shard Weights (Linear memory saving with # GPUs).
- Offload: Push weights to System RAM (CPU) when not in use. Slow, but allows training massive models on small clusters.
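To wire this config into the Hugging Face Trainer stack, one common pattern (a sketch; the output directory and batch numbers are placeholders) is to point TrainingArguments at the JSON and launch with the deepspeed launcher:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results-70b",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
    deepspeed="ds_config.json",   # picks up the ZeRO-3 + offload settings above
)

# Launch across 8 GPUs:  deepspeed --num_gpus=8 train.py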
11. Advanced LoRA Architectures
LoRA is not static. Research moves fast.
11.1. LongLoRA (Context Extension)
Llama-2 context is 4096. What if you need 32k? Fine-tuning with full attention is $O(N^2)$. LongLoRA introduces “Shifted Sparse Attention” (efficient approximation) during fine-tuning.
- Use Case: Summarizing legal contracts.
11.2. DoRA (Weight-Decomposed)
- Paper: Liu et al. (2024).
- Concept: Decompose weight into Magnitude ($m$) and Direction ($V$).
- Result: Outperforms standard LoRA, approaching Full Fine-Tuning performance.
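Recent peft releases expose DoRA as a flag on the existing LoraConfig (a sketch; requires a peft version that supports use_dora):
from peft import LoraConfig

dora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,                       # decompose the update into magnitude and direction
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)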
11.3. GaLore / Q-GaLore
- Concept: Gradients also have a Low-Rank structure. Project them into a low-rank subspace before feeding the optimizer, shrinking the optimizer state.
- Result: GaLore can train 7B models on a 24GB consumer GPU without quantizing the weights; Q-GaLore pushes memory lower still by adding quantization.
12. Hands-On Lab: Training a SQL Coder
Let’s build a model that converts English to PostgreSQL.
Goal: Fine-tune CodeLlama-7b on the spider dataset.
Step 1: Data Formatting
The Spider dataset has question and query.
We format into an Instruction Prompt:
[INST] You are a SQL Expert. Convert this question to SQL.
Schema: {schema}
Question: {question} [/INST]
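A formatting helper might look like the following (a sketch: Spider rows carry question and query, but the serialized schema string is assumed to be prepared separately, e.g. from the dataset's tables file):
from datasets import load_dataset

PROMPT = (
    "[INST] You are a SQL Expert. Convert this question to SQL.\n"
    "Schema: {schema}\n"
    "Question: {question} [/INST] {query}"
)

def to_text(example):
    return {"text": PROMPT.format(
        schema=example.get("schema", ""),   # assumed pre-serialized schema string
        question=example["question"],
        query=example["query"],
    )}

spider = load_dataset("spider", split="train")
spider = spider.map(to_text)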
Step 2: Packing
We concatenate examples into chunks of 4096 tokens using ConstantLengthDataset from trl.
Why? To minimize padding. If Example A is 100 tokens and Example B is 2000, we waste computation padding A. Packing puts them in the same buffer, separated by an EOS token.
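With recent trl versions the simplest route is the packing flag on SFTTrainer, which wraps ConstantLengthDataset internally (a sketch, assuming model, peft_config, tokenizer, and training_args were built as in Section 4.1, with CodeLlama-7b as the base):
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=spider,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=4096,
    packing=True,               # concatenate examples into fixed-length buffers
    tokenizer=tokenizer,
    args=training_args,
)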
Step 3: Training
We launch SFTTrainer with neftune_noise_alpha=5.
NEFTune: Adding noise to embeddings during fine-tuning improves generalization.
Step 4: Evaluation
We cannot use “Exact Match” because SQL is flexible (SELECT * vs SELECT a,b).
We use Execution Accuracy.
- Spin up a Docker Postgres container.
- Run Ground Truth Query -> Result A.
- Run Predicted Query -> Result B.
- If A == B, then Correct.
Execution-based comparison is by far the most reliable way to evaluate code models.
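A minimal execution-accuracy check against the Postgres container might look like this (a sketch; the connection details and the decision to ignore row order are assumptions):
import psycopg2

def execute(sql):
    conn = psycopg2.connect(host="localhost", port=5432,
                            dbname="spider", user="postgres", password="postgres")
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return sorted(cur.fetchall())   # sort rows so ordering differences don't matter
    finally:
        conn.close()

def execution_match(gold_sql, pred_sql):
    try:
        return execute(gold_sql) == execute(pred_sql)
    except Exception:
        return False                        # invalid or failing SQL counts as wrong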
13. Deep Dive: The RLHF Implementation Details
While DPO is the current default thanks to its simplicity, understanding RLHF/PPO is still critical because DPO assumes you already have a preference dataset. Often, you need to build the Reward Model yourself to label data.
13.1. The Reward Model (RM)
The RM is a BERT-style regressor. It reads a (prompt, response) pair and outputs a scalar score (e.g., 4.2).
Loss Function: Pairwise Ranking Loss. $$ L = -\log(\sigma(r(x, y_w) - r(x, y_l))) $$ Where $r$ is the reward model, $y_w$ is the winning response, $y_l$ is the losing response. We want the score of the winner to be higher than the loser.
from trl import RewardTrainer
reward_trainer = RewardTrainer(
model=reward_model,
tokenizer=tokenizer,
train_dataset=dataset, # Columns: input_ids_chosen, input_ids_rejected
)
reward_trainer.train()
13.2. PPO (Proximal Policy Optimization) Step
Once we have the Reward Model, we freeze it. We clone the SFT model into a “Policy Model” (Actor).
The Optimization Loop:
- Rollout: Policy Model generates a response $y$ for prompt $x$.
- Evaluate: Reward Model scores $y \rightarrow R$.
- KL Penalty: We compute the KL Divergence between the Current Policy and the Initial SFT Policy.
$R_{final} = R - \beta \log \left( \frac{\pi_{RL}(y|x)}{\pi_{SFT}(y|x)} \right)$
- Why? To prevent "Reward Hacking" (e.g., the model just spams "Good! Good! Good!" because the RM likes positive sentiment). We force the policy to stay close to the original SFT distribution, i.e., to keep producing coherent English.
- Update: Use PPO gradient update to shift weights.
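In code, this loop maps onto the classic trl PPO API (a condensed sketch for trl versions before the 0.12 rewrite; the reward_fn stub stands in for the trained Reward Model):
import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead, create_reference_model

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # actor + value head
ref_model = create_reference_model(policy)                              # frozen SFT snapshot

ppo_trainer = PPOTrainer(PPOConfig(batch_size=4, mini_batch_size=2, learning_rate=1e-5),
                         policy, ref_model, tokenizer)

def reward_fn(prompt, response):
    return 1.0  # placeholder: call the trained Reward Model here

prompts = ["Explain LoRA in one sentence."] * 4
query_tensors = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

# 1. Rollout
response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, max_new_tokens=64)
responses = tokenizer.batch_decode(response_tensors)

# 2. Evaluate with the frozen RM, then 3+4. KL-penalised PPO update (trl adds the KL term internally)
rewards = [torch.tensor(reward_fn(p, r)) for p, r in zip(prompts, responses)]
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)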
14. Evaluation: The Automated Benchmarks
You fine-tuned your model. Is it smarter? Or did it forget Physics (Catastrophic Forgetting)? You must run an evaluation suite such as EleutherAI's LM Evaluation Harness.
14.1. Key Benchmarks
- MMLU (Massive Multitask Language Understanding): 57 subjects (STEM, Humanities). The standard IQ test.
- GSM8K: Grade School Math. Tests multi-step reasoning.
- HumanEval: Python coding problems.
- HellaSwag: Common sense completion.
14.2. Running the Harness
Install EleutherAI's lm-evaluation-harness:
pip install lm-eval
lm_eval --model hf \
--model_args pretrained=./my-finetuned-model \
--tasks mmlu,gsm8k \
--device cuda:0 \
--batch_size 8
Scale:
- MMLU 25%: Random guessing (4 choices).
- Llama-2-7B: ~45%.
- GPT-4: ~86%.
If your fine-tune drops MMLU from 45% to 30%, you have severe Overfitting/Forgetting.
15. The Cost of Fine-Tuning Estimator
How much budget do you need?
15.1. Compute Requirements (Rule of Thumb)
For QLoRA (4-bit):
- 7B Model: 1x A10G (24GB VRAM). AWS g5.2xlarge.
- 13B Model: 1x A100 (40GB). AWS p4d.24xlarge (shard).
- 70B Model: 4x A100 (80GB). AWS p4de.24xlarge.
15.2. Time (Example)
Llama-2-7B, 10k examples, 3 epochs.
- Token Total: $10,000 \times 1024 \times 3 = 30M$ tokens.
- Training Speed (A10G): ~3000 tokens/sec.
- Time: $10,000$ seconds $\approx$ 3 hours.
- Cost: g5.2xlarge is $1.21/hr.
- Total Cost: $4.00.
Conclusion: Fine-tuning 7B models is trivially cheap. The cost is Data Preparation (Engineer salaries), not Compute.
16. Serving Architecture: Throughput vs Latency
After training, you have deployment choices.
16.1. TGI (Text Generation Inference)
Developed by Hugging Face. Highly optimized Rust backend.
- Continuous Batching.
- PagedAttention (Memory optimization).
- Tensor Parallelism.
docker run --gpus all \
-v $PWD/models:/data \
ghcr.io/huggingface/text-generation-inference:1.0 \
--model-id /data/my-finetuned-model \
--quantize bitsandbytes-nf4
16.2. vLLM
Developed by UC Berkeley.
- Typically 2x faster than TGI.
- Native support for OpenAI API protocol.
python -m vllm.entrypoints.openai.api_server \
    --model ./my-model \
    --enable-lora \
    --lora-modules sql-adapter=./adapters/sql
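The server speaks the OpenAI protocol, so any OpenAI client works against it (a sketch; the port is vLLM's default and the adapter name matches the --lora-modules mapping above):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="sql-adapter",                                   # routes through the SQL LoRA
    prompt="[INST] List all users older than 30 [/INST]",
    max_tokens=128,
)
print(completion.choices[0].text)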
17. Ops Pattern: The “Blue/Green” Foundation Model Update
Fine-tuning pipelines are continuous.
- Data Collection: Collect “Thumbs Up” chat logs from production.
- Nightly Training: Run SFT on the new data + Golden Set.
- Auto-Eval: Run MMLU + Custom Internal Eval.
- Gate: If Score > Baseline, tag v1.2.
- Deploy:
  - Route 1% of traffic to the v1.2 model.
  - Monitor "Acceptance Rate" (User doesn't regenerate).
  - Promote to 100%.
This is LLMOps: Continuous Improvement of the cognitive artifact.
18. Troubleshooting: The Dark Arts of LLM Training
Training LLMs is notoriously brittle. Here are the common failure modes.
18.1. The “Gibberish” Output
- Symptom: Model outputs ######## or repeats "The The The".
- Cause: EOS Token Mismatch.
  - Llama-2 uses token 2 as EOS.
  - Your tokenizer thinks EOS is 0.
  - The model never learns to stop, eventually outputting OOD tokens.
- Fix: Explicitly set tokenizer.pad_token = tokenizer.eos_token (a common hack) or ensure special_tokens_map.json matches the base model configuration perfectly.
18.2. Loss Spikes (The Instability)
- Symptom: Loss goes down nicely to 1.5, then spikes to 8.0 and never recovers.
- Cause: “Bad Batches”. A single example with corrupted text (e.g., a binary file read as text, resulting in a sequence of 4096 random unicode chars) generates massive gradients.
- Fix:
  - Gradient Clipping (max_grad_norm=1.0).
  - Pre-filtering data for Perplexity (remove high-perplexity outliers before training).
18.3. Catastrophic Forgetting
- Symptom: The model writes great SQL (your task) but can no longer speak English or answer “Who are you?”.
- Cause: Over-training on a narrow distribution.
- Fix:
  - Replay Buffer: Mix in 5% of the original pre-training dataset (e.g., generic web text) into your fine-tuning set, as in the sketch below.
  - Reduce num_epochs. Usually 1 epoch is enough for SFT.
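The replay-buffer mix is a single datasets call (a sketch: the 95/5 split, the file name, and the choice of wikitext as "generic web text" are illustrative; both datasets must expose the same columns, here a single "text" field):
from datasets import load_dataset, interleave_datasets

task_ds = load_dataset("json", data_files="sql_finetune.jsonl", split="train")
general_ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="train[:50000]")

mixed = interleave_datasets(
    [task_ds, general_ds],
    probabilities=[0.95, 0.05],   # ~5% replay of generic text
    seed=42,
)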
19. Lab: Fine-Tuning for Self-Correction
A powerful agentic pattern is Self-Correction. We can fine-tune a model specifically to find bugs in its own code.
19.1. Dataset Generation
We need pairs of (Bad Code, Critique). We can generate this synthetically using GPT-4.
# generator.py
problem = "Sort a list"
buggy_code = "def sort(x): return x" # Wrong
critique = "The function returns the list unmodified. It should use x.sort() or sorted(x)."
example = f"""[INST] Critique this code: {buggy_code} [/INST] {critique}"""
19.2. Training
Train a small adapter lora_critic.
19.3. Inference Loop
def generate_robust_code(prompt):
    # Conceptual sketch: generate() stands in for the usual tokenize -> generate -> decode round-trip.
    # 1. Generate Draft (Base Model, no adapter active)
    code = base_model.generate(prompt)
    # 2. Critique (Critic Adapter)
    # We hot-swap the adapter!
    base_model.set_adapter("critic")
    critique = base_model.generate(f"Critique: {code}")
    # 3. Refine (Base Model, adapter disabled again)
    base_model.disable_adapter()
    final_code = base_model.generate(f"Fix this code: {code}\nFeedback: {critique}")
    return final_code
This “Split Personality” approach allows a single 7B model to act as both Junior Dev and Senior Reviewer.
20. Glossary of Fine-Tuning Terms
- Adapter: A small set of trainable parameters added to a frozen model.
- Alignment: The process of making a model follow instructions and human values (SFT + RLHF).
- Catastrophic Forgetting: When learning a new task overwrites the knowledge of old tasks.
- Chat Template: The specific string formatting (<s>[INST]...) required to prompt a chat model.
- Checkpointing: Saving model weights every N steps.
- DPO (Direct Preference Optimization): Optimizing for human preferences without a Reward Model.
- EOS Token (End of Sequence): The special token that tells the generation loop to stop.
- Epoch: One full pass over the training dataset. For LLMs, we rarely do more than 1-3 epochs.
- FSDP (Fully Sharded Data Parallel): Splitting model parameters across GPUs to save memory.
- Gradient Accumulation: Simulating a large batch size (e.g., 128) by running multiple small forward passes (e.g., 4) before one backward update.
- Instruction Tuning: Fine-tuning on a dataset of (Directive, Response) pairs.
- LoRA (Low-Rank Adaptation): Factorizing weight updates into low-rank matrices.
- Mixed Precision (FP16/BF16): Training with 16-bit floats to save memory and speed up tensor cores.
- Quantization: Representing weights with fewer bits (4-bit or 8-bit).
- RLHF: Reinforcement Learning from Human Feedback.
- SFT (Supervised Fine-Tuning): Standard backprop on labeled text.
21. Appendix: BitsAndBytes Config Reference
Using QuantizationConfig correctly is 90% of the battle in getting QLoRA to run.
| Parameter | Recommended | Description |
|---|---|---|
| load_in_4bit | True | Activates the 4-bit loading. |
| bnb_4bit_quant_type | "nf4" | "Normal Float 4". Optimized for the Gaussian distribution of weights. Better than "fp4". |
| bnb_4bit_compute_dtype | torch.bfloat16 | The datatype used for matrix multiplication. BF16 is better than FP16 on Ampere GPUs (prevents overflow). |
| bnb_4bit_use_double_quant | True | Quantizes the quantization constants. Saves ~0.5 GB VRAM per 7B params. |
Example Config:
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16
)
22. Annotated Bibliography
1. “LoRA: Low-Rank Adaptation of Large Language Models”
- Hu et al. (Microsoft) (2021): The paper that started the revolution. Showed rank-decomposition matches full fine-tuning.
2. “QLoRA: Efficient Finetuning of Quantized LLMs”
- Dettmers et al. (2023): Introduced 4-bit Normal Float (NF4) and Paged Optimizers, enabling 65B training on 48GB cards.
3. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model”
- Rafailov et al. (Stanford) (2023): Replaced the complex PPO pipeline with a simple cross-entropy-like loss.
4. “Llama 2: Open Foundation and Fine-Tuned Chat Models”
- Meta (2023): The technical report detailing the SFT and RLHF pipelines used to create Llama-2-Chat. Essential reading for process.
23. Final Summary
Fine-tuning is no longer “re-training”. It is “adaptation”.
Tools like Hugging Face trl and peft have turned a task requiring a PhD and a Supercomputer into a task requiring a Python script and a Gaming PC.
Your MLOps Pipeline:
- Format raw logs into ChatML.
- Train using QLoRA (cheap).
- Align using DPO; evaluate with the LM Evaluation Harness (MMLU).
- Serve using vLLM + Adapters.
In the final section of this Generative AI trilogy, we combine Prompting (20.1) and Model Knowledge (20.2) with External Knowledge: Retrieval Augmented Generation (RAG) Operations.
24. Bonus Lab: Fine-Tuning for Function Calling (Agents)
Most open-source models (Llama-2) are bad at outputting strict JSON for function calling. We can fix this.
24.1. The Data Format
We need to teach the model a specific syntax for invoking tools. Glaive Format:
USER: Check the weather in London.
ASSISTANT: <tool_code> get_weather(city="London") </tool_code>
TOOL OUTPUT: 15 degrees, Rainy.
ASSISTANT: It is 15 degrees and rainy in London.
24.2. Synthetic Generation
We can use GPT-4 to generate thousands of variations of function calls for our specific API schema.
api_schema = "get_stock_price(symbol: str)"
prompt = f"Generate 10 user queries that would trigger this API: {api_schema}"
24.3. Formatting the Tokenizer
We must add <tool_code> and </tool_code> as Special Tokens so they are predicted as a single unit and not split.
tokenizer.add_special_tokens(
{"additional_special_tokens": ["<tool_code>", "</tool_code>"]}
)
model.resize_token_embeddings(len(tokenizer))
Result: A 7B model that reliably outputs valid function calls, enabling you to build custom Agents without paying OpenAI prices.
25. Advanced Viz: Dataset Cartography
Swayamdipta et al. (2020) proposed a way to map your dataset quality.
- Confidence: How probable is the correct label? (Easy vs Hard).
- Variability: How much does the confidence fluctuate during training? (Ambiguous).
Regions:
- Easy-to-Learn: High Confidence, Low Variability. (Model learns these instantly. Can be pruned).
- Ambiguous: Medium Confidence, High Variability. (The most important data for generalization).
- Hard-to-Learn: Low Confidence, Low Variability. (Usually labeling errors).
25.1. The Code
import matplotlib.pyplot as plt
import pandas as pd
# Assume we logged (id, epoch, confidence) during training callback
df = pd.read_csv("training_dynamics.csv")
# Compute metrics
stats = df.groupby("id").agg({
"confidence": ["mean", "std"]
})
stats.columns = ["confidence_mean", "confidence_std"]
# Plot
plt.figure(figsize=(10, 8))
plt.scatter(
stats["confidence_std"],
stats["confidence_mean"],
alpha=0.5
)
plt.xlabel("Variability (Std Dev)")
plt.ylabel("Confidence (Mean)")
plt.title("Dataset Cartography")
# Add regions
plt.axhline(y=0.8, color='g', linestyle='--') # Easy
plt.axhline(y=0.2, color='r', linestyle='--') # Hard/Error
Action: Filter out the "Hard" region (Confidence < 0.2). These examples usually represent mislabeled data that confuses the model. Retraining without them usually improves results.
26. Hardware Guide: Building an LLM Rig
“What GPU should I buy?”
26.1. The “Hobbyist” ($1k - $2k)
- GPU: 1x NVIDIA RTX 3090 / 4090 (24GB VRAM).
- Capability:
- Serve Llama-2-13B (4-bit).
- Fine-tune Llama-2-7B (QLoRA).
- Fine-tuning 13B is a squeeze: QLoRA fits, but only with small batch sizes and short sequences.
26.2. The “Researcher” ($10k - $15k)
- GPU: 4x RTX 3090/4090 (96GB VRAM Total).
- Motherboard: Threadripper / EPYC (Specific PCIe lane requirements).
- Capability:
- Fine-tune Llama-2-70B (QLoRA + DeepSpeed ZeRO-3).
- Serve Llama-2-70B (4-bit).
26.3. The “Startup” ($200k+)
- GPU: 8x H100 (80GB).
- Capability:
- Training new base models from scratch (small scale).
- Full fine-tuning of 70B models.
- High-throughput serving (vLLM).
Recommendation: Start with the Cloud (AWS g5 instances). Only buy hardware if you have 24/7 utilization.
27. Final Checklist: The “Ready to Train” Gate
Do not run python train.py until:
- Data Cleaned: Deduplicated and PII scrubbed.
- Format Verified: tokenizer.apply_chat_template works and <s> tokens look correct.
- Baseline Run: Evaluation (MMLU) run on the base model to establish current IQ.
- Loss Monitored: W&B logging enabled to catch loss spikes.
- Artifact Store: S3 bucket ready for checkpoints (don’t save to local ephemeral disk).
- Cost Approved: “This run will cost $XX”.
Good luck. May your gradients flow and your loss decrease.
28. Appendix: SFTTrainer Cheat Sheet
The TrainingArguments class has hundreds of parameters. These are the critical ones for LLMs.
| Argument | Recommended | Why? |
|---|---|---|
| per_device_train_batch_size | 1 or 2 | VRAM limits. Use Gradient Accumulation to increase effective batch size. |
| gradient_accumulation_steps | 4 - 16 | Effective BS = Device BS $\times$ GPU Count $\times$ Grad Accum. Target 64-128. |
| gradient_checkpointing | True | Critical. Trades Compute for Memory. Allows fitting 2x larger models. |
| learning_rate | 2e-4 (LoRA) | LoRA needs a higher LR than Full Finetuning (2e-5). |
| lr_scheduler_type | "cosine" | Standard for LLMs. |
| warmup_ratio | 0.03 | 3% warmup. Stabilizes training at the start. |
| max_grad_norm | 0.3 or 1.0 | Clips gradients to prevent spikes (instability). |
| bf16 | True | Use Brain Float 16 if on Ampere (A100/3090). Better numerical stability than FP16. |
| group_by_length | True | Sorts dataset by length to minimize padding. 2x speedup. |
| logging_steps | 1 | LLM training is expensive. You want to see the loss curve update instantly. |
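Put together, a TrainingArguments block following this cheat sheet might look like this (a sketch; output_dir and the accumulation count are placeholders you tune to your GPU count):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,   # effective batch = 2 x num_gpus x 16
    gradient_checkpointing=True,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    max_grad_norm=0.3,
    bf16=True,
    group_by_length=True,
    logging_steps=1,
)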
29. Future Outlook: MoE and Self-Play
Fine-tuning is evolving.
29.1. Mixture of Experts (MoE)
Mixtral 8x7B showed that sparse expert models can match much larger dense models at a lower inference cost. Fine-tuning MoEs (like QLoRA for Mixtral) requires specialized care: you must ensure the Router Network doesn't collapse (routing everything to a single expert).
- Config: target_modules=["w1", "w2", "w3"] for all experts.
29.2. Self-Play (SPIN)
Self-Play Fine-Tuning (SPIN) allows a model to improve without new human data.
- The previous iteration of the model generates a synthetic answer.
- We pair it with the human-written answer from the existing SFT data.
- We train the model to prefer the human answer over its own synthetic one. Iterating this creates a self-improvement flywheel (AlphaGo style), purely on text, with no new human labels.
The future of LLMOps is not compiling datasets manually. It is building Synthetic Data Engines that allow models to teach themselves.
30. Code Snippet: LoRA from Scratch
To truly understand LoRA, implement it without peft.
import torch
import torch.nn as nn
import math
class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        # Matrix A: Initialization is Random Gaussian
        self.A = nn.Parameter(torch.randn(in_dim, rank) * self.std_dev)
        # Matrix B: Initialization is Zero
        # This ensures that at step 0, LoRA does nothing (Identity)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha
        self.rank = rank

    def forward(self, x):
        # x shape: (Batch, Seq, In_Dim)
        # A: (In_Dim, Rank), B: (Rank, Out_Dim) -> x @ A @ B: (Batch, Seq, Out_Dim)
        # Standard LoRA scaling: alpha / rank
        return (self.alpha / self.rank) * (x @ self.A @ self.B)
class LinearWithLoRA(nn.Module):
    def __init__(self, linear_layer, rank, alpha):
        super().__init__()
        self.linear = linear_layer
        self.linear.requires_grad_(False)  # Freeze Base
        self.lora = LoRALayer(
            linear_layer.in_features,
            linear_layer.out_features,
            rank,
            alpha
        )

    def forward(self, x):
        # Wx + BAx
        return self.linear(x) + self.lora(x)
# Usage
# original = model.layers[0].self_attn.q_proj
# model.layers[0].self_attn.q_proj = LinearWithLoRA(original, rank=8, alpha=16)
Why Init B to Zero? If B is zero, $B \times A = 0$. So $W_{frozen} + 0 = W_{frozen}$. This guarantees the model starts exactly as the pre-trained model. If we initialized randomly, we would inject noise and destroy the model’s intelligence instantly.
31. References
1. “Scaling Laws for Neural Language Models”
- Kaplan et al. (OpenAI) (2020): The physics of LLMs. Explains why we need so much data.
2. “Training Compute-Optimal Large Language Models (Chinchilla)”
- Hoffmann et al. (DeepMind) (2022): Adjusted the scaling laws. Showed that most models are undertrained.
3. “LIMA: Less Is More for Alignment”
- Zhou et al. (Meta) (2023): Showed that you only need 1,000 high quality examples to align a model, debunking the need for 50k RLHF datasets. This is the paper that justifies Manual Data Cleaning.
4. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”
- Rajbhandari et al. (Microsoft) (2019): The foundation of DeepSpeed.
5. “NEFTune: Noisy Embeddings Improve Instruction Finetuning”
- Jain et al. (2023): A simple trick (adding noise to embeddings) that surprisingly boosts AlpacaEval scores by ~10%.
32. Glossary of GPU Hardware
When requesting quota, know what you are asking for.
- H100 (Hopper): The King. 80GB VRAM. Transformer Engine (FP8). 3x faster than A100.
- A100 (Ampere): The Workhorse. 40GB/80GB. Standard for training 7B-70B models.
- A10G: The Inference Chip. 24GB. Good for serving 7B models or fine-tuning small LoRAs.
- L4 (Lovelace): The successor to T4. 24GB. Excellent for video/inference.
- T4 (Turing): Old, cheap (16GB). Too slow for training LLMs. Good for small BERT models.
- TPU v4/v5 (Google): Tensor Processing Unit. Google’s custom silicon. Requires XLA/JAX ecosystem (or PyTorch/XLA). Faster/Cheaper than NVIDIA if you know how to use it.
33. Final Checklist
- Loss is decreasing: If loss doesn’t drop in first 10 steps, kill it.
- Eval is improving: If MMLU drops, stop.
- Cost is tracked: Don't leave a p4d instance running over the weekend.
- Model is saved: Push to Hugging Face Hub (trainer.push_to_hub()).
Go forth and fine-tune.
34. Acknowledgements
Thanks to the open source community: Tim Dettmers (bitsandbytes), Younes Belkada (PEFT), and the Hugging Face team (Transformer Reinforcement Learning). Without their work, LLMs would remain the exclusive property of Mega-Corps.
Now, let’s learn how to connect these brains to a database. See you in Chapter 20.3.