36.3. Advanced Text Augmentation & Synthetic Data

In the era of Large Language Models (LLMs), the primary constraint on building powerful NLP systems has shifted from Model Architecture (which is mostly commoditized via Transformers) to Data Engineering. Training data is the new codebase. And just like we write unit tests, run linters, and refactor our code, we must apply rigorous engineering principles—including augmentation, synthesis, and version control—to our text datasets.

This section explores the “Data-Centric AI” workflow for NLP, focusing on high-throughput synthetic data generation pipelines implemented in Rust to feed hungry models like Llama 3 or Mistral.

The Case for Synthetic Data

Traditional augmentation in Computer Vision (rotation, crop, flip, color jitter) is semantics-preserving. A rotated cat is still a cat. In NLP, “flipping” a sentence (reversing word order) destroys meaning and grammar. Therefore, NLP augmentation requires higher-level semantic operations that are traditionally hard to automate.

Why Synthetic?

  1. Long-Tail Handling: Real-world usage follows a Power Law. You might have 1,000,000 examples of “Reset Password” intent but only 15 examples of “Update 2FA Settings via YubiKey”. Models fail on the tail. Synthetic data fills this gap.
  2. Privacy & Compliance: You cannot train on real PII-laden customer chat logs without heavy redaction (which hurts utility). Synthetic replicas allow you to train on detailed, realistic scenarios without exposing a single real user’s data.
  3. Cold Start (The “Zero-to-One” Problem): You want to launch a new feature (e.g., “Cancel Subscription”) but have zero logs for it yet. You need to bootstrap the intent classifier.
  4. Adversarial Hardening: You can deliberately generate “Red Teaming” data (injections, ambiguity, toxicity) to train your model to refuse or handle them gracefully.

Technique 1: Deterministic Augmentation (EDA)

Before reaching for expensive GPUs, use CPU-bound deterministic techniques for robustness. The “Easy Data Augmentation” (EDA) paper (Wei & Zou, 2019) proposed four simple operations that help prevent overfitting on small datasets.

1. Synonym Replacement

Randomly choose $n$ words which are not stop words. Replace each of these words with one of its synonyms chosen at random.

  • Rust Implementation: Load a HashMap<String, Vec<String>> Thesaurus (e.g., WordNet dump).
  • Optimized: Use rand::seq::SliceRandom for $O(1)$ selection.
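A minimal, dependency-free sketch of synonym replacement. The hash-based `pick` helper stands in for `rand`, and the thesaurus and stop-word contents are illustrative assumptions:

```rust
use std::collections::{HashMap, HashSet};
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministically pick an index in 0..len from a seed (stand-in for `rand`).
fn pick(seed: u64, len: usize) -> usize {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    (h.finish() % len as u64) as usize
}

/// Replace up to `n` non-stop-word tokens with a synonym from the thesaurus.
pub fn synonym_replace(
    sentence: &str,
    thesaurus: &HashMap<String, Vec<String>>,
    stop_words: &HashSet<&str>,
    n: usize,
    seed: u64,
) -> String {
    let mut tokens: Vec<String> = sentence.split_whitespace().map(String::from).collect();
    let mut replaced = 0;
    for i in 0..tokens.len() {
        if replaced >= n {
            break;
        }
        let word = tokens[i].to_lowercase();
        if stop_words.contains(word.as_str()) {
            continue;
        }
        if let Some(syns) = thesaurus.get(&word) {
            if !syns.is_empty() {
                tokens[i] = syns[pick(seed + i as u64, syns.len())].clone();
                replaced += 1;
            }
        }
    }
    tokens.join(" ")
}
```

In production the `pick` helper would be replaced with `SliceRandom::choose`, and the thesaurus loaded once from a WordNet dump at startup.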

2. Random Insertion

Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Repeat $n$ times.

  • Effect: Changes sentence structure slightly but usually preserves intent.

3. Random Swap

Randomly choose two words in the sentence and swap their positions. Repeat $n$ times.

  • Risk: Can destroy grammar (“I ate the apple” -> “The ate I apple”). Use with low probability ($\alpha=0.05$).

4. Random Deletion

Randomly remove each word in the sentence with probability $p$.

  • Effect: Forces the model to focus on the remaining keywords rather than memorizing the exact sequence.
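Random Swap and Random Deletion can be sketched with a small linear congruential generator so the example needs no external crates (the `Lcg` helper is an illustrative stand-in for `rand`):

```rust
/// Tiny linear congruential generator so the sketch needs no external crates.
struct Lcg(u64);
impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        self.0 >> 33
    }
    fn below(&mut self, n: u64) -> u64 {
        self.next() % n
    }
}

/// Random Swap: exchange two token positions, `n` times.
pub fn random_swap(sentence: &str, n: usize, seed: u64) -> String {
    let mut tokens: Vec<&str> = sentence.split_whitespace().collect();
    let mut rng = Lcg(seed);
    if tokens.len() >= 2 {
        for _ in 0..n {
            let a = rng.below(tokens.len() as u64) as usize;
            let b = rng.below(tokens.len() as u64) as usize;
            tokens.swap(a, b);
        }
    }
    tokens.join(" ")
}

/// Random Deletion: drop each token with probability `p` (keep at least one).
pub fn random_delete(sentence: &str, p: f64, seed: u64) -> String {
    let mut rng = Lcg(seed);
    let kept: Vec<&str> = sentence
        .split_whitespace()
        .filter(|_| (rng.below(1000) as f64 / 1000.0) >= p)
        .collect();
    if kept.is_empty() {
        sentence.split_whitespace().next().unwrap_or("").to_string()
    } else {
        kept.join(" ")
    }
}
```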

Technique 2: Back-Translation

The classic robust augmentation method.

  • Process: Original (En) -> Model A -> Intermediate (Fr/De/Zh) -> Model B -> Paraphrase (En).
  • Effect: Introduces lexical diversity while preserving semantics: “I am happy” -> “Je suis content” -> “I am content”.

MLOps Challenge: Latency and Cost. Running millions of rows through two translation models is computationally expensive. Solution: Offline Batch Processing with quantized CPU models.

Rust Implementation: Async Batch Back-Translation

Using reqwest for APIs or candle/ort for local inference. Here we simulate a high-throughput pipeline using tokio.

#![allow(unused)]
fn main() {
use reqwest::Client;
use serde::{Deserialize, Serialize};
use futures::stream::{self, StreamExt};
use std::sync::Arc;
use tokio::sync::Semaphore;

#[derive(Serialize)]
struct TranslateReq {
    q: String,
    source: String,
    target: String,
}

#[allow(non_snake_case)] // field name must match the API's JSON key exactly
#[derive(Deserialize)]
struct TranslateResp {
    translatedText: String,
}

struct AugmentationEngine {
    client: Client,
    semaphore: Arc<Semaphore>, // Rate limiter critical for API budgets
}

impl AugmentationEngine {
    pub fn new(concurrency: usize) -> Self {
        Self {
            client: Client::new(),
            semaphore: Arc::new(Semaphore::new(concurrency)),
        }
    }

    async fn back_translate(&self, text: String) -> Option<String> {
        let _permit = self.semaphore.acquire().await.ok()?;
        
        // 1. En -> Fr
        let mid = self.translate(&text, "en", "fr").await?;
        // 2. Fr -> En
        let final_text = self.translate(&mid, "fr", "en").await?;
        
        // Basic filter: Don't keep if identical
        if final_text.trim().to_lowercase() == text.trim().to_lowercase() {
            None 
        } else {
            Some(final_text)
        }
    }

    async fn translate(&self, text: &str, src: &str, tgt: &str) -> Option<String> {
        // In production, use a robust library like 'backon' for exponential backoff retries
        let res = self.client.post("http://localhost:5000/translate")
            .json(&TranslateReq { q: text.to_string(), source: src.to_string(), target: tgt.to_string() })
            .send().await.ok()?;
        let json: TranslateResp = res.json().await.ok()?;
        Some(json.translatedText)
    }

    pub async fn run_pipeline(&self, dataset: Vec<String>) -> Vec<String> {
        stream::iter(dataset)
            .map(|text| self.back_translate(text))
            .buffer_unordered(100) // Keep 100 futures in flight
            .filter_map(|res| async { res }) 
            .collect::<Vec<_>>()
            .await
    }
}
}

Technique 3: Self-Instruct Framework (The Alpaca Recipe)

This is the current “Gold Standard”. Use a Teacher Model (GPT-4, Claude 3 Opus) to generate training data for a Student Model (DistilBERT, Llama-3-8B).

The Prompting Flywheel

You cannot just say “Generate 1,000 sentences.” The LLM will loop and produce repetitive, generic garbage (“Mode Collapse”). You need Seed Data and Persona Injection.

Algorithm: Self-Instruct

  1. Seed: Start with 10 hand-written examples of the task.
    {"task": "Classify sentiment", "input": "I loved the movie", "output": "Positive"}
    
  2. Generate Instructions: Ask LLM to generate 10 new instructions similar to the seed.
  3. Filter: Remove instructions that have high ROUGE overlap with existing ones.
  4. Generate Outputs: Ask LLM to answer the new instructions.
  5. Loop: Add new pairs to the Seed pool and repeat.
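Step 3's overlap filter can be sketched with a ROUGE-1-style unigram F1 (a simplification of the ROUGE-L used in the original Self-Instruct paper; the threshold in the test below is an illustrative choice):

```rust
use std::collections::HashSet;

/// ROUGE-1-style unigram overlap (F1) between two instructions.
fn rouge1_f1(a: &str, b: &str) -> f64 {
    let ta: HashSet<&str> = a.split_whitespace().collect();
    let tb: HashSet<&str> = b.split_whitespace().collect();
    if ta.is_empty() || tb.is_empty() {
        return 0.0;
    }
    let overlap = ta.intersection(&tb).count() as f64;
    let p = overlap / tb.len() as f64;
    let r = overlap / ta.len() as f64;
    if p + r == 0.0 { 0.0 } else { 2.0 * p * r / (p + r) }
}

/// Keep a candidate only if it is not too similar to anything already in the pool.
pub fn accept_instruction(pool: &[String], candidate: &str, max_overlap: f64) -> bool {
    pool.iter().all(|existing| rouge1_f1(existing, candidate) < max_overlap)
}
```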

Advanced: Implementing Evol-Instruct in Rust

Standard Self-Instruct hits a ceiling. Evol-Instruct (WizardLM) creates progressively harder instructions.

#![allow(unused)]
fn main() {
use async_openai::{
    types::{CreateChatCompletionRequestArgs, ChatCompletionRequestMessage},
    Client,
};

enum EvolutionType {
    Deepening,
    Concretizing,
    Reasoning,
    Constraints,
}

struct Evolver {
    client: Client<async_openai::config::OpenAIConfig>,
}

impl Evolver {
    async fn evolve(&self, instruction: &str, method: EvolutionType) -> String {
        let prompt = match method {
            EvolutionType::Deepening => format!("Reword the following inquiry to require more complex reasoning: '{}'", instruction),
            EvolutionType::Constraints => format!("Add a constraint to the following inquiry (e.g. word count, forbidden words): '{}'", instruction),
            // ... other cases
            _ => instruction.to_string(),
        };

        let request = CreateChatCompletionRequestArgs::default()
            .model("gpt-4")
            .messages([ChatCompletionRequestMessage::User(prompt.into())])
            .build().unwrap();

        let response = self.client.chat().create(request).await.unwrap();
        response.choices[0].message.content.clone().unwrap()
    }
}
}

Technique 4: Genetic Prompt Optimization

To maximize the quality of synthetic data, we can “evolve” the prompts themselves.

Algorithm:

  1. Population: Start with 10 prompts.
  2. Evaluate: Generate data with each prompt. Score data with a critic model.
  3. Select: Keep top 5 prompts.
  4. Mutate: Ask LLM to “rewrite this prompt to be more specific”.
  5. Crossover: Combine two prompts.
  6. Loop.
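The loop above can be sketched as a single generation step. The `critic` and `mutate` closures are stand-ins for the critic-model scoring and LLM rewrite calls:

```rust
/// One generation of the genetic loop: evaluate, select survivors,
/// then refill the population with mutated copies of the survivors.
pub fn evolve_generation<F, M>(
    population: Vec<String>,
    keep: usize,
    critic: F,
    mutate: M,
) -> Vec<String>
where
    F: Fn(&str) -> f64,
    M: Fn(&str) -> String,
{
    // Evaluate each prompt and sort descending by critic score.
    let mut scored: Vec<(f64, String)> =
        population.into_iter().map(|p| (critic(&p), p)).collect();
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());

    // Select the top `keep` prompts, then mutate each survivor.
    let survivors: Vec<String> = scored.into_iter().take(keep).map(|(_, p)| p).collect();
    let mut next = survivors.clone();
    for p in &survivors {
        next.push(mutate(p));
    }
    next
}
```

Crossover (combining two surviving prompts) would be added the same way, as another closure that calls the LLM with both parents.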

Managing Augmentation Artifacts

Augmented data is derived data. It is often 10x or 100x larger than the seed data. Storage and versioning become critical.

Data Version Control (DVC) Integration

Do not track the CSVs in Git. Use DVC. Treat the augmentation script as a DVC stage.

# dvc.yaml
stages:
  augment:
    cmd: cargo run --bin augment -- --input data/seed.csv --output data/train_v2.parquet
    deps:
      - data/seed.csv
      - src/bin/augment.rs
      - config/prompts.yaml
    outs:
      - data/train_v2.parquet:
          cache: true
    metrics:
      - metrics/diversity_score.json

Parquet: Always use Parquet (via polars in Rust) for augmented datasets. It compresses effectively (text often compresses 5x-10x) and supports columnar access (fast for reading just the “text” column for training).

Vector Store Abstraction for RAG-Augmentation

When generating data, retrieving relevant context is key. We need a robust Vector Store abstraction in Rust.

#![allow(unused)]
fn main() {
use async_trait::async_trait;
use anyhow::Result;

#[derive(Debug)]
pub struct ScoredPoint {
    pub id: String,
    pub score: f32,
    pub payload: serde_json::Value,
}

#[async_trait]
pub trait VectorStore {
    async fn insert(&self, points: Vec<ScoredPoint>) -> Result<()>;
    async fn search(&self, query: Vec<f32>, top_k: usize) -> Result<Vec<ScoredPoint>>;
}

// Qdrant Implementation
pub struct QdrantStore {
    client: qdrant_client::QdrantClient,
    collection: String,
}

#[async_trait]
impl VectorStore for QdrantStore {
    async fn insert(&self, points: Vec<ScoredPoint>) -> Result<()> {
        // Map points to Qdrant PointStruct...
        Ok(())
    }
    
    async fn search(&self, query: Vec<f32>, top_k: usize) -> Result<Vec<ScoredPoint>> {
        // Call search_points...
        Ok(vec![]) 
    }
}
}

Quality Assurance: The “Critic” Loop

Blindly adding synthetic data often hurts model performance (“Model Poisoning” or “Autophagous Loops”). You need a selection mechanism.

1. Semantic Consistency Check

Does the augmented sentence actually mean the same thing?

  • Idea: Use a Sentence Transformer (e.g., all-MiniLM-L6-v2) to embed both original and augmented examples.
  • Filter: If cosine_similarity(orig, aug) < 0.85, discard.

2. Diversity Check (Embedding Distance)

Are we just duplicating data?

  • Logic: Compute embeddings for the entire synthetic set.
  • Metric: Average pairwise distance. If too low, your synthetic data is repetitive.
  • Visualization: Use UMAP to reduce to 2D and look for “clumps”. Good data covers the space uniformly.
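A dependency-free sketch of the diversity metric, assuming the embeddings have already been computed elsewhere:

```rust
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Mean pairwise cosine distance (1 - similarity) over a set of embeddings.
/// A value near 0 signals a repetitive synthetic set.
pub fn mean_pairwise_distance(embeddings: &[Vec<f32>]) -> f32 {
    let n = embeddings.len();
    if n < 2 {
        return 0.0;
    }
    let mut total = 0.0;
    let mut pairs = 0u32;
    for i in 0..n {
        for j in (i + 1)..n {
            total += 1.0 - cosine(&embeddings[i], &embeddings[j]);
            pairs += 1;
        }
    }
    total / pairs as f32
}
```

Note this is O(n²); at millions of rows you would sample pairs or compare against cluster centroids instead.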

3. LLM-as-a-Judge

Use a second, independent LLM prompt to grade the quality of the generated data.

  • Prompt: “Rate the following user query for realism on a scale of 1-5. Output JSON.”
  • Filter: Discard anything < 4.

Rust Implementation: Semantic Filtering with candle

Using candle (Hugging Face’s Rust ML framework) to run BERT embeddings on CPU/GPU for filtration.

#![allow(unused)]
fn main() {
use candle_core::{Tensor, Device};
use tokenizers::Tokenizer; // from the `tokenizers` crate

// Pseudo-code for embedding extraction
struct EmbeddingFilter {
    // A simplified BERT model struct would also live here alongside the tokenizer
    tokenizer: Tokenizer,
}

impl EmbeddingFilter {
    pub fn is_semantically_similar(&self, t1: &str, t2: &str, threshold: f32) -> bool {
        let e1 = self.embed(t1);
        let e2 = self.embed(t2);
        let sim = cosine_similarity(&e1, &e2);
        sim >= threshold
    }

    fn embed(&self, text: &str) -> Vec<f32> {
        // Full candle BERT execution logic would go here:
        // 1. Tokenize
        // 2. Model forward pass
        // 3. Extract the [CLS] token embedding (or mean-pool)
        vec![0.1, 0.2, 0.3] // placeholder
    }
}
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot_product: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot_product / (norm_a * norm_b)
}
}

Entity Replacement & Noise Injection

For Named Entity Recognition (NER), simply swapping names is a powerful augmentation.

  • “I met Alice in Paris” -> “I met Bob in Austin”.

Rust Implementation Note: This requires accurate pre-identification of tags. Use the presidio analyzer logic (or Rust regex) to identify placeholders, then sample from a fast lookup table (e.g., specialized faker crate equivalents or raw CSV lists).
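A minimal sketch of the replacement step, assuming a hypothetical `{PERSON}`/`{CITY}` placeholder format produced by the pre-identification pass, with illustrative lookup tables:

```rust
use std::collections::HashMap;

/// Replace `{TAG}`-style placeholders (hypothetical format) with entries
/// from lookup tables, cycling deterministically from `seed`.
pub fn fill_entities(
    template: &str,
    tables: &HashMap<&str, Vec<&str>>,
    seed: usize,
) -> String {
    let mut out = template.to_string();
    for (tag, values) in tables {
        let placeholder = format!("{{{}}}", tag);
        let mut i = seed;
        while let Some(pos) = out.find(&placeholder) {
            let value = values[i % values.len()];
            out.replace_range(pos..pos + placeholder.len(), value);
            i += 1;
        }
    }
    out
}
```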

Noise Injection: Simulate ASR errors or typos.

  • Keyboard Distance Swaps: Replace ‘k’ with ‘l’ or ‘j’ (adjacent keys).
  • Char Deletion: “meeting” -> “meetin”.
  • Char Insertion: “hello” -> “helllo”.

These simple corruptions make models extremely robust to real-world messy user input.
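The three corruption operators can be sketched as follows; the adjacency map covers only a few illustrative QWERTY keys, and positions are fixed parameters rather than sampled, to keep the example deterministic:

```rust
/// Adjacent-key lookup for a handful of QWERTY keys (illustrative subset).
fn neighbor(c: char) -> char {
    match c {
        'k' => 'l',
        'e' => 'r',
        'o' => 'p',
        'a' => 's',
        _ => c,
    }
}

/// Keyboard Distance Swap: replace the char at `idx` with an adjacent key.
pub fn keyboard_swap(word: &str, idx: usize) -> String {
    word.chars()
        .enumerate()
        .map(|(i, c)| if i == idx { neighbor(c) } else { c })
        .collect()
}

/// Char Deletion: drop the char at `idx`.
pub fn char_delete(word: &str, idx: usize) -> String {
    word.chars().enumerate().filter(|&(i, _)| i != idx).map(|(_, c)| c).collect()
}

/// Char Insertion: insert `c` at `idx` (clamped to the word length).
pub fn char_insert(word: &str, idx: usize, c: char) -> String {
    let mut s: Vec<char> = word.chars().collect();
    s.insert(idx.min(s.len()), c);
    s.into_iter().collect()
}
```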

Watermarking Synthetic Data

If you publish datasets, you might want to prove they are synthetic (or detect if future models are training on your synthetic output).

The “Red List / Green List” Method (Soft Watermarking):

  1. Divide vocabulary of size $|V|$ into Green list $G$ and Red list $R$ based on a hash of the previous token $t_{i-1}$.
  2. During generation, slightly bias logits towards Green tokens: $l_v = l_v + \delta \text{ if } v \in G$.
  3. Detection: A text with a statistically impossible number of “Green” tokens is watermarked.
  4. Z-Score: Compute the Z-score of the Green token count under the null hypothesis (random generation).
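The detection step reduces to a binomial z-score under the null hypothesis. A sketch, where `gamma` is the green-list fraction $|G|/|V|$:

```rust
/// Z-score of observing `green_count` green tokens out of `n` generated tokens,
/// when a fraction `gamma` of the vocabulary is green (null: random generation).
/// z = (c - gamma * n) / sqrt(n * gamma * (1 - gamma))
pub fn watermark_z_score(green_count: usize, n: usize, gamma: f64) -> f64 {
    let expected = gamma * n as f64;
    let std_dev = (n as f64 * gamma * (1.0 - gamma)).sqrt();
    (green_count as f64 - expected) / std_dev
}
```

With gamma = 0.5, a text of 100 tokens containing 90 green tokens scores z = 8, far beyond any plausible threshold (z > 4 is a common detection cutoff).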

MLOps Implication: Store the Watermark Key/Seed securely. This allows you to audit if your proprietary synthetic data has leaked into public datasets or competitors’ models.

Active Learning: The Feedback Loop

Augmentation should be targeted. Don’t augment everything. Augment what the model finds “hard”.

  1. Train Model V1 on Seed Data.
  2. Inference on a large generic pool of unlabeled text (or synthetic candidates).
  3. Uncertainty Sampling: Select examples with High Entropy (model is confused) or Low Confidence.
  4. Label/Augment: Send only these hard examples to the LLM (or human) for labeling/correction.
  5. Retrain: Model V2.

This “Active Learning” loop reduces data costs by 10x-100x compared to random sampling.
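Step 3 (uncertainty sampling) can be sketched as entropy ranking over the model's predicted class distributions:

```rust
/// Shannon entropy of a probability distribution (natural log).
fn entropy(probs: &[f64]) -> f64 {
    probs.iter().filter(|&&p| p > 0.0).map(|&p| -p * p.ln()).sum()
}

/// Return the indices of the `k` most uncertain (highest-entropy) predictions;
/// these are the examples worth sending to the LLM or a human for labeling.
pub fn uncertainty_sample(predictions: &[Vec<f64>], k: usize) -> Vec<usize> {
    let mut indexed: Vec<(usize, f64)> = predictions
        .iter()
        .enumerate()
        .map(|(i, p)| (i, entropy(p)))
        .collect();
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    indexed.into_iter().take(k).map(|(i, _)| i).collect()
}
```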

Data Deduplication at Scale (SemDeDup)

Generating millions of synthetic examples leads to extensive redundancy. Deduplication is vital to prevent overfitting.

  • Exact Dedup: Use SHA256 hashes of normalized text. Eliminates copy-paste errors.
  • MinHash LSH: Fuzzy deduplication for “near-duplicates” (sentences that vary by 1 word).
  • Embedding Clustering: Cluster embeddings (using K-Means GPU) and keep only the centroid + outliers from each cluster.
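The exact-dedup pass can be sketched with std's `DefaultHasher` standing in for SHA-256 (in production you would use a real SHA-256 implementation such as the `sha2` crate):

```rust
use std::collections::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Normalize (lowercase, collapse whitespace) then hash.
fn normalized_hash(text: &str) -> u64 {
    let norm: String = text
        .to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ");
    let mut h = DefaultHasher::new();
    norm.hash(&mut h);
    h.finish()
}

/// Exact deduplication: keep the first occurrence of each normalized text.
pub fn dedup_exact(rows: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    rows.into_iter().filter(|r| seen.insert(normalized_hash(r))).collect()
}
```

MinHash LSH and embedding clustering follow the same keep-first pattern, but bucket by signature or cluster ID instead of an exact hash.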

Summary Strategy

  1. Start Small: Deterministic augmentation (synonyms, typos) is free and helps robustness.
  2. Scale Up: Use Self-Instruct loops with GPT-4 for “Golden” synthetic data.
  3. Filter Aggressively: Semantic dedup and diversity checks are mandatory.
  4. Version: Use DVC + Parquet.
  5. Target: Use Active Learning to focus augmentation on the model’s weak points.
  6. Protect: Watermark your synthetic outputs to trace their provenance.