36.2. Text Preprocessing Pipelines

Garbage in, garbage out. In NLP, “garbage” often looks like invisible control characters, mismatched encodings, or subtly different whitespace that humans ignore but machines stumble over. While a researcher might scrub data in a Jupyter notebook using pandas and ad-hoc string replacements, an MLOps engineer must build a Text Preprocessing Pipeline that is reproducible, scalable, and identical across training and serving.

This section details how to build high-performance text pipelines, focusing on the strict determinism required for production NLP, implemented in Rust.

The Production Preprocessing Gap

In research, preprocessing is often “done once” on a static CSV file. In production, preprocessing must happen in milliseconds on a single request, or in parallel across terabytes of daily logs.

Common Anti-Patterns:

  1. Regex Backtracking: Using Python’s re module with nested quantifiers or complex look-behinds that trigger catastrophic backtracking on malicious input.
  2. Implicit Encoding: Assuming clean UTF-8 without stripping the BOM (Byte Order Mark) or handling “mojibake” (garbled text such as Ã© appearing where é was intended).
  3. Library Drift: pandas str.lower() vs Python str.lower() vs C++ std::tolower vs Rust to_lowercase(). They mostly agree, but edge cases (like Turkish “I”) can cause divergences that invalidate model caches.
  4. Memory Bloat: Loading entire documents into memory for regex replacement instead of streaming.

The Foundation: Unicode Normalization

Text is just bytes, but valid text is a complex standard.

The Problem of Visual Equivalence

The character é can be represented as:

  • Composed (NFC): U+00E9 (One code point)
  • Decomposed (NFD): U+0065 (e) + U+0301 (acute accent)

To a human, they look identical. To a tokenizer, bytes("é") in NFC is [195, 169], while NFD is [101, 204, 129]. If your training data was NFC and your inference data is NFD, your embedding lookups will likely fail (UNK) or map to different vectors.

Rust Solution: unicode-normalization

In Rust, we enforce normalization explicitly.

[dependencies]
unicode-normalization = "0.1"
unicode-segmentation = "1.10"
use unicode_normalization::UnicodeNormalization;

/// Normalizes text to NFC form.
/// NFC is the standard for the Web (W3C Character Model) and most modern NLP models.
pub fn normalize_text(input: &str) -> String {
    // Standardize on NFC (Canonical Composition).
    // This is generally preferred for web text and standard tokenizers (like BERT).
    input.nfc().collect::<String>()
}

/// Compatibility Decomposition (NFKD)
/// Sometimes used for "brute force" search normalization, e.g., converting "ℍ" to "H".
/// Warning: This loses semantic meaning (e.g., "³" becomes "3").
pub fn aggressive_normalize(input: &str) -> String {
    input.nfkd().collect::<String>()
}

#[test]
fn test_equivalence() {
    let s1 = "\u{00E9}"; // é (NFC)
    let s2 = "\u{0065}\u{0301}"; // e + acute (NFD)
    
    assert_ne!(s1, s2); // Bytes are different
    assert_eq!(normalize_text(s1), normalize_text(s2)); // Normed strings are identical
}

MLOps Rule: Every entry point to your NLP system (API gateway, Kafka consumer) must apply NFC normalization before any other logic.

High-Performance Cleaning with Rust Regex

Python’s re module is feature-rich but can be slow and vulnerable to ReDoS (Regular Expression Denial of Service). Rust’s regex crate uses finite automata, guaranteeing linear time execution $O(n)$ with respect to the input size.

The Architecture of Rust Regex

The regex crate compiles patterns into finite automata (an NFA paired with a lazy DFA). This allows it to process text in a single pass without backtracking.

  • Trade-off: It does not support look-around (look-ahead/look-behind) because those features require backtracking.
  • Benefit: It is immune to ReDoS attacks. Even 1MB of “evil” input will be processed in linear time.

Cleaning Pipeline Example

Common tasks: removing URLs, stripping HTML tags, handling excessive whitespace.

use regex::Regex;
use lazy_static::lazy_static; // or use std::sync::OnceLock in newer Rust

lazy_static! {
    static ref URL_REGEX: Regex = Regex::new(r"https?://\S+").unwrap();
    static ref EMAIL_REGEX: Regex = Regex::new(r"[\w\.-]+@[\w\.-]+").unwrap();
    static ref HTML_TAGS: Regex = Regex::new(r"<[^>]*>").unwrap();
    static ref MULTI_SPACE: Regex = Regex::new(r"\s+").unwrap();
}

pub struct TextCleaner {
    strip_html: bool,
    normalize_whitespace: bool,
}

impl TextCleaner {
    pub fn new(strip_html: bool) -> Self {
        Self { 
            strip_html,
            normalize_whitespace: true 
        }
    }

    pub fn clean(&self, text: &str) -> String {
        let mut curr = text.to_string();

        if self.strip_html {
            curr = HTML_TAGS.replace_all(&curr, " ").to_string();
        }

        // Mask PII (example)
        curr = EMAIL_REGEX.replace_all(&curr, "<EMAIL>").to_string();
        curr = URL_REGEX.replace_all(&curr, "<URL>").to_string();

        if self.normalize_whitespace {
            // Trim and collapse multiple spaces to one
            curr = MULTI_SPACE.replace_all(&curr, " ").trim().to_string();
        }

        curr
    }
}

Memory Optimization Strategies for Text

String manipulation is allocation-heavy. Allocating a new String for every replacement in terabyte-scale logs is inefficient. Rust offers powerful abstractions to minimize copying.

Copy-on-Write (Cow<str>)

The Cow (Clone on Write) enum allows you to return the original string slice if no changes were needed, and only allocate a new String if a change actually occurred.

use std::borrow::Cow;

pub fn normalize_whitespace_cow(input: &str) -> Cow<str> {
    if !input.contains("  ") {
        return Cow::Borrowed(input);
    }
    // Allocation happens ONLY here
    let normalized = MULTI_SPACE.replace_all(input, " ");
    Cow::Owned(normalized.into_owned())
}

String Interning

If your dataset has many repeated strings (e.g., categorical labels, repetitive log headers), use interning. This stores the string once in a global pool and passes around a u32 symbol ID.

use string_interner::{StringInterner, Symbol};

pub struct Vocab {
    interner: StringInterner,
}

impl Vocab {
    pub fn new() -> Self {
        Self { interner: StringInterner::default() }
    }

    pub fn get_id(&mut self, text: &str) -> u32 {
        // `to_usize()` comes from the `Symbol` trait.
        self.interner.get_or_intern(text).to_usize() as u32
    }
}

SmallString Optimization

For short text fields (e.g., tags, usernames < 23 bytes), use smartstring or compact_str to store the string inline on the stack, bypassing heap allocation entirely.
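
A minimal sketch with the compact_str crate (the TaggedDoc struct and make_doc helper are illustrative, not part of the pipeline above): short strings stay inline on the stack, longer ones fall back to the heap automatically.

use compact_str::CompactString;

/// Tags and usernames are typically short; CompactString keeps them inline
/// (no heap allocation) until they exceed the inline capacity.
pub struct TaggedDoc {
    pub tag: CompactString,
    pub text: String, // full documents still belong on the heap
}

pub fn make_doc(tag: &str, text: &str) -> TaggedDoc {
    TaggedDoc {
        tag: CompactString::new(tag),
        text: text.to_string(),
    }
}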

Dealing with “Dirty” Text (OCR and ASR Errors)

Real-world text often comes from Optical Character Recognition (OCR) or Automatic Speech Recognition (ASR), which introduces specific noise patterns.

Error Correction Heuristics

  • Visual Confusables: l (lower L) vs 1 (one) vs I (capital i); a character-mapping sketch follows this list.
  • OCR Splits: “exam ple” instead of “example”.
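
For the confusable case, a small character-level mapping applied only to mostly-alphabetic tokens often suffices. A minimal sketch (the mapping table and the fix_confusables helper are illustrative):

use std::collections::HashMap;

/// Rewrite digit/letter confusables inside tokens that are otherwise alphabetic
/// (e.g., "examp1e" -> "example"). The mapping below is illustrative, not exhaustive.
pub fn fix_confusables(token: &str) -> String {
    let map: HashMap<char, char> = [('0', 'o'), ('1', 'l'), ('5', 's')].into_iter().collect();
    let alpha = token.chars().filter(|c| c.is_alphabetic()).count();
    // Only rewrite if the token is mostly letters; "2021" should stay numeric.
    if alpha * 2 < token.chars().count() {
        return token.to_string();
    }
    token.chars().map(|c| *map.get(&c).unwrap_or(&c)).collect()
}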

Rust Implementation using Edit Distance: For critical keyword matching, fuzzy search is safer than exact string match.

use std::collections::HashSet;
use strsim::levenshtein;

pub fn is_match(candidate: &str, target: &str, threshold: usize) -> bool {
    levenshtein(candidate, target) <= threshold
}

// Example: Fixing split words
// This requires a dictionary lookup which is fast with a HashSet or BloomFilter
pub fn merge_split_words(text: &str, dict: &HashSet<String>) -> String {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < words.len() {
        // Try merging with the next word if the result is a dictionary word.
        if i + 1 < words.len() {
            let merged = format!("{}{}", words[i], words[i + 1]);
            if dict.contains(&merged) {
                out.push(merged);
                i += 2;
                continue;
            }
        }
        out.push(words[i].to_string());
        i += 1;
    }
    out.join(" ")
}

Truecasing: Restoring Case Information

ASR often outputs all-lowercase text. “the us president” -> “The US President”. A simple .title() is wrong (“The Us President”). Truecasing is a probabilistic problem.

Statistical Model Approach:

  1. Compute probability $P(c | w)$ (probability of casing $c$ given word $w$).
  2. Also consider $P(c_i | c_{i-1})$ (start of sentence is usually capitalized).
  3. Use Hidden Markov Model (HMM) or CRF to infer the sequence.

Rust Implementation (Simplified):

use std::collections::HashMap;

pub struct Truecaser {
    model: HashMap<String, String>, // "us" -> "US"
}

impl Truecaser {
    pub fn truecase(&self, text: &str) -> String {
        text.split_whitespace()
            .map(|w| {
                // Use the learned casing if we have one; otherwise keep the token as-is.
                self.model
                    .get(&w.to_lowercase())
                    .cloned()
                    .unwrap_or_else(|| w.to_string())
            })
            .collect::<Vec<_>>()
            .join(" ")
    }
}

Distributed Preprocessing: Rust DataFusion vs. Ray

When processing Common Crawl or other terabyte-scale corpora, simple for loops don’t cut it.

The Python Approach (Ray/Dask)

Pickling Python functions and data out to workers is flexible, but the serialization overhead makes it slow for CPU-bound string manipulation.

The Rust Approach (Polars / DataFusion)

Multithreaded vectorization.

  1. Polars: Excellent for single-node processing of large in-memory (or streaming) datasets. Offers a pandas-like API but is implemented in Rust on top of Apache Arrow.
  2. DataFusion: Query engine that can execute cleaning as UDFs (User Defined Functions) over Parquet files.

Example: Polars String Expression

use polars::prelude::*;

fn clean_series(series: &Series) -> PolarsResult<Series> {
    // Note: Polars string APIs vary by version (e.g. `utf8()` vs `str()`);
    // this follows the older Utf8Chunked API.
    let chunked = series.utf8()?;
    let out: Utf8Chunked = chunked.apply(|val| {
        // Call our Rust cleaner
        std::borrow::Cow::Owned(normalize_text(val))
    });
    Ok(out.into_series())
}

Streaming Preprocessing Pipelines in Rust

For datasets that do not fit in memory (e.g., 2TB JSONL files), you must stream. Rust’s tokio-stream and serde_json allow line-by-line processing with constant memory usage.

use tokio::fs::File;
use tokio::io::{BufReader, AsyncBufReadExt, AsyncWriteExt};
use serde_json::Value;

pub async fn stream_process(input_path: &str, output_path: &str) -> std::io::Result<()> {
    let input = File::open(input_path).await?;
    let reader = BufReader::new(input);
    let mut lines = reader.lines();

    let output = File::create(output_path).await?;
    let mut writer = tokio::io::BufWriter::new(output);

    while let Some(line) = lines.next_line().await? {
        if let Ok(mut json) = serde_json::from_str::<Value>(&line) {
            // Assume the text field is "content"; records without it are skipped.
            if let Some(text) = json["content"].as_str() {
                let cleaned = normalize_text(text);
                json["content"] = Value::String(cleaned);
                
                let out_line = serde_json::to_string(&json).unwrap();
                writer.write_all(out_line.as_bytes()).await?;
                writer.write_all(b"\n").await?;
            }
        }
    }
    writer.flush().await?;
    Ok(())
}

Concurrency: You can combine this with tokio::spawn and channels to create a worker pool that processes chunks of lines in parallel while the main thread handles I/O.
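
A hedged sketch of that worker-pool pattern, using a semaphore to bound in-flight chunks and tokio::task::spawn_blocking for the CPU-bound cleaning (the chunk size and the clean_chunk_line placeholder are assumptions, not fixed parts of the pipeline):

use std::sync::Arc;
use tokio::sync::Semaphore;

/// Cleans chunks of lines on blocking worker threads while the caller keeps reading input.
/// `max_workers` bounds how many chunks are in flight at once.
pub async fn process_lines_parallel(lines: Vec<String>, max_workers: usize) -> Vec<String> {
    let semaphore = Arc::new(Semaphore::new(max_workers));
    let mut handles = Vec::new();

    for chunk in lines.chunks(1024).map(|c| c.to_vec()) {
        // Wait for a free worker slot before spawning more work.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::task::spawn_blocking(move || {
            let cleaned: Vec<String> = chunk.iter().map(|l| clean_chunk_line(l)).collect();
            drop(permit); // release the slot
            cleaned
        }));
    }

    let mut out = Vec::with_capacity(lines.len());
    for handle in handles {
        out.extend(handle.await.unwrap());
    }
    out
}

// Placeholder for the synchronous cleaning logic (normalize_text + TextCleaner).
fn clean_chunk_line(line: &str) -> String {
    line.trim().to_string()
}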

Efficient Data Deduplication (MinHash LSH)

Training on duplicate data hurts model performance (memorization over generalization). Exact string matching is largely useless because of minor whitespace differences. We need fuzzy deduplication.

MinHash: A probabilistic data structure for estimating Jaccard similarity.

  1. Shingle the document (create n-grams).
  2. Hash each shingle with $K$ different hash functions.
  3. Keep the minimum hash value for each function.
  4. The signature is the vector of $K$ min-hashes.

Rust Implementation (shown manually here; dedicated MinHash/LSH crates are also available):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

pub struct MinHash {
    num_hashes: usize,
    seeds: Vec<u64>,
}

impl MinHash {
    pub fn new(num_hashes: usize) -> Self {
        Self {
            num_hashes,
            seeds: (0..num_hashes).map(|i| i as u64).collect(), // simple seeds
        }
    }

    pub fn compute_signature(&self, shingles: &[&str]) -> Vec<u64> {
        let mut signature = vec![u64::MAX; self.num_hashes];

        for shingle in shingles {
            for i in 0..self.num_hashes {
                let mut hasher = DefaultHasher::new();
                shingle.hash(&mut hasher);
                self.seeds[i].hash(&mut hasher); // Mix in seed
                let hash = hasher.finish();
                if hash < signature[i] {
                    signature[i] = hash;
                }
            }
        }
        signature
    }

    pub fn jaccard_estimate(&self, sig_a: &[u64], sig_b: &[u64]) -> f64 {
        let matches = sig_a.iter().zip(sig_b).filter(|(a, b)| a == b).count();
        matches as f64 / self.num_hashes as f64
    }
}

MLOps Implementation at Scale: Calculate signatures during the ingestion phase. Store signatures in a vector database or a specialized LSH index (like FAISS or a Redis Bloom filter). Drop documents if jaccard_estimate > 0.9.
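
The standard LSH trick for that index is banding: split each signature into b bands, hash each band, and treat any shared band hash as a candidate pair worth an exact check. A minimal sketch (the assumption that the signature length is a multiple of the band count is for simplicity):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Split a MinHash signature into `bands` bands and hash each band into a bucket key.
/// Assumes signature.len() is a multiple of `bands`.
pub fn band_keys(signature: &[u64], bands: usize) -> Vec<u64> {
    let rows = signature.len() / bands;
    signature
        .chunks(rows)
        .map(|band| {
            let mut hasher = DefaultHasher::new();
            band.hash(&mut hasher);
            hasher.finish()
        })
        .collect()
}

Documents that share at least one band key become candidate duplicates; only those pairs need the exact jaccard_estimate comparison against the 0.9 threshold.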

PII Redaction: Hybrid Approaches

General Data Protection Regulation (GDPR) and other laws require scrubbing Personally Identifiable Information (PII) before training.

The “Swiss Cheese” Problem

If you redact too aggressively (e.g., removing all names outright), the model loses context (the sentence degrades to “ met at ”). If you redact too loosely, you leak data.

Presidio in Rust (Concept)

Microsoft Presidio is the standard tool (Python/Go). In Rust, we build a pipeline of recognizers:

  1. Pattern Recognizers: Regexes for Email, Credit Cards (validated with the Luhn algorithm, sketched after the code below), SSN, Phone numbers.
  2. Model Recognizers: Fast NER models (ONNX Runtime in Rust) to detect Person/Location/Org.
  3. Context Enhancers: Looking for “Call me at…” before a number.
use regex::Regex;

// Simple PII Scrubber trait
pub trait PiiScrubber {
    fn scrub(&self, text: &str) -> String;
}

pub struct RegexScrubber {
    emails: Regex,
    phones: Regex,
}

impl PiiScrubber for RegexScrubber {
    fn scrub(&self, text: &str) -> String {
        let t1 = self.emails.replace_all(text, "[EMAIL]");
        let t2 = self.phones.replace_all(&t1, "[PHONE]");
        t2.to_string()
    }
}
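
The credit-card recognizer from the list above typically pairs a regex with a Luhn checksum to cut false positives. A minimal sketch of the check (the 13-19 digit length guard is a card-number heuristic):

/// Returns true if the digit string passes the Luhn checksum
/// (used to validate candidate credit-card numbers).
pub fn luhn_valid(number: &str) -> bool {
    let digits: Vec<u32> = number.chars().filter_map(|c| c.to_digit(10)).collect();
    if digits.len() < 13 || digits.len() > 19 {
        return false; // card numbers are 13-19 digits
    }
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            // Double every second digit from the right, subtracting 9 if it overflows.
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}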

Language Identification

Before processing, you must know the language. Mixing languages in a monolingual pipeline causes massive noise.

Tools:

  • CLD2 / CLD3: Google’s Compact Language Detectors (C++ bindings).
  • Whatlang: A pure Rust library based on trigrams. Super fast, zero dependencies.
use whatlang::{detect, Lang};

pub fn check_language(text: &str, target: Lang) -> bool {
    if let Some(info) = detect(text) {
        // High confidence check
        if info.lang() == target && info.confidence() > 0.8 {
            return true;
        }
    }
    false
}

MLOps Pipeline Integration: Filter/Route based on language ID.

  • En -> Pipeline A
  • Fr -> Pipeline B
  • Unknown -> Quarantine Bucket
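
A hedged sketch of that routing with whatlang (the Route enum, the 0.8 confidence threshold, and the two-pipeline split are illustrative):

use whatlang::{detect, Lang};

pub enum Route {
    PipelineA, // English
    PipelineB, // French
    Quarantine,
}

pub fn route_document(text: &str) -> Route {
    match detect(text) {
        Some(info) if info.lang() == Lang::Eng && info.confidence() > 0.8 => Route::PipelineA,
        Some(info) if info.lang() == Lang::Fra && info.confidence() > 0.8 => Route::PipelineB,
        // Low confidence or unsupported language goes to the quarantine bucket.
        _ => Route::Quarantine,
    }
}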

Production Pipeline using tower::Service

To make our preprocessing pipeline robust and composable (like an HTTP stack), we can use the tower crate, which is the standard service abstraction in Rust.

use tower::{Service, ServiceBuilder};
use std::task::{Context, Poll};

// Define the Request/Response
struct TextRequest { content: String }
struct TextResponse { content: String }

// Middleware Layer: Normalization
struct NormalizationService<S> { inner: S }
impl<S> Service<TextRequest> for NormalizationService<S> 
where S: Service<TextRequest, Response=TextResponse> {
    type Response = S::Response;
    type Error = S::Error;
    type Future = S::Future;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        self.inner.poll_ready(cx)
    }

    fn call(&mut self, mut req: TextRequest) -> Self::Future {
        req.content = normalize_text(&req.content);
        self.inner.call(req)
    }
}

// Middleware Layer: PII
// ... similar impl ...

// The final (innermost) service: returns the Request content as the Response.
struct EchoService;

impl Service<TextRequest> for EchoService {
    type Response = TextResponse;
    type Error = std::convert::Infallible;
    type Future = std::future::Ready<Result<Self::Response, Self::Error>>;

    fn poll_ready(&mut self, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        Poll::Ready(Ok(()))
    }

    fn call(&mut self, req: TextRequest) -> Self::Future {
        std::future::ready(Ok(TextResponse { content: req.content }))
    }
}

// Building the stack
fn build_pipeline() {
    let service = ServiceBuilder::new()
        .layer(tower::layer::layer_fn(|inner| NormalizationService { inner }))
        // .layer(PiiLayer)
        .service(EchoService);
}

Architectural Pattern: The “Transform Spec”

To ensure reproducibility, define your preprocessing pipeline as a serializable configuration (JSON/YAML), not just code.

# pipeline_config.yaml
steps:
  - op: "normalize_unicode"
    params: { form: "NFC" }
  - op: "regex_replace"
    params: { pattern: "https?://...", repl: "<URL>" }
  - op: "lowercase"
  - op: "strip_accents"
  - op: "pii_redact"
    params: { types: ["EMAIL", "PHONE"] }

Your Rust engine reads this config and constructs the pipeline dynamically. This allows you to A/B test different preprocessing strategies (e.g., keeping vs. removing punctuation) without recompiling the binary.
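
A hedged sketch of how such a spec might be deserialized with serde (the Step enum, the serde_yaml dependency, and the dispatch function are illustrative choices, not a fixed schema):

use serde::Deserialize;

/// One entry in pipeline_config.yaml: `op` selects the variant, `params` carries its arguments.
#[derive(Deserialize)]
#[serde(tag = "op", content = "params", rename_all = "snake_case")]
pub enum Step {
    NormalizeUnicode { form: String },
    RegexReplace { pattern: String, repl: String },
    Lowercase,
    StripAccents,
    PiiRedact { types: Vec<String> },
}

#[derive(Deserialize)]
pub struct PipelineConfig {
    pub steps: Vec<Step>,
}

pub fn load_config(yaml: &str) -> Result<PipelineConfig, serde_yaml::Error> {
    serde_yaml::from_str(yaml)
}

/// Dispatch a single step to the functions defined earlier in this section.
pub fn apply_step(step: &Step, text: &str) -> String {
    match step {
        // `form` is assumed to be NFC here.
        Step::NormalizeUnicode { .. } => normalize_text(text),
        Step::Lowercase => text.to_lowercase(),
        // Remaining ops (regex_replace, strip_accents, pii_redact) dispatch similarly.
        _ => text.to_string(),
    }
}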

Handling Emojis

Emojis are semantic. Stripping them removes sentiment.

  • Recommendation: Use the emo crate or mappings to convert emojis to text (demojizing) if your tokenizer doesn’t support them well.
  • Example: 😊 -> :blush: or [EMOJI_HAPPY]. A minimal mapping sketch follows.
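
A minimal demojizing sketch with a hand-rolled map (single-codepoint emojis only; a production pipeline should iterate over grapheme clusters, e.g. with unicode-segmentation, to handle multi-codepoint ZWJ sequences):

use std::collections::HashMap;

/// Replace known single-codepoint emojis with text tags; everything else passes through.
pub fn demojize(text: &str, map: &HashMap<char, &str>) -> String {
    text.chars()
        .map(|c| match map.get(&c) {
            Some(tag) => format!(" {} ", tag),
            None => c.to_string(),
        })
        .collect()
}

pub fn default_emoji_map() -> HashMap<char, &'static str> {
    // Illustrative entries only.
    [('😊', "[EMOJI_HAPPY]"), ('🔥', "[EMOJI_FIRE]")]
        .into_iter()
        .collect()
}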

Troubleshooting Guide: Why is my Regex Slow?

If your preprocessing latency spikes, check:

  1. Recompilation: Are you compiling Regex::new() inside a loop? Compile it once (using once_cell or lazy_static).
  2. Pattern size: Excessive wildcards (.*) or nested groups cause catastrophic backtracking in Python. In Rust they still run in linear time, but a very large compiled automaton can increase memory use and per-byte cost.
  3. Unicode: Unicode-aware character classes make patterns larger and matching slower. If you know inputs are ASCII, use regex::bytes::Regex or disable Unicode with the (?-u) flag for a significant speedup.

Summary

For NLP MLOps, preprocessing is strict ETL.

  1. Consistency: UTF-8 NFC always.
  2. Safety: Linear-time regexes.
  3. Reproducibility: Config-driven pipelines versioned with git.
  4. Scale: Streaming paradigms or Polars for throughput.
  5. Quality: Deduplication using MinHash is non-negotiable for LLM pre-training.
  6. Performance: Minimizing allocation via Cow<str> and SmallString.