Chapter 11: The Feature Store Architecture
11.2. AWS Implementation: SageMaker Feature Store
“Data is the new oil, but unrefined crude oil is useless. A Feature Store is the refinery that turns raw logs into high-octane fuel for your models, delivering it to the engine nozzle at 5 milliseconds latency.”
In the previous section, we established the fundamental architectural necessity of the Feature Store: solving the Online/Offline Skew. We defined the problem of training on historical batch data while serving predictions on live, single-row contexts.
Now, we descend from the theoretical clouds into the concrete reality of Amazon Web Services.
AWS SageMaker Feature Store is not a single database. It is a dual-store architecture that orchestrates data consistency between a high-throughput NoSQL layer (DynamoDB) and an immutable object storage layer (S3), bound together by a replication daemon and a metadata catalog (Glue).
For the Principal Engineer or Architect, treating SageMaker Feature Store as a “black box” is a recipe for cost overruns and latency spikes. You must understand the underlying primitives. This chapter dissects the service, exposing the wiring, the billing traps, the concurrency models, and the code patterns required to run it at scale.
11.2.1. The Dual-Store Architecture: Hot and Cold
The core value proposition of the SageMaker Feature Store is that it manages two conflicting requirements simultaneously:
- The Online Store (Hot Tier): Requires single-digit millisecond latency for GetRecord operations during inference.
- The Offline Store (Cold Tier): Requires massive throughput for batch ingestion and “Time Travel” queries for training dataset construction.
Under the hood, AWS implements this using a Write-Through Caching pattern with asynchronous replication.
The Anatomy of a Write
When you call PutRecord via the API (or the Boto3 SDK), the following sequence occurs:
- Ingestion: The record hits the SageMaker Feature Store API endpoint.
- Validation: Schema validation occurs against the Feature Group definition.
- Online Write (Synchronous): The data is written to a managed Amazon DynamoDB table. This is the “Online Store.” The API call does not return 200 OK until this write is durable.
- Replication (Asynchronous): An internal stream (invisible to the user, but conceptually similar to DynamoDB Streams) buffers the change.
- Offline Write (Batched): The buffered records are micro-batched and flushed to Amazon S3 in Parquet format (or Iceberg). This is the “Offline Store.”
- Catalog Sync: The AWS Glue Data Catalog is updated to recognize the new S3 partitions, making them queryable via Amazon Athena.
Architectural implication: The Online Store is strongly consistent (for the latest write). The Offline Store is eventually consistent. The replication lag is typically less than 5 minutes, but you cannot rely on the Offline Store for real-time analytics.
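To make the write path concrete, here is a minimal sketch of the client side of that sequence; the feature group name and values are illustrative placeholders borrowed from the churn example used later in this section.
# Minimal PutRecord sketch (illustrative names and values)
import time
import boto3

sm_runtime = boto3.client("sagemaker-featurestore-runtime")
sm_runtime.put_record(
    FeatureGroupName="customer-churn-features-v1",
    Record=[
        {"FeatureName": "customer_id", "ValueAsString": "C-1001"},
        {"FeatureName": "event_timestamp", "ValueAsString": str(time.time())},
        {"FeatureName": "avg_transaction_amt", "ValueAsString": "50.0"},
    ],
)
# The call returns only after the synchronous DynamoDB write;
# the S3 flush and Glue catalog sync happen asynchronously afterwards.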
11.2.2. The Online Store: Managing the DynamoDB Backend
The Online Store is a managed DynamoDB table. However, unlike a raw DynamoDB table you provision yourself, you have limited control over its indexes. You control it primarily through the FeatureGroup configuration.
Identity and Time: The Two Pillars
Every record in the Feature Store is uniquely identified by a composite key composed of two required definitions:
- RecordIdentifierName: The Primary Key (PK). Examples: user_id, session_id, product_id.
- EventTimeFeatureName: The Timestamp (Sort Key context). This is strictly required to resolve collisions and enable time-travel.
Critical Anti-Pattern: Do not use “wall clock time” (processing time) for EventTimeFeatureName. You must use “event time” (when the event actually occurred). If you use processing time, and you backfill historical data, your feature store will think the historical data is “new” and overwrite your current state in the Online Store.
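A small sketch of the distinction, assuming a hypothetical upstream event payload that carries its own occurred_at timestamp:
# Build the record from the event's own timestamp (event time), never the ingestion clock
def build_record(event):  # 'event' is a hypothetical upstream payload
    return [
        {"FeatureName": "user_id", "ValueAsString": event["user_id"]},
        # Correct: when the event actually happened
        {"FeatureName": "event_timestamp", "ValueAsString": str(event["occurred_at"])},
        # Wrong: stamping with the current wall clock would make backfilled history
        # look "new" and overwrite fresher state in the Online Store
        {"FeatureName": "last_clicked_category", "ValueAsString": event["category"]},
    ]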
Throughput Modes and Cost
AWS offers two throughput modes for the Online Store, which directly map to DynamoDB capacity modes:
- Provisioned Mode: You specify Read Capacity Units (RCU) and Write Capacity Units (WCU).
  - Use Case: Predictable traffic (e.g., a batch job that runs every hour, or steady website traffic).
  - Cost Risk: If you over-provision, you pay for idle capacity. If you under-provision, you get ProvisionedThroughputExceededException errors, and your model inference fails.
- On-Demand Mode: AWS scales the underlying table automatically.
  - Use Case: Spiky traffic or new launches where load is unknown.
  - Cost Risk: The cost per request is significantly higher than provisioned.
The Billing Mathematics: A “Write Unit” is defined as a write payload up to 1KB.
- If your feature vector is 1.1KB, you are charged for 2 Write Units.
- Optimization Strategy: Keep feature names short. Do not store massive JSON blobs or base64 images in the Feature Store. Store references (S3 URLs) instead.
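A back-of-the-envelope sketch of the 1KB rounding above; the payload size, request volume, and the ~$1.25 per million write units rate (quoted in the cost summary at the end of this section) are illustrative assumptions.
# Write-unit rounding and rough monthly write cost (illustrative numbers)
import math

payload_kb = 1.1                       # serialized feature vector size
writes_per_month = 100_000_000         # assumed traffic
price_per_million_write_units = 1.25   # assumed rate, see the cost summary below

write_units_per_record = math.ceil(payload_kb)   # 1.1 KB -> 2 write units
monthly_cost = writes_per_month * write_units_per_record / 1_000_000 * price_per_million_write_units
print(write_units_per_record, monthly_cost)      # 2, 250.0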
TTL (Time To Live) Management
A common form of technical debt is the “Zombie Feature.” A user visits your site once in 2021. Their feature record (last_clicked_category) sits in the Online Store forever, costing you storage fees every month.
The Fix: Enable TtlDuration in your Feature Group definition.
- AWS automatically deletes records from the Online Store after the TTL expires (e.g., 30 days).
- Crucially, this does not delete them from the Offline Store. You preserve the history for training (long-term memory) while keeping the inference cache (short-term memory) lean and cheap.
# Defining a Feature Group with TTL and On-Demand Throughput
# (assumes sagemaker_session, the pandas DataFrame df, and the IAM role are already defined)
from sagemaker.feature_store.feature_group import FeatureGroup
feature_group = FeatureGroup(
name="customer-churn-features-v1",
sagemaker_session=sagemaker_session
)
feature_group.load_feature_definitions(data_frame=df)
feature_group.create(
s3_uri="s3://my-ml-bucket/feature-store/",
record_identifier_name="customer_id",
event_time_feature_name="event_timestamp",
role_arn=role,
enable_online_store=True,
online_store_config={
"TtlDuration": {"Unit": "Days", "Value": 90} # Evict after 90 days
}
)
11.2.3. The Offline Store: S3, Iceberg, and the Append-Only Log
The Offline Store is your system of record. It is an append-only log of every feature update that has ever occurred.
The Storage Structure
If you inspect the S3 bucket, you will see a hive-partitioned structure:
s3://bucket/AccountID/sagemaker/Region/OfflineStore/FeatureGroupName/data/year=2023/month=10/day=25/hour=12/...
This partitioning allows Athena to query specific time ranges efficiently, minimizing S3 scan costs.
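A sketch of such a range-restricted query, assuming the Glue table registers the year/month/day partition columns shown in the path above (database and table names are illustrative; if your table is not partition-aware, filter on the event time column instead).
# Hypothetical Athena query that prunes the scan to a single day of partitions
import boto3

athena = boto3.client("athena")
query = """
SELECT customer_id, avg_transaction_amt, event_timestamp
FROM "sagemaker_featurestore"."customer_churn_features_v1"
WHERE year = '2023' AND month = '10' AND day = '25'
"""
athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://my-ml-bucket/athena-results/"},
)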
The Apache Iceberg Evolution
Historically, SageMaker stored data in standard Parquet files. This created the “Small File Problem” (thousands of small files slowing down queries) and made complying with GDPR “Right to be Forgotten” (hard deletes) excruciatingly difficult.
In late 2023, AWS introduced Apache Iceberg table format support for Feature Store.
- ACID Transactions: Enables consistent reads and atomic updates on S3.
- Compaction: Automatically merges small files into larger ones for better read performance.
- Time Travel: Iceberg natively supports querying “as of” a snapshot ID.
Architectural Recommendation: For all new Feature Groups, enable Iceberg format. The operational headaches it solves regarding compaction and deletion are worth the migration.
# Enabling Iceberg format
offline_store_config = {
"S3StorageConfig": {
"S3Uri": "s3://my-ml-bucket/offline-store/"
},
"TableFormat": "Iceberg" # <--- The Modern Standard
}
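The dict above maps onto the low-level boto3 call; the following sketch shows one way to pass it when creating the group (the group name, feature definitions, and role are placeholders, with role assumed to be defined as in the earlier example).
# Sketch: creating an Iceberg-backed Feature Group via the low-level API
import boto3

sm = boto3.client("sagemaker")
sm.create_feature_group(
    FeatureGroupName="customer-churn-features-v2",
    RecordIdentifierFeatureName="customer_id",
    EventTimeFeatureName="event_timestamp",
    FeatureDefinitions=[
        {"FeatureName": "customer_id", "FeatureType": "String"},
        {"FeatureName": "event_timestamp", "FeatureType": "Fractional"},
        {"FeatureName": "avg_transaction_amt", "FeatureType": "Fractional"},
    ],
    OnlineStoreConfig={"EnableOnlineStore": True},
    OfflineStoreConfig=offline_store_config,  # the dict defined above
    RoleArn=role,
)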
11.2.4. Ingestion Patterns: The Pipeline Jungle
How do you get data into the store? There are three distinct architectural patterns, each with specific trade-offs.
Pattern A: The Streaming Ingest (Low Latency)
- Source: Clickstream data from Kinesis or Kafka.
- Compute: AWS Lambda or Flink (Kinesis Data Analytics).
- Mechanism: The consumer calls put_record() for each event.
- Pros: Features are available for inference immediately (sub-second).
- Cons: High cost (one API call per record). Throughput limits on the API.
Pattern B: The Micro-Batch Ingest (SageMaker Processing)
- Source: Daily dumps in S3 or Redshift.
- Compute: SageMaker Processing Job (Spark container).
- Mechanism: The Spark job transforms data and uses the feature_store_pyspark connector.
- Pros: High throughput. Automatic multithreading.
- Cons: Latency (features are hours old).
Pattern C: The “Batch Load” API (The Highway)
AWS introduced the BatchLoad API to solve the slowness of put_record loops.
- Mechanism: You point the Feature Store at a CSV/Parquet file in S3. The management plane ingests it directly into the Offline Store and replicates to Online.
- Pros: Extremely fast, no client-side compute management.
Code: Robust Streaming Ingestion Wrapper
Do not just call put_record in a raw loop. You must handle ProvisionedThroughputExceededException with exponential backoff.
import boto3
import time
from botocore.exceptions import ClientError
sm_runtime = boto3.client("sagemaker-featurestore-runtime")
def robust_put_record(feature_group_name, record, max_retries=3):
"""
Ingests a record with exponential backoff for throttling.
record: List of dicts [{'FeatureName': '...', 'ValueAsString': '...'}]
"""
retries = 0
while retries < max_retries:
try:
sm_runtime.put_record(
FeatureGroupName=feature_group_name,
Record=record
)
return True
except ClientError as e:
error_code = e.response['Error']['Code']
if error_code in ['ThrottlingException', 'ProvisionedThroughputExceededException', 'InternalFailure']:
wait_time = (2 ** retries) * 0.1 # 100ms, 200ms, 400ms...
time.sleep(wait_time)
retries += 1
else:
# Validation errors or non-transient issues -> Fail hard
raise e
raise Exception(f"Failed to ingest after {max_retries} retries")
11.2.5. Solving the “Point-in-Time” Correctness Problem
This is the most mathematically complex part of Feature Store architecture.
The Problem: You are training a fraud model to predict if a transaction at T=10:00 was fraudulent.
- At 10:00, the user’s avg_transaction_amt was $50.
- At 10:05, the user made a huge transaction.
- At 10:10, the avg_transaction_amt updated to $500.
- You are training the model a week later. If you query the “current” value, you get $500. This is Data Leakage: you are using information from the future (10:05) to predict the past (10:00).
The Solution: Point-in-Time (Time Travel) Queries. You must join your Label Table (List of transactions and timestamps) with the Feature Store such that for every row $i$, you retrieve the feature state at $t_{feature} \le t_{event}$.
AWS SageMaker provides a built-in method for this via the create_dataset API, which generates an Athena query under the hood.
# The "Time Travel" Query Construction
feature_group_query = feature_group.athena_query()
feature_group_table = feature_group_query.table_name
query_string = f"""
SELECT
T.transaction_id,
T.is_fraud as label,
T.event_time,
F.avg_transaction_amt,
F.account_age_days
FROM
"app_db"."transactions" T
LEFT JOIN
"{feature_group_table}" F
ON
T.user_id = F.user_id
AND F.event_time = (
SELECT MAX(event_time)
FROM "{feature_group_table}"
WHERE user_id = T.user_id
AND event_time <= T.event_time -- <--- THE MAGIC CLAUSE
)
"""
Architectural Note: The query above is conceptually what happens, but running MAX subqueries in Athena is expensive. The SageMaker SDK’s FeatureStore.create_dataset() method generates a more optimized (albeit uglier) SQL query using window functions (row_number() over (partition by ... order by event_time desc)).
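For intuition, the window-function form looks roughly like the sketch below; this is the shape of the query, not the exact SQL the SDK emits.
# Sketch of the window-function variant of the point-in-time join
query_string = f"""
SELECT transaction_id, label, event_time, avg_transaction_amt, account_age_days
FROM (
    SELECT
        T.transaction_id,
        T.is_fraud AS label,
        T.event_time,
        F.avg_transaction_amt,
        F.account_age_days,
        ROW_NUMBER() OVER (
            PARTITION BY T.transaction_id
            ORDER BY F.event_time DESC          -- newest feature row first
        ) AS rn
    FROM "app_db"."transactions" T
    LEFT JOIN "{feature_group_table}" F
        ON T.user_id = F.user_id
        AND F.event_time <= T.event_time        -- never look into the future
)
WHERE rn = 1                                    -- keep only the latest valid feature row
"""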
11.2.6. Retrieval at Inference Time: The Millisecond Barrier
When your inference service (running on SageMaker Hosting or Lambda) needs features, it calls GetRecord.
Latency Budget:
- Network overhead (Lambda to DynamoDB): ~1-2ms.
- DynamoDB lookup: ~2-5ms.
- Deserialization: ~1ms.
- Total: ~5-10ms.
If you need features for multiple entities (e.g., ranking 50 items for a user), sequential GetRecord calls will kill your performance (50 * 10ms = 500ms).
Optimization: use BatchGetRecord.
This API allows you to retrieve records from multiple feature groups in parallel. Under the hood, it utilizes DynamoDB’s BatchGetItem.
# Batch Retrieval for Ranking
response = sm_runtime.batch_get_record(
Identifiers=[
{
'FeatureGroupName': 'user-features',
'RecordIdentifiersValueAsString': ['user_123']
},
{
'FeatureGroupName': 'item-features',
'RecordIdentifiersValueAsString': ['item_A', 'item_B', 'item_C']
}
]
)
# Result is a single JSON payload with all vectors
11.2.7. Advanced Topic: Handling Embeddings and Large Vectors
With the rise of GenAI, engineers often try to shove 1536-dimensional embeddings (from OpenAI text-embedding-3-small) into the Feature Store.
The Constraint: The Online Store (DynamoDB) has a hard limit of 400KB per item. A float32 vector of dimension 1536 is: $1536 \times 4 \text{ bytes} \approx 6 \text{ KB}$. This fits easily.
However, if you try to store chunks of text context alongside the vector, you risk hitting the limit.
Architectural Pattern: The “Hybrid Pointer” If the payload exceeds 400KB (e.g., long document text):
- Store the large text in S3.
- Store the S3 URI and the Embedding Vector in the Feature Store.
- Inference: The model consumes the vector directly. If it needs the text (for RAG), it fetches from S3 asynchronously.
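A sketch of what such a “pointer” record might look like; the feature names, the JSON-string encoding of the vector, and the helper function are illustrative choices, not a prescribed schema.
# Hybrid pointer sketch: the vector travels inline, the bulky text stays in S3
import json
import time

def build_document_record(doc_id, embedding, text_s3_uri):
    # A 1536-dim float32 vector is roughly 6 KB, far below the 400 KB item limit
    return [
        {"FeatureName": "doc_id", "ValueAsString": doc_id},
        {"FeatureName": "event_timestamp", "ValueAsString": str(time.time())},
        {"FeatureName": "embedding", "ValueAsString": json.dumps(embedding)},
        {"FeatureName": "text_s3_uri", "ValueAsString": text_s3_uri},  # pointer, not payload
    ]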
New Feature Alert: As of late 2023, SageMaker Feature Store supports Vector data types directly. This allows integration with k-NN (k-Nearest Neighbors) search, effectively turning the Feature Store into a lightweight Vector Database. However, for massive scale vector search (millions of items), dedicated services like OpenSearch Serverless (Chapter 22) are preferred.
11.2.8. Security, Governance, and Lineage
In regulated environments (FinTech, HealthTech), the Feature Store is a critical audit point.
Encryption
- At Rest: The Online Store (DynamoDB) and Offline Store (S3) must be encrypted using AWS KMS. Use a Customer Managed Key (CMK) for granular access control.
- In Transit: TLS 1.2+ is enforced by AWS endpoints.
Fine-Grained Access Control (FGAC)
You can use IAM policies to restrict access to specific features.
- Scenario: The Marketing team can read user_demographics features. The Risk team can read credit_score features.
- Implementation: Resource-based tags on the Feature Group and IAM conditions.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Deny",
"Action": "sagemaker:GetRecord",
"Resource": "arn:aws:sagemaker:*:*:feature-group/*",
"Condition": {
"StringEquals": {
"aws:ResourceTag/sensitivity": "high"
}
}
}
]
}
Lineage Tracking
Because Feature Store integrates with SageMaker Lineage Tracking, every model trained using a dataset generated from the Feature Store automatically links back to the specific query and data timeframe used. This answers the auditor’s question: “What exact data state caused the model to deny this loan on Feb 14th?”
11.2.9. Operational Realities and “Gotchas”
1. The Schema Evolution Trap
Unlike a SQL database, you cannot easily ALTER TABLE to change a column type from String to Integral.
- Reality: Feature Groups are immutable regarding schema types.
- Workaround: Create feature-group-v2, backfill data, and switch the application pointer. This is why abstracting feature retrieval behind a generic API service (rather than direct SDK calls in app code) is crucial.
2. The “Eventual Consistency” of the Offline Store
Do not write a test that puts a record and immediately queries Athena to verify it. The replication lag to S3 can be 5-15 minutes.
- Testing Pattern: For integration tests, query the Online Store to verify ingestion. Use the Offline Store only for batch capabilities.
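A minimal integration-test sketch along those lines, assuming the churn feature group from earlier and a pytest-style runner:
# Verify ingestion via the Online Store (strongly consistent), never via Athena
import time
import boto3

sm_runtime = boto3.client("sagemaker-featurestore-runtime")
FG_NAME = "customer-churn-features-v1"   # assumed feature group

def test_put_then_get_roundtrip():
    sm_runtime.put_record(
        FeatureGroupName=FG_NAME,
        Record=[
            {"FeatureName": "customer_id", "ValueAsString": "it-test-001"},
            {"FeatureName": "event_timestamp", "ValueAsString": str(time.time())},
            {"FeatureName": "avg_transaction_amt", "ValueAsString": "42.0"},
        ],
    )
    response = sm_runtime.get_record(
        FeatureGroupName=FG_NAME,
        RecordIdentifierValueAsString="it-test-001",
    )
    values = {f["FeatureName"]: f["ValueAsString"] for f in response["Record"]}
    assert values["avg_transaction_amt"] == "42.0"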
3. Cross-Account Access
A common enterprise pattern is:
- Data Account: Holds the S3 buckets and Feature Store.
- Training Account: Runs SageMaker Training Jobs.
- Serving Account: Runs Lambda functions.
This requires complex IAM Trust Policies and Bucket Policies. The Feature Store requires the sagemaker-featurestore-execution-role to have permissions on the S3 bucket in the other account. It also requires the KMS key policy to allow cross-account usage. “Access Denied” on the Offline Store replication is the most common setup failure.
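As a starting point, here is a sketch of the Data Account’s bucket policy granting an execution role from another account access to the Offline Store prefix; the account ID, role name, bucket, and action list are placeholders, and the KMS key policy needs an equivalent cross-account statement.
# Sketch: cross-account bucket policy for the Offline Store (illustrative ARNs)
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CrossAccountOfflineStoreAccess",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::222222222222:role/sagemaker-featurestore-execution-role"
            },
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket", "s3:GetBucketAcl"],
            "Resource": [
                "arn:aws:s3:::my-ml-bucket",
                "arn:aws:s3:::my-ml-bucket/offline-store/*"
            ]
        }
    ]
}

boto3.client("s3").put_bucket_policy(Bucket="my-ml-bucket", Policy=json.dumps(policy))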
11.2.10. Real-World Case Study: Fintech Fraud Detection
Company: SecureBank (anonymized)
Challenge: Fraud detection model with 500ms p99 latency requirement, processing 10,000 transactions/second.
Initial Architecture (Failed):
# Naive implementation - calling Feature Store for each transaction
def score_transaction(transaction_id, user_id, merchant_id):
# Problem: 3 sequential API calls
user_features = sm_runtime.get_record(
FeatureGroupName='user-features',
RecordIdentifierValueAsString=user_id
) # 15ms
merchant_features = sm_runtime.get_record(
FeatureGroupName='merchant-features',
RecordIdentifierValueAsString=merchant_id
) # 15ms
transaction_features = compute_transaction_features(transaction_id) # 5ms
# Total: 35ms per transaction
prediction = model.predict([user_features, merchant_features, transaction_features])
return prediction
# At 10k TPS: 35ms * 10,000 = 350 seconds of compute per second
# IMPOSSIBLE - need 350 parallel instances minimum
Optimized Architecture:
# Solution 1: Batch retrieval
def score_transaction_batch(transactions):
"""Process batch of 100 transactions at once"""
# Collect all entity IDs
user_ids = [t['user_id'] for t in transactions]
merchant_ids = [t['merchant_id'] for t in transactions]
# Single batch call
response = sm_runtime.batch_get_record(
Identifiers=[
{
'FeatureGroupName': 'user-features',
'RecordIdentifiersValueAsString': user_ids
},
{
'FeatureGroupName': 'merchant-features',
'RecordIdentifiersValueAsString': merchant_ids
}
]
)
# Total: 20ms for 100 transactions = 0.2ms per transaction
# At 10k TPS: 100 batches/sec * 20ms = 2 seconds of compute per second -> a handful of workers (vs. 350)
# Build feature matrix
feature_matrix = []
for txn in transactions:
user_feat = extract_features(response, 'user-features', txn['user_id'])
merch_feat = extract_features(response, 'merchant-features', txn['merchant_id'])
txn_feat = compute_transaction_features(txn)
feature_matrix.append(user_feat + merch_feat + txn_feat)
# Batch prediction
predictions = model.predict(feature_matrix)
return predictions
# Solution 2: Local caching for hot entities
from cachetools import TTLCache
import threading
class CachedFeatureStore:
def __init__(self, ttl_seconds=60, max_size=100000):
self.cache = TTLCache(maxsize=max_size, ttl=ttl_seconds)
self.lock = threading.Lock()
# Metrics
self.hits = 0
self.misses = 0
def get_features(self, feature_group, entity_id):
cache_key = f"{feature_group}:{entity_id}"
# Check cache
with self.lock:
if cache_key in self.cache:
self.hits += 1
return self.cache[cache_key]
# Cache miss - fetch from Feature Store
self.misses += 1
features = sm_runtime.get_record(
FeatureGroupName=feature_group,
RecordIdentifierValueAsString=entity_id
)
# Update cache
with self.lock:
self.cache[cache_key] = features
return features
def get_hit_rate(self):
total = self.hits + self.misses
return self.hits / total if total > 0 else 0
# For top 1% of users (power users), cache hit rate = 95%
# Effective latency: 0.95 * 0ms + 0.05 * 15ms = 0.75ms
Results:
- P99 latency: 45ms → 8ms (82% improvement)
- Infrastructure cost: $12k/month → $3k/month (75% reduction)
- False positive rate: 2.3% → 1.8% (better features available faster)
11.2.11. Advanced Patterns: Streaming Feature Updates
For ultra-low latency requirements, waiting for batch materialization is too slow.
Pattern: Direct Write from Kinesis
import base64
import boto3
import json
from datetime import datetime
kinesis = boto3.client('kinesis')
sm_runtime = boto3.client('sagemaker-featurestore-runtime')
def lambda_handler(event, context):
    """
    Lambda function triggered by a Kinesis stream.
    Computes features in real time and writes them to the Feature Store.
    """
for record in event['Records']:
# Decode the base64-encoded Kinesis payload
payload = json.loads(base64.b64decode(record['kinesis']['data']))
# Example: User clicked an ad
user_id = payload['user_id']
ad_id = payload['ad_id']
timestamp = payload['timestamp']
# Compute real-time features
# "Number of clicks in last 5 minutes"
recent_clicks = count_recent_clicks(user_id, minutes=5)
# Write to Feature Store immediately
feature_record = [
{'FeatureName': 'user_id', 'ValueAsString': user_id},
{'FeatureName': 'event_time', 'ValueAsString': str(timestamp)},
{'FeatureName': 'clicks_last_5min', 'ValueAsString': str(recent_clicks)},
{'FeatureName': 'last_ad_clicked', 'ValueAsString': ad_id}
]
try:
sm_runtime.put_record(
FeatureGroupName='user-realtime-features',
Record=feature_record
)
except Exception as e:
# Log but don't fail - eventual consistency is acceptable
print(f"Failed to write feature: {e}")
return {'statusCode': 200}
Pattern: Feature Store + ElastiCache Dual-Write
For absolute minimum latency (<1ms), bypass Feature Store for reads:
import json
import redis
import boto3
from datetime import datetime
class HybridFeatureStore:
"""
Writes to both Feature Store (durability) and Redis (speed)
Reads from Redis first, falls back to Feature Store
"""
def __init__(self):
self.redis = redis.StrictRedis(
host='my-cache.cache.amazonaws.com',
port=6379,
db=0,
decode_responses=True
)
self.sm_runtime = boto3.client('sagemaker-featurestore-runtime')
def write_features(self, feature_group, entity_id, features):
"""Dual-write pattern"""
# 1. Write to Feature Store (durable, auditable)
record = [
{'FeatureName': 'entity_id', 'ValueAsString': entity_id},
{'FeatureName': 'event_time', 'ValueAsString': str(datetime.now().timestamp())}
]
for key, value in features.items():
record.append({'FeatureName': key, 'ValueAsString': str(value)})
self.sm_runtime.put_record(
FeatureGroupName=feature_group,
Record=record
)
# 2. Write to Redis (fast retrieval)
redis_key = f"{feature_group}:{entity_id}"
self.redis.setex(
redis_key,
3600, # 1 hour TTL
json.dumps(features)
)
def read_features(self, feature_group, entity_id):
"""Read from Redis first, fallback to Feature Store"""
# Try Redis first (sub-millisecond)
redis_key = f"{feature_group}:{entity_id}"
cached = self.redis.get(redis_key)
if cached:
return json.loads(cached)
# Fallback to Feature Store
response = self.sm_runtime.get_record(
FeatureGroupName=feature_group,
RecordIdentifierValueAsString=entity_id
)
# Parse and cache
features = {f['FeatureName']: f['ValueAsString']
for f in response['Record']}
# Warm cache for next request
self.redis.setex(redis_key, 3600, json.dumps(features))
return features
11.2.12. Monitoring and Alerting
CloudWatch Metrics to Track:
import boto3
from datetime import datetime, timedelta
cloudwatch = boto3.client('cloudwatch')
def publish_feature_store_metrics(feature_group_name):
"""
Custom metrics for Feature Store health
"""
# 1. Ingestion lag
# Time between event_time and write_time
response = sm_runtime.get_record(
FeatureGroupName=feature_group_name,
RecordIdentifierValueAsString='sample_user_123'
)
record_map = {f['FeatureName']: f['ValueAsString'] for f in response['Record']}
event_time = float(record_map['event_time'])
current_time = datetime.now().timestamp()
lag_seconds = current_time - event_time
cloudwatch.put_metric_data(
Namespace='FeatureStore',
MetricData=[
{
'MetricName': 'IngestionLag',
'Value': lag_seconds,
'Unit': 'Seconds',
'Dimensions': [
{'Name': 'FeatureGroup', 'Value': feature_group_name}
]
}
]
)
# 2. Feature completeness
# Percentage of entities with non-null values
null_count = count_null_features(feature_group_name)
total_count = count_total_records(feature_group_name)
completeness = 100 * (1 - null_count / total_count)
cloudwatch.put_metric_data(
Namespace='FeatureStore',
MetricData=[
{
'MetricName': 'FeatureCompleteness',
'Value': completeness,
'Unit': 'Percent',
'Dimensions': [
{'Name': 'FeatureGroup', 'Value': feature_group_name}
]
}
]
)
# 3. Online/Offline consistency
# Sample check of 100 random entities
consistency_rate = check_online_offline_consistency(feature_group_name, sample_size=100)
cloudwatch.put_metric_data(
Namespace='FeatureStore',
MetricData=[
{
'MetricName': 'ConsistencyRate',
'Value': consistency_rate,
'Unit': 'Percent',
'Dimensions': [
{'Name': 'FeatureGroup', 'Value': feature_group_name}
]
}
]
)
# CloudWatch Alarms
def create_feature_store_alarms():
"""
Set up alerts for Feature Store issues
"""
# Alarm 1: High ingestion lag
cloudwatch.put_metric_alarm(
AlarmName='FeatureStore-HighIngestionLag',
ComparisonOperator='GreaterThanThreshold',
EvaluationPeriods=2,
MetricName='IngestionLag',
Namespace='FeatureStore',
Period=300,
Statistic='Average',
Threshold=300.0, # 5 minutes
ActionsEnabled=True,
AlarmActions=['arn:aws:sns:us-east-1:123456789012:ml-alerts'],
AlarmDescription='Feature Store ingestion lag exceeds 5 minutes'
)
# Alarm 2: Low feature completeness
cloudwatch.put_metric_alarm(
AlarmName='FeatureStore-LowCompleteness',
ComparisonOperator='LessThanThreshold',
EvaluationPeriods=3,
MetricName='FeatureCompleteness',
Namespace='FeatureStore',
Period=300,
Statistic='Average',
Threshold=95.0, # 95%
ActionsEnabled=True,
AlarmActions=['arn:aws:sns:us-east-1:123456789012:ml-alerts'],
AlarmDescription='Feature completeness below 95%'
)
# Alarm 3: Online/Offline inconsistency
cloudwatch.put_metric_alarm(
AlarmName='FeatureStore-Inconsistency',
ComparisonOperator='LessThanThreshold',
EvaluationPeriods=2,
MetricName='ConsistencyRate',
Namespace='FeatureStore',
Period=300,
Statistic='Average',
Threshold=99.0, # 99%
ActionsEnabled=True,
AlarmActions=['arn:aws:sns:us-east-1:123456789012:ml-alerts'],
AlarmDescription='Feature Store consistency below 99%'
)
11.2.13. Cost Optimization Strategies
Strategy 1: TTL-Based Eviction
Don’t pay for stale features:
# Configure TTL when creating Feature Group
feature_group.create(
s3_uri=f"s3://{bucket}/offline-store/",
record_identifier_name="user_id",
event_time_feature_name="event_time",
role_arn=role,
enable_online_store=True,
online_store_config={
"TtlDuration": {
"Unit": "Days",
"Value": 30 # Evict after 30 days of inactivity
}
}
)
# For a system with 100M users:
# - Active users (30 days): 20M users * 10KB = 200GB
# - Without TTL: 100M users * 10KB = 1,000GB
# Cost savings: (1000 - 200) * $1.25/GB = $1,000/month
Strategy 2: Tiered Feature Storage
class TieredFeatureAccess:
"""
Hot features: Online Store (expensive, fast)
Warm features: Offline Store + cache (medium cost, medium speed)
Cold features: S3 only (cheap, slow)
"""
def __init__(self):
self.online_features = ['clicks_last_hour', 'current_session_id']
self.warm_features = ['total_lifetime_value', 'account_age_days']
self.cold_features = ['historical_purchases_archive']
def get_features(self, user_id, required_features):
results = {}
# Hot path: Online Store
hot_needed = [f for f in required_features if f in self.online_features]
if hot_needed:
hot_data = sm_runtime.get_record(
FeatureGroupName='hot-features',
RecordIdentifierValueAsString=user_id
)
results.update(hot_data)
# Warm path: Query Athena (cached)
warm_needed = [f for f in required_features if f in self.warm_features]
if warm_needed:
warm_data = query_athena_cached(user_id, warm_needed)
results.update(warm_data)
# Cold path: Direct S3 read (rare)
cold_needed = [f for f in required_features if f in self.cold_features]
if cold_needed:
cold_data = read_from_s3_archive(user_id, cold_needed)
results.update(cold_data)
return results
Strategy 3: Provisioned Capacity for Predictable Load
# Instead of On-Demand, use Provisioned for cost savings
import math

# Calculate required capacity
peak_rps = 10000 # requests per second
avg_feature_size_kb = 5
# 1 RCU = one strongly consistent read of up to 4KB per second
read_capacity_units = peak_rps * math.ceil(avg_feature_size_kb / 4) # 20,000 RCU
# Provision with buffer
provisioned_rcu = int(read_capacity_units * 1.2) # 20% buffer -> 24,000 RCU
# Cost comparison (illustrative rates):
# On-Demand: $1.25 per million reads = $1.25 * 10000 * 3600 * 24 * 30 / 1M = $32,400/month
# Provisioned: $0.00013 per RCU-hour * 24,000 RCU * 730 hours = ~$2,300/month
# Savings: ~93%
# Update Feature Group to use provisioned capacity
# Note: This requires recreating the Feature Group
11.2.14. Disaster Recovery and Backup
Pattern: Cross-Region Replication
# Setup: Replicate Offline Store across regions
def setup_cross_region_replication(source_bucket, dest_bucket, dest_region):
"""
Enable S3 cross-region replication for Offline Store
"""
s3 = boto3.client('s3')
replication_config = {
'Role': 'arn:aws:iam::123456789012:role/S3ReplicationRole',
'Rules': [
{
'ID': 'ReplicateFeatureStore',
'Status': 'Enabled',
'Priority': 1,
'Filter': {'Prefix': 'offline-store/'},
'DeleteMarkerReplication': {'Status': 'Disabled'},  # required when a Filter is specified
'Destination': {
'Bucket': f'arn:aws:s3:::{dest_bucket}',
'ReplicationTime': {
'Status': 'Enabled',
'Time': {'Minutes': 15}
},
'Metrics': {
'Status': 'Enabled',
'EventThreshold': {'Minutes': 15}
}
}
}
]
}
s3.put_bucket_replication(
Bucket=source_bucket,
ReplicationConfiguration=replication_config
)
# Backup: Point-in-time snapshot
def create_feature_store_snapshot(feature_group_name, snapshot_date):
"""
Create immutable snapshot of Feature Store state
"""
athena = boto3.client('athena')
# Query all features at specific point in time
query = f"""
CREATE TABLE feature_snapshots.{feature_group_name}_{snapshot_date}
WITH (
format='PARQUET',
external_location='s3://backups/snapshots/{feature_group_name}/{snapshot_date}/'
) AS
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY event_time DESC
) as rn
FROM "sagemaker_featurestore"."{feature_group_name}"
WHERE event_time <= TIMESTAMP '{snapshot_date} 23:59:59'
)
WHERE rn = 1
"""
response = athena.start_query_execution(
QueryString=query,
ResultConfiguration={'OutputLocation': 's3://athena-results/'}
)
return response['QueryExecutionId']
# Restore procedure
import numpy as np
import pandas as pd

def restore_from_snapshot(snapshot_path, target_feature_group):
"""
Restore Feature Store from snapshot
"""
# 1. Load snapshot data
df = pd.read_parquet(snapshot_path)
# 2. Batch write to Feature Store
for chunk in np.array_split(df, max(1, len(df) // 1000)):  # guard against tiny snapshots
records = []
for _, row in chunk.iterrows():
record = [
{'FeatureName': col, 'ValueAsString': str(row[col])}
for col in df.columns
]
records.append(record)
# Batch ingestion
for record in records:
sm_runtime.put_record(
FeatureGroupName=target_feature_group,
Record=record
)
11.2.15. Best Practices Summary
- Use Batch Retrieval: Always prefer batch_get_record over sequential get_record calls
- Enable TTL: Don’t pay for inactive users’ features indefinitely
- Monitor Lag: Track ingestion lag and alert if > 5 minutes
- Cache Strategically: Use ElastiCache for hot features
- Provision Wisely: Use Provisioned Capacity for predictable workloads
- Test Point-in-Time: Verify training data has no data leakage
- Version Features: Use Feature Group versions for schema evolution
- Replicate Offline Store: Enable cross-region replication for DR
- Optimize Athena: Partition and compress Offline Store data
- Audit Everything: Log all feature retrievals for compliance
11.2.16. Troubleshooting Guide
| Issue | Symptoms | Solution |
|---|---|---|
| High latency | p99 > 100ms | Use batch retrieval, add caching |
| ThrottlingException | Sporadic failures | Increase provisioned capacity or use exponential backoff |
| Features not appearing | GetRecord returns empty | Check ingestion pipeline, verify event_time |
| Offline Store lag | Athena queries stale | Replication can take 5-15 min, check CloudWatch |
| Schema mismatch | Validation errors | Features are immutable, create new Feature Group |
| High costs | Bill increasing | Enable TTL, use tiered storage, optimize queries |
11.2.17. Exercises
Exercise 1: Latency Optimization
Benchmark get_record vs batch_get_record for 100 entities. Measure:
- Total time
- P50, P95, P99 latencies
- Cost per 1M requests
Exercise 2: Cost Analysis
Calculate monthly cost for your workload:
- 50M users
- 15KB average feature vector
- 1000 RPS peak
- Compare On-Demand vs Provisioned
Exercise 3: Disaster Recovery
Implement and test:
- Backup procedure
- Restore procedure
- Measure RTO and RPO
Exercise 4: Monitoring Dashboard
Create a CloudWatch dashboard showing:
- Ingestion lag
- Online/Offline consistency
- Feature completeness
- Error rates
Exercise 5: Point-in-Time Verification
Write a test that:
- Creates synthetic event stream
- Generates training data
- Verifies no data leakage
11.2.18. Summary
AWS SageMaker Feature Store provides a managed dual-store architecture that solves the online/offline skew problem through:
Key Capabilities:
- Dual Storage: DynamoDB (online, low-latency) + S3 (offline, historical)
- Point-in-Time Correctness: Automated time-travel queries via Athena
- Integration: Native SageMaker Pipelines and Glue Catalog support
- Security: KMS encryption, IAM controls, VPC endpoints
Cost Structure:
- Online Store: ~$1.25/GB-month
- Offline Store: ~$0.023/GB-month
- Write requests: ~$1.25 per million
Best Use Cases:
- Real-time inference with <10ms requirements
- Compliance requiring audit trails
- Teams already in AWS ecosystem
- Need for point-in-time training data
Avoid When:
- Batch-only inference (use S3 directly)
- Extremely high throughput (>100k RPS without caching)
- Need for complex relational queries
Critical Success Factors:
- Batch retrieval for performance
- TTL for cost control
- Monitoring for consistency
- Caching for ultra-low latency
- DR planning for reliability
In the next section, we will explore the Google Cloud Platform equivalent—Vertex AI Feature Store—which takes a radically different architectural approach by relying on Bigtable and BigQuery.