Building Production GenAI Systems in Regulated Industries: A Technical Guide

After spending years building and operating GenAI systems in regulated industries like fintech, I've learned that the path from proof-of-concept to production is filled with challenges that have nothing to do with model performance. This guide shares the architecture patterns, monitoring strategies, and compliance considerations that actually matter when deploying GenAI at scale.
The Production Reality Gap
Most GenAI demos are impressive. Most production deployments face these realities:
- Hallucinations happen - Even the best models produce incorrect outputs
- Latency varies - From 1s to minutes depending on load and complexity
- Costs scale non-linearly - Token usage can spike unpredictably
- Compliance is mandatory - Every decision needs an audit trail
- Security is paramount - PII, trade secrets, and sensitive data flow through prompts
In regulated environments, these aren't just engineering challenges - they're business and legal requirements.
Core Architecture Principles
Separation of Concerns
The temptation with GenAI is to throw everything into a single function: "user input goes in, LLM output comes out." This works for demos but fails in production. Each responsibility (preprocessing, prompt construction, LLM calls, validation, logging) should be isolated and testable independently.
Your GenAI application should be decomposed into clear layers:
class GenAIClassifier:
    async def classify(self, input_data, context):
        # 1. Preprocess and sanitize input
        processed = self.preprocessor.process(input_data)

        # 2. Build prompt with automatic version tracking
        prompt = self.prompt_builder.build(processed)

        # 3. Call LLM with error handling
        try:
            response = await self.llm_client.generate(prompt)
        except (TimeoutError, RateLimitError) as e:
            return self.fallback_handler.handle(input_data, e)

        # 4. Validate output
        validated = self.validator.validate(response)

        # 5. Log for audit
        self.audit_logger.log({
            "prompt_version": self.prompt_builder.version,
            "output": validated,
            "latency_ms": response.latency
        })
        return validated

Prompt Management as Code
In my experience, prompts change more frequently than code. Product managers want to tweak wording. Compliance teams need to add disclaimers. Testing reveals edge cases that require prompt adjustments. Hardcoding prompts into Python strings creates friction and makes collaboration difficult.
Store prompts in YAML files for easier management and collaboration:
# prompts/classifier.yaml
name: document_classifier
description: Classification prompt for financial documents
system_message: |
  You are a classification system for financial documents.
  Your task is to categorize documents into exactly one of these categories:
  - INVOICE
  - CONTRACT
  - REPORT
  - OTHER
  Requirements:
  - Output ONLY the category name
  - If uncertain, output OTHER
  - Do not include explanations
user_template: |
  Document: {document_text}
parameters:
  temperature: 0.1
  max_tokens: 50

A lightweight loader can then read these files and derive a prompt version from the content hash:

import hashlib
import yaml

class PromptLoader:
    def load(self, prompt_name: str):
        with open(f"prompts/{prompt_name}.yaml") as f:
            content = f.read()
        config = yaml.safe_load(content)
        # Auto-generate version from content hash
        config['version'] = hashlib.sha256(content.encode()).hexdigest()[:8]
        return config

    def build_messages(self, prompt_name: str, variables: dict):
        config = self.load(prompt_name)
        return [
            {"role": "system", "content": config['system_message']},
            {"role": "user", "content": config['user_template'].format(**variables)}
        ]

Benefits of YAML-based prompts:
- Non-developers can review and edit prompts
- Clean diffs in version control
- Metadata and parameters alongside the prompt
- Automatic versioning from file content hash
- Easy A/B testing with multiple YAML files
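For illustration, here is how the loader above might be used at request time; the file name mirrors the classifier.yaml example and the document text is made up:

loader = PromptLoader()

# Build the chat messages for one document; the text is illustrative
messages = loader.build_messages("classifier", {"document_text": "Invoice #12345 for consulting services"})

# The content hash serves as the prompt version recorded in audit logs
prompt_version = loader.load("classifier")["version"]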
Defense in Depth
LLMs are probabilistic. Even with temperature set to 0, they can produce unexpected outputs. Schema changes, hallucinations, or adversarial inputs can all bypass simple validation. Your validation strategy should assume that every layer might fail and implement multiple independent checks.
Modern production systems implement defense in depth using guardrails - programmatic policies that enforce safety, compliance, and quality constraints on both inputs and outputs in real-time:
class DefenseInDepthValidator:
    def __init__(self):
        self.pii_detector = PIIDetector()
        self.toxicity_checker = ToxicityChecker()
        self.schema_validator = SchemaValidator(ClassificationResult)

    async def validate(self, user_input: str, llm_output: str) -> dict:
        # Layer 1: Input validation - detect and redact PII
        if self.pii_detector.contains_pii(user_input):
            user_input = self.pii_detector.redact(user_input)

        # Layer 2: Safety check - block toxic content
        if self.toxicity_checker.is_toxic(user_input):
            raise SafetyError("Input contains toxic content")

        # Layer 3: Schema validation - ensure output structure
        validated_output = self.schema_validator.validate(llm_output)

        # Layer 4: Business rules validation
        if not self.passes_business_rules(validated_output):
            raise ValidationError("Business rule validation failed")

        # Layer 5: Compliance checks
        if not self.passes_compliance_checks(validated_output):
            raise ComplianceError("Output violates compliance rules")

        return validated_output

Key guardrail categories for regulated environments:
| Layer | Purpose | Example Tools |
|---|---|---|
| Safety | Block toxic/harmful content | Guardrails AI, Llama Guard, NeMo Guardrails |
| Privacy | Prevent PII leakage | Presidio, custom validators |
| Compliance | Enforce regulatory constraints | Custom validators, domain rules |
| Quality | Ensure accuracy/relevance | FactualConsistency checks |
| Schema | Validate structure | Pydantic, JSON Schema |
Example compliance guardrail for financial services:
class FinancialComplianceGuardrail:
    """Prevent outputs that could constitute unauthorized financial advice."""

    PROHIBITED_PHRASES = [
        "you should invest",
        "guaranteed returns",
        "risk-free investment"
    ]

    def validate(self, output: str) -> tuple[bool, str]:
        for phrase in self.PROHIBITED_PHRASES:
            if phrase.lower() in output.lower():
                self.metrics.increment("compliance_violation")
                return False, f"Prohibited financial advice detected: '{phrase}'"
        return True, output

Structured Outputs
Free-form text responses are unreliable for production systems. When you need to extract specific fields or make decisions based on LLM output, parsing unstructured text is fragile and error-prone. Modern LLM APIs support JSON mode and function calling, which constrain outputs to valid schemas [1].
from pydantic import BaseModel
from openai import OpenAI

class ClassificationResult(BaseModel):
    category: str
    confidence: float
    reasoning: str

client = OpenAI()

def classify_with_schema(document_text: str) -> ClassificationResult:
    # parse() enforces the Pydantic schema via the API's structured-output mode
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Classify the document."},
            {"role": "user", "content": document_text}
        ],
        response_format=ClassificationResult
    )
    return response.choices[0].message.parsed

Benefits of structured outputs:
- Guaranteed valid JSON - No parsing failures from malformed responses
- Type safety - Pydantic validation catches schema violations
- Better prompting - The schema itself guides the model's output
- Simplified code - No regex or string parsing needed
For providers without native structured output support, libraries like Instructor [2] provide similar functionality across different LLM APIs.
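As a rough sketch of that approach, Instructor patches the provider client so a Pydantic model can be passed directly. This assumes Instructor's from_openai wrapper and reuses the ClassificationResult model from above; check the library's documentation for the current API:

import instructor
from openai import OpenAI

# Patch the OpenAI client so create() accepts a response_model
instructor_client = instructor.from_openai(OpenAI())

def classify_with_instructor(document_text: str) -> ClassificationResult:
    # Instructor validates the response against the Pydantic model
    # (and can retry when validation fails)
    return instructor_client.chat.completions.create(
        model="gpt-4o",
        response_model=ClassificationResult,
        messages=[
            {"role": "system", "content": "Classify the document."},
            {"role": "user", "content": document_text},
        ],
    )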
Building for Reliability
In production, LLM APIs will fail. Rate limits get hit during traffic spikes. Upstream services degrade. Latency increases under load. Your system needs to gracefully handle these failures without cascading to dependent services or degrading the entire application.
Circuit Breakers
When an LLM API starts failing, continuing to send requests makes things worse. Circuit breakers detect repeated failures and stop making requests temporarily, giving the upstream service time to recover while protecting your application from timeout accumulation. This pattern, popularized by Michael Nygard in "Release It!" [3], is essential for resilient distributed systems.
Prevent cascade failures when external services (LLM APIs) degrade:
from circuitbreaker import circuit

class LLMClient:
    @circuit(failure_threshold=5, recovery_timeout=60)
    async def generate(self, messages, **kwargs):
        return await self.api_client.chat.completions.create(
            messages=messages, **kwargs
        )

Graceful Degradation
When the LLM is unavailable, returning an error to the user is often the wrong choice. Depending on your use case, you might have rule-based fallbacks, simplified prompts, or human review queues. The key is ensuring your application continues to provide value even when the AI component fails.
Have meaningful fallbacks ready:
class FallbackHandler:
    def handle(self, input_data, error):
        if isinstance(error, TimeoutError):
            return self.try_simplified_prompt(input_data)
        elif isinstance(error, RateLimitError):
            return self.rule_based_classifier(input_data)
        else:
            return self.queue_for_human_review(input_data)

Multi-Provider Redundancy
Relying on a single LLM provider is a single point of failure. Provider outages, rate limits, and regional issues can take down your entire application. A multi-provider strategy ensures continuity when your primary provider has issues.
class MultiProviderLLMClient:
    def __init__(self):
        self.providers = [
            {"name": "openai", "client": OpenAIClient(), "priority": 1},        # Primary
            {"name": "anthropic", "client": AnthropicClient(), "priority": 2},  # Fallback
            {"name": "azure", "client": AzureOpenAIClient(), "priority": 3},    # Last resort
        ]

    async def generate(self, messages: list, **kwargs):
        for provider in sorted(self.providers, key=lambda x: x["priority"]):
            try:
                # Returns immediately on success - no other providers called
                return await provider["client"].generate(messages, **kwargs)
            except (RateLimitError, ServiceUnavailableError):
                # Only falls back to next provider on specific failures
                self.metrics.increment(f"provider_fallback_{provider['name']}")
                continue  # Try next provider
        raise AllProvidersFailedError("All LLM providers unavailable")

Considerations for multi-provider setups:
- Prompt compatibility - Different providers may need slightly different prompts
- Response normalization - Standardize response formats across providers
- Cost implications - Fallback providers may have different pricing
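A minimal sketch of response normalization, assuming OpenAI- and Anthropic-style response objects; the NormalizedResponse type is hypothetical:

from dataclasses import dataclass

@dataclass
class NormalizedResponse:
    content: str
    input_tokens: int
    output_tokens: int
    provider: str

def normalize_openai(resp) -> NormalizedResponse:
    # OpenAI-style chat completion: choices + usage.prompt/completion_tokens
    return NormalizedResponse(
        content=resp.choices[0].message.content,
        input_tokens=resp.usage.prompt_tokens,
        output_tokens=resp.usage.completion_tokens,
        provider="openai",
    )

def normalize_anthropic(resp) -> NormalizedResponse:
    # Anthropic Messages API: content blocks + usage.input/output_tokens
    return NormalizedResponse(
        content=resp.content[0].text,
        input_tokens=resp.usage.input_tokens,
        output_tokens=resp.usage.output_tokens,
        provider="anthropic",
    )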
Semantic Caching
LLM API calls are expensive and slow. Many production workloads have significant query overlap where similar questions should return similar answers. Semantic caching uses embeddings to identify similar queries and return cached responses, reducing costs by 30-70% in high-traffic scenarios [4].
import hashlib
import numpy as np
from redis import Redis

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.redis = Redis()
        self.threshold = similarity_threshold
        self.embedding_client = EmbeddingClient()

    async def get_or_generate(self, query: str, generate_fn):
        # Generate embedding for the query
        query_embedding = await self.embedding_client.embed(query)

        # Check for semantically similar cached queries
        cached = await self.find_similar(query_embedding)
        if cached:
            self.metrics.increment("cache_hit")
            return cached["response"]

        # Generate new response
        response = await generate_fn(query)

        # Cache with embedding for future similarity matching
        await self.store(query, query_embedding, response)
        return response

    async def find_similar(self, embedding: list) -> dict | None:
        # Use vector similarity search (Redis VSS, Pinecone, etc.)
        results = await self.redis.ft_search(
            embedding, k=1, score_threshold=self.threshold
        )
        return results[0] if results else None

For simpler use cases, exact-match caching with hashed prompts can also provide significant benefits:
import json

def get_cache_key(messages: list, model: str) -> str:
    content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()
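A usage sketch for the exact-match variant, assuming a Redis instance and the llm_client from earlier examples; the TTL and the model keyword are illustrative:

from redis import Redis

redis_client = Redis()
CACHE_TTL_SECONDS = 3600  # illustrative - tune to how quickly answers go stale

async def cached_generate(messages: list, model: str) -> str:
    key = get_cache_key(messages, model)
    cached = redis_client.get(key)
    if cached is not None:
        return cached.decode()

    response = await llm_client.generate(messages, model=model)
    # Cache only the text needed downstream, with an expiry
    redis_client.setex(key, CACHE_TTL_SECONDS, response.content)
    return response.content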
Streaming Responses

For user-facing applications, waiting 5-10 seconds for a complete response creates a poor user experience. Streaming delivers tokens as they're generated, reducing perceived latency from seconds to milliseconds. Users see immediate feedback while the full response generates.
async def stream_response(messages: list):
    # Assumes an async OpenAI client (e.g. AsyncOpenAI)
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True
    )
    full_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            full_response += token
            yield token  # Send to client immediately

    # Log complete response for audit
    await audit_logger.log({"response": full_response})

Streaming considerations:
- Audit logging - Buffer the complete response for compliance logging
- Validation - Can only validate after stream completes
- Error handling - Streams can fail mid-response
- Token counting - Track usage from stream metadata, not content length
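To address the token-counting point, OpenAI-style streaming can be asked to append a usage chunk; a sketch assuming an async client and the stream_options flag (record_token_usage is a placeholder for your metrics call; verify support for your provider and SDK version):

async def stream_with_usage(messages: list):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
        stream_options={"include_usage": True},  # ask for a final usage chunk
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
        if chunk.usage:
            # The final chunk carries token counts; record them in your metrics layer
            record_token_usage(chunk.usage.prompt_tokens, chunk.usage.completion_tokens)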
Monitoring and Observability
Traditional application metrics (CPU, memory, requests/second) are necessary but insufficient for GenAI systems. You need metrics specific to LLM behavior: token consumption (which drives cost), output quality degradation, and prompt-level performance.
Without proper observability, you won't know when prompt changes degrade accuracy, when costs spike due to inefficient prompts, or when latency impacts user experience. Every LLM request should be instrumented.
Key Metrics to Track
| Metric Category | What to Track | Why It Matters |
|---|---|---|
| Performance | Latency (p50, p95, p99) | User experience |
| Quality | Validation failure rate | Model drift detection |
| Cost | Tokens per request | Budget control |
| Reliability | Circuit breaker state | Service health |
| Business | Human review queue size | Operational load |
from prometheus_client import Counter, Histogram, Gauge
# Track requests, latency, and token usage
llm_requests_total = Counter('llm_requests_total', 'Total requests', ['status'])
llm_latency = Histogram('llm_latency_seconds', 'Request latency')
llm_tokens = Histogram('llm_tokens_used', 'Tokens consumed')
# Track quality and business metrics
validation_failures = Counter('validation_failures', 'Failed validations')
review_queue_size = Gauge('review_queue_size', 'Items awaiting review')

LLM-Specific Observability Tools
While Prometheus handles infrastructure metrics, LLM-specific observability tools like Langfuse, LangSmith, or Phoenix provide trace-level visibility into prompt execution. You can see exactly which prompts are slow, which produce validation failures, and how changes to prompts affect quality over time.
import os
from langfuse import Langfuse

langfuse = Langfuse()  # Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY from env

async def classify_with_tracing(input_data: dict):
    trace = langfuse.trace(
        name="document-classification",
        metadata={
            "prompt_version": prompt_builder.version,
            "code_version": os.getenv("GIT_COMMIT_SHA", "unknown")[:8]
        }
    )
    generation = trace.generation(name="classify", input=messages)
    response = await llm_client.generate(messages)
    generation.end(output=response.content, usage={
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens
    })
    return response

Alerting
Alerts should trigger on conditions that indicate degraded service or compliance risk. High latency affects user experience. Validation failures suggest model drift or prompt issues. Circuit breakers opening indicate upstream problems that need immediate attention.
Set up alerts for:
# Example Prometheus alert rules
groups:
  - name: genai_reliability
    rules:
      - alert: HighLLMLatency
        expr: histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m])) > 10
        for: 5m
        annotations:
          summary: "95th percentile LLM latency above 10s"
      - alert: HighValidationFailureRate
        expr: rate(validation_failures_total[5m]) / rate(llm_requests_total[5m]) > 0.1
        for: 10m
        annotations:
          summary: "More than 10% of outputs failing validation"
      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state{service="llm"} == 1
        for: 2m
        annotations:
          summary: "LLM circuit breaker is open"

Compliance and Auditability
In regulated industries, "the model said so" is not an acceptable explanation. Auditors, compliance teams, and regulators need to understand why a system made a particular decision. This requires complete traceability: what input was received, which prompt version was used, what the model returned, and how it was validated.
Complete Audit Trails
Audit logs serve multiple purposes: debugging production issues, regulatory compliance, and improving models with production data. The key is capturing enough information to reconstruct any decision while respecting data retention and privacy requirements.
Every request needs to be traceable:
from sqlalchemy import Column, Integer, String, DateTime, Boolean
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class LLMRequestAudit(Base):
    __tablename__ = 'llm_request_audit'

    id = Column(Integer, primary_key=True)
    timestamp = Column(DateTime, nullable=False)
    request_id = Column(String, unique=True)

    # Track inputs, prompts, model, and outputs
    input_hash = Column(String)
    prompt_version = Column(String)
    model_name = Column(String)
    output_hash = Column(String)

    # Performance and compliance
    latency_ms = Column(Integer)
    tokens_used = Column(Integer)
    data_retention_until = Column(DateTime)
    pii_detected = Column(Boolean)

PII Detection and Redaction
Sending customer PII to third-party LLM APIs creates both privacy risks and compliance liabilities. In the EU, GDPR Article 32 requires organizations to implement "appropriate technical and organizational measures" to protect personal data [5]. Detecting and redacting PII before it reaches the model is essential, especially in industries like finance and healthcare where regulations like GDPR (EU), CCPA (California, US), or HIPAA (US) apply.
PII Categories and Risk Levels
Different types of PII carry different levels of risk and regulatory requirements. Understanding these categories helps you apply appropriate protections:
| PII Category | Examples | Risk Level | Regulatory Impact |
|---|---|---|---|
| Direct Identifiers | Name, SSN, Passport number | High | GDPR Art. 4(1) (EU), CCPA (California) |
| Financial Data | Credit card, bank account, IBAN | Critical | PCI DSS (global), PSD2 (EU) |
| Contact Information | Email, phone, address | Medium | GDPR (EU), CAN-SPAM (US) |
| Biometric Data | Fingerprints, facial recognition | Critical | GDPR Art. 9 (EU), BIPA (Illinois) |
| Health Information | Medical records, diagnoses | Critical | HIPAA (US), GDPR Art. 9 (EU) |
| Quasi-Identifiers | ZIP code + age + gender | Medium | Can re-identify when combined |
| Sensitive Attributes | Race, religion, political views | High | GDPR Art. 9 (EU), various state laws (US) |
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

class PIIHandler:
    # Organize entities by risk level
    CRITICAL_ENTITIES = ["CREDIT_CARD", "IBAN_CODE", "MEDICAL_LICENSE", "US_SSN"]
    HIGH_RISK_ENTITIES = ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"]
    MEDIUM_RISK_ENTITIES = ["LOCATION", "DATE_TIME", "IP_ADDRESS"]

    def __init__(self):
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()

    def detect_and_redact(self, text: str, risk_threshold: str = "medium") -> tuple[str, list]:
        # Select entities based on risk threshold
        entities_to_detect = []
        if risk_threshold in ["critical", "high", "medium"]:
            entities_to_detect.extend(self.CRITICAL_ENTITIES)
        if risk_threshold in ["high", "medium"]:
            entities_to_detect.extend(self.HIGH_RISK_ENTITIES)
        if risk_threshold == "medium":
            entities_to_detect.extend(self.MEDIUM_RISK_ENTITIES)

        results = self.analyzer.analyze(
            text=text,
            entities=entities_to_detect,
            language="en"
        )

        # Use different anonymization strategies based on PII type
        anonymized = self.anonymizer.anonymize(
            text=text,
            analyzer_results=results,
            operators={
                "CREDIT_CARD": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 12, "from_end": True}),
                "EMAIL_ADDRESS": OperatorConfig("hash"),
                "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"})
            }
        )
        return anonymized.text, results

Data Retention
Different types of data require different retention periods based on regulatory requirements and business needs. Under the EU's 5th Anti-Money Laundering Directive, financial institutions must retain customer due diligence data for at least five years [6]. Automated retention policies ensure compliance without manual intervention.
Compliance Reminder: Retention periods vary by jurisdiction and industry. Always consult your legal team for specific requirements. The examples below are illustrative only.
from datetime import datetime, timedelta

class AuditLogger:
    # Example retention periods - adjust based on your jurisdiction and use case
    # EU AML directive: 5 years; Tax records: varies by country (6-10 years)
    RETENTION_DAYS = {"financial": 1825, "standard": 730, "operational": 365}

    def log(self, audit_data: dict):
        retention = self.RETENTION_DAYS.get(audit_data.get("classification"), 730)
        audit_data['data_retention_until'] = datetime.now() + timedelta(days=retention)
        self.db.add(LLMRequestAudit(**audit_data))
        self.db.commit()

Data Residency
When using third-party LLM APIs, your data travels to their infrastructure. For regulated industries, this raises critical questions: Where is the data processed? Is it stored? Who can access it? In the EU, GDPR requires that personal data be processed with adequate protections, which may restrict transfers to certain jurisdictions [7]. Similar data localization requirements exist in other regions (e.g., China's PIPL, Russia's data localization law).
Key considerations for LLM data residency:
| Concern | Question to Answer | Mitigation |
|---|---|---|
| Processing Location | Where are API servers located? | Use regional endpoints (Azure OpenAI EU, etc.) |
| Data Retention | Does the provider store prompts/responses? | Review provider data policies, opt out of training |
| Subprocessors | Who else handles the data? | Review provider's subprocessor list |
| Cross-border Transfer | Does data leave your jurisdiction? | Use providers with local presence |
class DataResidencyValidator:
    ALLOWED_REGIONS = ["eu-west-1", "eu-central-1"]  # EU only

    def validate_provider(self, provider_config: dict) -> bool:
        if provider_config["region"] not in self.ALLOWED_REGIONS:
            raise DataResidencyError(
                f"Provider region {provider_config['region']} not in allowed regions"
            )
        if not provider_config.get("data_processing_agreement"):
            raise ComplianceError("DPA required for this provider")
        return True

For highly sensitive data, consider self-hosted models or on-premise solutions that keep data entirely within your infrastructure.
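As a sketch, many self-hosted inference servers (vLLM, Ollama, and others) expose OpenAI-compatible endpoints, so the standard client can be pointed at infrastructure you control; the URL and model name below are illustrative:

from openai import OpenAI

# Point the standard client at a self-hosted, OpenAI-compatible server
# running inside your own network - prompts never leave your infrastructure
local_client = OpenAI(
    base_url="http://llm.internal.example:8000/v1",  # illustrative internal endpoint
    api_key="unused-for-internal-endpoint",
)

response = local_client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # illustrative self-hosted model
    messages=[{"role": "user", "content": "Classify the attached document."}],
)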
AI Risk Classification
The EU AI Act introduces a risk-based regulatory framework that categorizes AI systems by their potential impact on fundamental rights and safety [8]. Understanding where your GenAI system falls in this classification determines your compliance obligations.
EU AI Act Risk Categories
| Risk Level | Definition | Examples | Key Requirements |
|---|---|---|---|
| Unacceptable | Prohibited systems | Social scoring, emotion recognition in workplaces, real-time biometric surveillance | Banned in EU |
| High-Risk | Significant impact on rights/safety | Employment decisions, credit scoring, law enforcement, critical infrastructure | Mandatory conformity assessment, documentation, human oversight |
| Limited Risk | Transparency concerns | Chatbots, emotion recognition (opt-in), deepfakes | Transparency obligations (disclose AI use) |
| Minimal Risk | Low impact | Spam filters, AI-enabled games, recommendation systems | No specific obligations |
GenAI systems in regulated industries typically fall under High-Risk when used for:
- Credit decisions - Loan approvals, credit scoring (EU AI Act Annex III, Point 5(b))
- Employment - Recruitment screening, performance evaluation (Annex III, Point 4)
- Essential services - Healthcare diagnostics, insurance underwriting
- Law enforcement - Risk assessment tools
High-risk systems require conformity assessments, human oversight, and comprehensive documentation before deployment.
High-Risk System Compliance Checklist
If your GenAI system is classified as high-risk, you must implement:
- Risk Management System
  - Identify and analyze known/foreseeable risks
  - Implement mitigation measures
  - Test with representative data
  - Monitor throughout lifecycle
- Data Governance
  - Training data quality and relevance
  - Bias detection in datasets
  - Data protection measures (GDPR compliance in EU, applicable privacy laws elsewhere)
- Technical Documentation
  - System design and architecture
  - Model cards with performance metrics
  - Intended use and limitations
  - Human oversight measures
- Transparency Requirements
  - Users must be informed they're interacting with AI
  - Provide clear information on system capabilities and limitations
  - Document decision-making logic
- Human Oversight
  - Humans can override AI decisions
  - Ability to interrupt system operation
  - Humans understand system capabilities
class HighRiskAICompliance:
    """Enforce EU AI Act requirements for high-risk systems."""

    def __init__(self):
        self.risk_register = []
        self.oversight_enabled = True

    async def process_decision(self, input_data: dict, context: dict) -> dict:
        # Document that AI is being used (transparency requirement)
        self.log_ai_disclosure(context["user_id"])

        # Generate AI recommendation
        ai_decision = await self.ai_model.predict(input_data)

        # High-risk systems require human oversight
        if self.requires_human_review(ai_decision):
            ai_decision["status"] = "pending_human_review"
            ai_decision["human_override_available"] = True
            await self.queue_for_human_review(ai_decision)
        else:
            # Log decision with full traceability
            await self.audit_logger.log({
                "decision_id": ai_decision["id"],
                "input_hash": self.hash_input(input_data),
                "output": ai_decision,
                "model_version": self.ai_model.version,
                "risk_assessment": self.assess_risk(ai_decision),
                "human_override_available": True
            })
        return ai_decision

    def requires_human_review(self, decision: dict) -> bool:
        """Determine if human review is required."""
        return (
            decision.get("confidence") < 0.85 or
            decision.get("risk_score") > 0.7 or
            decision.get("contradicts_rules", False)
        )

Compliance Reminder: The EU AI Act applies to providers placing AI systems on the EU market and deployers using AI systems in the EU, regardless of where the provider is established. If you serve EU customers or have EU operations, these requirements likely apply to you.
Model Risk Management
In heavily regulated industries like banking, model risk management (MRM) frameworks apply to GenAI systems. In the US, the Federal Reserve's SR 11-7 guidance [9] establishes expectations for model development, implementation, and use. Similar frameworks exist in other jurisdictions (e.g., EBA guidelines in the EU, PRA SS1/23 in the UK). While originally written for traditional statistical models, regulators increasingly expect these principles to apply to AI/ML systems.
Key MRM requirements for GenAI:
- Model Documentation - Document the model's purpose, design, limitations, and assumptions
- Independent Validation - Have the model reviewed by parties not involved in development
- Ongoing Monitoring - Track model performance and drift over time
- Change Management - Formal processes for prompt and model changes
class ModelRiskDocumentation:
    def generate_model_card(self, model_config: dict) -> dict:
        return {
            "model_name": model_config["name"],
            "version": model_config["version"],
            "intended_use": model_config["purpose"],
            "limitations": [
                "May hallucinate facts",
                "Performance varies with input length",
                "Not validated for languages other than English"
            ],
            "training_data": "Third-party foundation model - training data not disclosed",
            "evaluation_metrics": self.get_latest_evaluation_results(),
            "risk_rating": self.calculate_risk_tier(model_config),
            "approved_by": model_config.get("approval_record"),
            "next_review_date": self.calculate_review_date(model_config)
        }

Bias and Fairness
LLMs can perpetuate or amplify biases present in their training data. In regulated contexts like lending, hiring, or insurance, biased outputs can violate fair treatment regulations. In the EU, the AI Act classifies AI systems used in employment, credit, and essential services as high-risk, requiring bias testing and documentation [8]. In the US, the EEOC and CFPB have issued guidance on AI fairness in employment and lending decisions.
class FairnessEvaluator:
    PROTECTED_ATTRIBUTES = ["gender", "race", "age", "disability"]
    DISPARITY_THRESHOLD = 0.2  # illustrative threshold - tune to your fairness policy

    async def evaluate_bias(self, model, test_cases: list) -> dict:
        results = {"overall_parity": True, "disparities": []}

        for attribute in self.PROTECTED_ATTRIBUTES:
            # Test with demographic variations
            group_results = await self.test_across_groups(model, test_cases, attribute)

            # Check for statistical parity
            disparity = self.calculate_disparity(group_results)
            if disparity > self.DISPARITY_THRESHOLD:
                results["overall_parity"] = False
                results["disparities"].append({
                    "attribute": attribute,
                    "disparity_ratio": disparity,
                    "details": group_results
                })
        return results

Bias mitigation strategies:
- Test across demographics - Evaluate outputs for different demographic groups
- Prompt engineering - Include fairness instructions in system prompts
- Output filtering - Flag or block potentially discriminatory outputs
- Human review - Route high-stakes decisions through human oversight
- Regular audits - Periodic fairness evaluations with updated test sets
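One way to implement the "test across demographics" strategy is counterfactual testing: build test cases that differ only in a demographic signal and compare outcomes. A minimal sketch; the group labels, names, and template are purely illustrative:

# Hypothetical counterfactual test-case generator for test_across_groups
NAME_VARIANTS = {
    "group_a": ["James Smith", "Emily Johnson"],
    "group_b": ["Wei Chen", "Amara Okafor"],
}

def make_counterfactuals(template: str) -> dict[str, list[str]]:
    # template example: "Loan application from {name}, income 45,000 EUR, requested 10,000 EUR"
    return {
        group: [template.format(name=name) for name in names]
        for group, names in NAME_VARIANTS.items()
    }

# Outcomes per group can then be compared with calculate_disparity()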
Testing Production GenAI
Testing GenAI systems is fundamentally different from testing traditional software. The LLM itself is non-deterministic, but the components around it (prompt loaders, validators, fallback handlers) are fully testable. The key is knowing what to test and how.
Testing Strategy Overview
| Test Type | What to Test | Deterministic? | Execution Speed |
|---|---|---|---|
| Unit Tests | Prompt loading, validation logic, parsers | Yes | Fast (ms) |
| Integration Tests | Component interaction with mocked LLM | Yes | Fast (ms) |
| Evaluation Pipeline | Golden datasets, LLM-as-judge, regression detection, A/B testing | No | Slow (min) |
Unit Tests for Deterministic Components
While you can't reliably unit test LLM outputs, you can test everything else: prompt construction, output parsing, validation logic, and error handling. These tests run fast and catch regressions in your infrastructure code.
import pytest

def test_prompt_loader():
    loader = PromptLoader()
    messages = loader.build_messages("classifier", {"document_text": "Test doc"})
    assert "classification system" in messages[0]["content"]
    assert len(loader.load("classifier")['version']) == 8

def test_output_validator():
    with pytest.raises(ValidationError):
        OutputValidator().validate("INVALID_CATEGORY")

Integration Tests with LLM Mocking
Integration tests verify that components work together correctly without relying on actual LLM API calls. Mocking LLM responses lets you test error handling, validation logic, and business workflows deterministically.
from unittest.mock import patch

@pytest.mark.asyncio
async def test_classification_flow():
    mock_response = MockLLMResponse(content="INVOICE")
    with patch('llm_client.generate', return_value=mock_response):
        result = await classifier.classify({"text": "Invoice #12345"})
        assert result["category"] == "INVOICE"

Building a Robust Evaluation Pipeline
Production GenAI systems need continuous, multi-dimensional evaluation. Unlike traditional ML where accuracy on a test set is often sufficient, LLM outputs require evaluating multiple quality dimensions: correctness, relevance, coherence, safety, and task-specific criteria.
Golden Datasets
The foundation of any evaluation setup is a curated dataset with known correct answers. Golden datasets let you measure accuracy, detect regressions when prompts change, and compare different models or approaches systematically.
import pandas as pd
from sklearn.metrics import classification_report

async def evaluate_on_golden_set(classifier, golden_set_path: str) -> dict:
    test_data = pd.read_csv(golden_set_path)
    predictions = []

    for _, row in test_data.iterrows():
        result = await classifier.classify({"text": row["text"]})
        predictions.append(result["category"])

    report = classification_report(
        test_data["expected_category"], predictions, output_dict=True
    )
    return {
        "accuracy": report["accuracy"],
        "per_class_metrics": report,
        "num_samples": len(test_data)
    }

Tips for building golden datasets:
- Start small, grow iteratively - Begin with 50-100 high-quality examples, expand based on production edge cases
- Include edge cases - Add examples that have caused issues in production
- Version your datasets - Track changes as you add or modify examples
- Balance across categories - Ensure sufficient coverage of all expected output types
Evaluation Dimensions
| Dimension | What It Measures | Evaluation Method |
|---|---|---|
| Correctness | Factual accuracy | Ground truth comparison, fact-checking |
| Relevance | Response addresses the query | LLM-as-judge, semantic similarity |
| Coherence | Logical flow, readability | LLM-as-judge, human review |
| Safety | No harmful/toxic content | Guardrail checks, toxicity classifiers |
| Consistency | Same input → similar output | Multiple runs, variance analysis |
| Latency | Response time | Percentile tracking (p50, p95, p99) |
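The consistency row can be measured directly by re-running identical inputs; a small sketch using the classifier interface from earlier examples:

from collections import Counter

async def measure_consistency(classifier, input_data: dict, runs: int = 5) -> float:
    # Re-run the same input several times and report the share of runs that
    # agree with the most common answer (1.0 = fully consistent)
    outputs = []
    for _ in range(runs):
        result = await classifier.classify(input_data)
        outputs.append(result["category"])
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs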
LLM-as-Judge for Subjective Metrics
For dimensions like relevance and coherence, human evaluation doesn't scale. LLM-as-judge uses a separate model to evaluate outputs against defined criteria. This approach, while not perfect, correlates well with human judgments when prompts are carefully designed [10].
import json

class LLMEvaluator:
    EVALUATION_PROMPT = """
    Evaluate the following response on a scale of 1-5 for each criterion.

    Query: {query}
    Response: {response}

    Criteria:
    - Relevance: Does the response address the query?
    - Accuracy: Is the information factually correct?
    - Completeness: Does it cover all aspects of the query?

    Return JSON: {{"relevance": int, "accuracy": int, "completeness": int, "reasoning": str}}
    """

    async def evaluate(self, query: str, response: str) -> dict:
        result = await self.judge_model.generate(
            self.EVALUATION_PROMPT.format(query=query, response=response)
        )
        return json.loads(result)

Tip: Use a different model family for evaluation than for generation to avoid self-preference bias. If you generate with GPT-4o, consider evaluating with Claude, or vice versa.
Continuous Evaluation Pipeline
Evaluation shouldn't be a one-time activity. Set up automated pipelines that run on every prompt change, model update, or on a regular schedule to catch regressions early.
from datetime import datetime

class EvaluationPipeline:
    def __init__(self):
        self.metrics_store = MetricsStore()
        self.alert_threshold = 0.05  # 5% regression threshold

    async def run_evaluation(self, prompt_version: str) -> dict:
        # Load evaluation dataset
        eval_data = self.load_eval_dataset()
        results = {
            "prompt_version": prompt_version,
            "timestamp": datetime.now().isoformat(),
            "metrics": {}
        }

        # Run predictions
        predictions = await self.batch_predict(eval_data)

        # Calculate metrics across dimensions
        results["metrics"]["accuracy"] = self.calculate_accuracy(predictions, eval_data)
        results["metrics"]["avg_latency_ms"] = self.calculate_latency(predictions)
        results["metrics"]["llm_judge_scores"] = await self.run_llm_evaluation(predictions)

        # Compare against baseline
        baseline = self.metrics_store.get_baseline()
        regressions = self.detect_regressions(results["metrics"], baseline)

        if regressions:
            await self.alert_team(f"Regression detected: {regressions}")
            results["status"] = "regression_detected"
        else:
            results["status"] = "passed"

        self.metrics_store.save(results)
        return results

    def detect_regressions(self, current: dict, baseline: dict) -> list:
        regressions = []
        for metric, value in current.items():
            if isinstance(value, (int, float)):
                baseline_value = baseline.get(metric, value)
                if value < baseline_value * (1 - self.alert_threshold):
                    regressions.append(f"{metric}: {baseline_value:.2f} → {value:.2f}")
        return regressions

A/B Testing Prompts
When iterating on prompts, A/B testing lets you compare variants on real traffic before fully rolling out changes. This catches issues that evaluation datasets miss and provides statistical confidence in improvements.
import hashlib

class PromptABTest:
    def __init__(self, control_prompt: str, variant_prompt: str, traffic_split: float = 0.1):
        self.control = control_prompt
        self.variant = variant_prompt
        self.variant_traffic = traffic_split  # 10% to variant

    async def get_prompt(self, request_id: str) -> tuple[str, str]:
        # Deterministic assignment based on request_id
        bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
        if bucket < self.variant_traffic * 100:
            return self.variant, "variant"
        return self.control, "control"

    async def analyze_results(self) -> dict:
        control_metrics = await self.metrics_store.get_metrics(group="control")
        variant_metrics = await self.metrics_store.get_metrics(group="variant")
        return {
            "control": control_metrics,
            "variant": variant_metrics,
            "improvement": self.calculate_significance(control_metrics, variant_metrics)
        }

Key principles for robust evaluation:
- Evaluate before deploying - Block deployments that fail evaluation thresholds
- Track trends over time - Single-point metrics miss gradual degradation
- Combine automated and human review - Use humans for edge cases and calibration
- Version everything - Link evaluation results to specific prompt and model versions
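To make "evaluate before deploying" concrete, the pipeline above can gate a CI job; a sketch in which the PROMPT_VERSION environment variable and the exit-code convention are assumptions about your CI setup:

import asyncio
import os
import sys

async def main() -> int:
    pipeline = EvaluationPipeline()
    results = await pipeline.run_evaluation(prompt_version=os.environ["PROMPT_VERSION"])
    if results["status"] != "passed":
        print(f"Evaluation gate failed: {results['status']}")
        return 1  # non-zero exit blocks the deployment
    return 0

if __name__ == "__main__":
    sys.exit(asyncio.run(main()))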
Common Pitfalls
Key Insight: The following pitfalls are based on real production deployments. Avoiding these can save months of rework and significant cost.
Over-reliance on Model Performance
Problem: Focusing only on accuracy while ignoring latency, cost, and reliability.
I've seen teams celebrate 95% accuracy while ignoring that their p95 latency is 15 seconds or that token costs are 10x their budget. In production, a system that's 90% accurate, costs 100 USD/month, and responds in 2 seconds is often better than one that's 95% accurate, costs 2,000 USD/month, and takes 8 seconds.
Solution: Track composite metrics that matter to the business:
# Business-level SLA
sla_score = (
    0.4 * accuracy +                          # Correctness matters
    0.3 * (1 - p95_latency / max_latency) +   # Speed matters
    0.2 * availability +                      # Uptime matters
    0.1 * (1 - cost / budget)                 # Cost matters
)

Ignoring Prompt Injection Risks
Problem: Users can manipulate outputs through carefully crafted inputs.
Prompt injection attacks, where users craft inputs to manipulate LLM behavior, are an active area of security research. The OWASP Top 10 for LLM Applications lists prompt injection as the #1 risk [11]. Unlike SQL injection, there's no perfect defense - prompts and user inputs exist in the same context space.
Security Note: There is no foolproof defense against prompt injection. String pattern matching and input sanitization are easily bypassed. Focus on architectural defenses and output validation.
Solution: Defense in depth approach - no single mitigation is sufficient, but layering defenses reduces risk:
Note: Prompt injection is an evolving threat. Research from Simon Willison and others demonstrates that string pattern matching is insufficient [12]. Focus on:
- Proper role separation (system vs user messages)
- Output validation and structured outputs
- Monitoring for anomalous behavior
- Human review for high-stakes decisions
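A minimal illustration of role separation: untrusted content stays in the user message and is never interpolated into the system prompt (SYSTEM_PROMPT stands in for the version-controlled prompt from earlier sections):

def build_classification_messages(untrusted_document: str) -> list[dict]:
    return [
        # Trusted, version-controlled instructions
        {"role": "system", "content": SYSTEM_PROMPT},
        # Untrusted input, passed purely as data to classify
        {"role": "user", "content": untrusted_document},
    ]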
No Feedback Loop
Problem: No mechanism to improve the system based on production data.
Your model's accuracy on a test set is interesting. Its accuracy on real user queries is what matters. Without collecting feedback from production use (whether explicit thumbs up/down or implicit signals like user corrections), you're flying blind. Production data reveals edge cases your test set missed.
Solution: Implement human feedback collection:
class FeedbackCollector:
    async def collect(self, request_id: str, feedback: dict):
        await self.db.feedbacks.insert_one({
            "request_id": request_id,
            "rating": feedback["rating"],
            "correct": feedback.get("correct")
        })
        if await self.calculate_recent_accuracy() < 0.85:
            await self.trigger_retraining()

Underestimating Costs
Problem: Token usage spirals out of control in production.
What costs 5 dollars in development can cost 5,000 dollars in production when you're processing thousands of requests per day. Long prompts, verbose outputs, and unnecessary API calls add up quickly. Cost monitoring needs to be first-class, not an afterthought.
Solution: Implement cost tracking and budgets:
class CostController:
    async def check_budget(self, user_id: str):
        usage = await self.get_monthly_usage(user_id)
        budget = await self.get_user_budget(user_id)
        if usage >= budget * 0.9:
            await self.send_alert(user_id, "approaching_budget")
        if usage >= budget:
            raise BudgetExceededError(f"Budget exceeded: {usage}/{budget}")

Conclusion
Building production GenAI systems in regulated industries requires more than prompt engineering. It demands a comprehensive approach across architecture, reliability, observability, compliance, and testing.
Key Takeaways
| Priority | Focus Area | Why It Matters |
|---|---|---|
| Architecture | Separation of concerns, structured outputs, YAML prompts | Maintainability at scale |
| Reliability | Circuit breakers, multi-provider, caching, streaming | Service availability |
| Observability | Metrics, tracing, alerts | Detect issues early |
| Compliance | Audit logs, PII handling, data residency, bias testing | Legal requirements |
| Risk Management | Model documentation, validation, change control | Regulatory expectations |
| Testing & Evaluation | Unit tests, integration, LLM-as-judge, A/B testing, regression detection | Confidence in changes |
| Cost Control | Token tracking, caching, budgets | Financial sustainability |
The excitement of GenAI capabilities should never overshadow the fundamentals of production systems: reliability, observability, and security.
If you're building GenAI systems in regulated environments, focus on making the boring stuff excellent. The AI will only be as valuable as the infrastructure supporting it.
References

[1] OpenAI. (2025). "Structured Outputs." https://platform.openai.com/docs/guides/structured-outputs
[2] Instructor. (2025). "Structured outputs powered by LLMs." https://github.com/jxnl/instructor
[3] Nygard, M. (2018). "Release It! Design and Deploy Production-Ready Software" (2nd ed.). Pragmatic Bookshelf.
[4] Zilliz. (2025). "Semantic Cache: A Guide to Cache Optimization for LLMs." https://zilliz.com/learn/semantic-cache
[5] European Parliament and Council. (2016). "General Data Protection Regulation (GDPR) - Article 32: Security of processing." https://gdpr-info.eu/art-32-gdpr/
[6] European Parliament and Council. (2018). "Directive (EU) 2018/843 - Fifth Anti-Money Laundering Directive." https://eur-lex.europa.eu/eli/dir/2018/843/oj
[7] European Parliament and Council. (2016). "General Data Protection Regulation (GDPR) - Chapter V: Transfers of personal data to third countries." https://gdpr-info.eu/chapter-5/
[8] European Parliament and Council. (2024). "Regulation (EU) 2024/1689 - Artificial Intelligence Act." https://eur-lex.europa.eu/eli/reg/2024/1689/oj
[9] Board of Governors of the Federal Reserve System (US). (2011). "SR 11-7: Guidance on Model Risk Management." https://www.federalreserve.gov/supervisionreg/srletters/sr1107.htm
[10] Zheng, L., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. https://arxiv.org/abs/2306.05685
[11] OWASP Foundation. (2025). "OWASP Top 10 for Large Language Model Applications." https://genai.owasp.org/
[12] Willison, S. (2023-2025). "Prompt injection: What's the worst that can happen?" and ongoing research. https://simonwillison.net/series/prompt-injection/