
Building Production GenAI Systems in Regulated Industries: A Technical Guide

#genai #llm #production-ai #ai-compliance #llmops #regulated-industries #prompt-engineering
Sergio Valmorisco

GenAI Engineering Lead

Specializing in production AI systems for regulated industries, with experience deploying enterprise-scale GenAI applications in fintech and compliance workflows.

After spending years building and operating GenAI systems in regulated industries like fintech, I've learned that the path from proof-of-concept to production is filled with challenges that have nothing to do with model performance. This guide shares the architecture patterns, monitoring strategies, and compliance considerations that actually matter when deploying GenAI at scale.

The Production Reality Gap

Most GenAI demos are impressive. Most production deployments face these realities:

  • Hallucinations happen - Even the best models produce incorrect outputs
  • Latency varies - From 1s to minutes depending on load and complexity
  • Costs scale non-linearly - Token usage can spike unpredictably
  • Compliance is mandatory - Every decision needs an audit trail
  • Security is paramount - PII, trade secrets, and sensitive data flow through prompts

In regulated environments, these aren't just engineering challenges - they're business and legal requirements.

Core Architecture Principles

Separation of Concerns

The temptation with GenAI is to throw everything into a single function: "user input goes in, LLM output comes out." This works for demos but fails in production. Each responsibility (preprocessing, prompt construction, LLM calls, validation, logging) should be isolated and testable independently.


Your GenAI application should be decomposed into clear layers:

Python
class GenAIClassifier:
    async def classify(self, input_data, context):
        # 1. Preprocess and sanitize input
        processed = self.preprocessor.process(input_data)

        # 2. Build prompt with automatic version tracking
        prompt = self.prompt_builder.build(processed)

        # 3. Call LLM with error handling
        try:
            response = await self.llm_client.generate(prompt)
        except (TimeoutError, RateLimitError) as e:
            return self.fallback_handler.handle(input_data, e)

        # 4. Validate output
        validated = self.validator.validate(response)

        # 5. Log for audit
        self.audit_logger.log({
            "prompt_version": self.prompt_builder.version,
            "output": validated,
            "latency_ms": response.latency
        })

        return validated

Prompt Management as Code

In my experience, prompts change more frequently than code. Product managers want to tweak wording. Compliance teams need to add disclaimers. Testing reveals edge cases that require prompt adjustments. Hardcoding prompts into Python strings creates friction and makes collaboration difficult.

Store prompts in YAML files for easier management and collaboration:

YAML
# prompts/classifier.yaml
name: document_classifier
description: Classification prompt for financial documents
system_message: |
  You are a classification system for financial documents.
  Your task is to categorize documents into exactly one of these categories:
  - INVOICE
  - CONTRACT
  - REPORT
  - OTHER

  Requirements:
  - Output ONLY the category name
  - If uncertain, output OTHER
  - Do not include explanations
user_template: |
  Document: {document_text}
parameters:
  temperature: 0.1
  max_tokens: 50
Python
import hashlib
import yaml

class PromptLoader:
    def load(self, prompt_name: str):
        with open(f"prompts/{prompt_name}.yaml") as f:
            content = f.read()
        config = yaml.safe_load(content)
        # Auto-generate version from content hash
        config['version'] = hashlib.sha256(content.encode()).hexdigest()[:8]
        return config

    def build_messages(self, prompt_name: str, variables: dict):
        config = self.load(prompt_name)
        return [
            {"role": "system", "content": config['system_message']},
            {"role": "user", "content": config['user_template'].format(**variables)}
        ]

Benefits of YAML-based prompts:

  • Non-developers can review and edit prompts
  • Clean diffs in version control
  • Metadata and parameters alongside the prompt
  • Automatic versioning from file content hash
  • Easy A/B testing with multiple YAML files

Defense in Depth

LLMs are probabilistic. Even with temperature set to 0, they can produce unexpected outputs. Schema changes, hallucinations, or adversarial inputs can all bypass simple validation. Your validation strategy should assume that every layer might fail and implement multiple independent checks.

Modern production systems implement defense in depth using guardrails - programmatic policies that enforce safety, compliance, and quality constraints on both inputs and outputs in real-time:

Python
class DefenseInDepthValidator:
    def __init__(self):
        self.pii_detector = PIIDetector()
        self.toxicity_checker = ToxicityChecker()
        self.schema_validator = SchemaValidator(ClassificationResult)

    async def validate(self, user_input: str, llm_output: str) -> dict:
        # Layer 1: Input validation - detect and redact PII
        if self.pii_detector.contains_pii(user_input):
            user_input = self.pii_detector.redact(user_input)

        # Layer 2: Safety check - block toxic content
        if self.toxicity_checker.is_toxic(user_input):
            raise SafetyError("Input contains toxic content")

        # Layer 3: Schema validation - ensure output structure
        validated_output = self.schema_validator.validate(llm_output)

        # Layer 4: Business rules validation
        if not self.passes_business_rules(validated_output):
            raise ValidationError("Business rule validation failed")

        # Layer 5: Compliance checks
        if not self.passes_compliance_checks(validated_output):
            raise ComplianceError("Output violates compliance rules")

        return validated_output

Key guardrail categories for regulated environments:

| Layer | Purpose | Example Tools |
| --- | --- | --- |
| Safety | Block toxic/harmful content | Guardrails AI, Llama Guard, NeMo Guardrails |
| Privacy | Prevent PII leakage | Presidio, custom validators |
| Compliance | Enforce regulatory constraints | Custom validators, domain rules |
| Quality | Ensure accuracy/relevance | Factual consistency checks |
| Schema | Validate structure | Pydantic, JSON Schema |

Example compliance guardrail for financial services:

Python
class FinancialComplianceGuardrail:
    """Prevent outputs that could constitute unauthorized financial advice."""

    PROHIBITED_PHRASES = [
        "you should invest",
        "guaranteed returns",
        "risk-free investment"
    ]

    def validate(self, output: str) -> tuple[bool, str]:
        for phrase in self.PROHIBITED_PHRASES:
            if phrase.lower() in output.lower():
                self.metrics.increment("compliance_violation")
                return False, "Prohibited financial advice detected"
        return True, output

Structured Outputs

Free-form text responses are unreliable for production systems. When you need to extract specific fields or make decisions based on LLM output, parsing unstructured text is fragile and error-prone. Modern LLM APIs support JSON mode and function calling, which constrain outputs to valid schemas [1].

Python
from pydantic import BaseModel
from openai import OpenAI

class ClassificationResult(BaseModel):
    category: str
    confidence: float
    reasoning: str

client = OpenAI()

def classify_with_schema(document_text: str) -> ClassificationResult:
    # The SDK's parse() helper accepts a Pydantic model as response_format
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Classify the document."},
            {"role": "user", "content": document_text}
        ],
        response_format=ClassificationResult
    )
    return response.choices[0].message.parsed

Benefits of structured outputs:

  • Guaranteed valid JSON - No parsing failures from malformed responses
  • Type safety - Pydantic validation catches schema violations
  • Better prompting - The schema itself guides the model's output
  • Simplified code - No regex or string parsing needed

For providers without native structured output support, libraries like Instructor [2] provide similar functionality across different LLM APIs.
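
As a rough sketch of that approach (assuming Instructor's `from_openai` patch and its `response_model` parameter, and reusing the `ClassificationResult` model defined above):

Python
import instructor
from openai import OpenAI

# Patch the OpenAI client so completions accept a response_model
client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    model="gpt-4o",
    response_model=ClassificationResult,  # Pydantic model defined above
    messages=[
        {"role": "system", "content": "Classify the document."},
        {"role": "user", "content": "Invoice #12345 ..."}
    ],
)
print(result.category, result.confidence)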

Building for Reliability

In production, LLM APIs will fail. Rate limits get hit during traffic spikes. Upstream services degrade. Latency increases under load. Your system needs to gracefully handle these failures without cascading to dependent services or degrading the entire application.

Circuit Breakers

When an LLM API starts failing, continuing to send requests makes things worse. Circuit breakers detect repeated failures and stop making requests temporarily, giving the upstream service time to recover while protecting your application from timeout accumulation. This pattern, popularized by Michael Nygard in "Release It!" [3], is essential for resilient distributed systems.


Prevent cascade failures when external services (LLM APIs) degrade:

Python
from circuitbreaker import circuit

class LLMClient:
    @circuit(failure_threshold=5, recovery_timeout=60)
    async def generate(self, messages, **kwargs):
        return await self.api_client.chat.completions.create(
            messages=messages, **kwargs
        )

Graceful Degradation

When the LLM is unavailable, returning an error to the user is often the wrong choice. Depending on your use case, you might have rule-based fallbacks, simplified prompts, or human review queues. The key is ensuring your application continues to provide value even when the AI component fails.

Have meaningful fallbacks ready:

Python
class FallbackHandler:
    def handle(self, input_data, error):
        if isinstance(error, TimeoutError):
            return self.try_simplified_prompt(input_data)
        elif isinstance(error, RateLimitError):
            return self.rule_based_classifier(input_data)
        else:
            return self.queue_for_human_review(input_data)

Multi-Provider Redundancy

Relying on a single LLM provider is a single point of failure. Provider outages, rate limits, and regional issues can take down your entire application. A multi-provider strategy ensures continuity when your primary provider has issues.

Python
class MultiProviderLLMClient:
    def __init__(self):
        self.providers = [
            {"name": "openai", "client": OpenAIClient(), "priority": 1},        # Primary
            {"name": "anthropic", "client": AnthropicClient(), "priority": 2},  # Fallback
            {"name": "azure", "client": AzureOpenAIClient(), "priority": 3},    # Last resort
        ]

    async def generate(self, messages: list, **kwargs):
        for provider in sorted(self.providers, key=lambda x: x["priority"]):
            try:
                # Returns immediately on success - no other providers called
                return await provider["client"].generate(messages, **kwargs)
            except (RateLimitError, ServiceUnavailableError):
                # Only falls back to the next provider on specific failures
                self.metrics.increment(f"provider_fallback_{provider['name']}")
                continue  # Try next provider
        raise AllProvidersFailedError("All LLM providers unavailable")

Considerations for multi-provider setups:

  • Prompt compatibility - Different providers may need slightly different prompts
  • Response normalization - Standardize response formats across providers (see the sketch after this list)
  • Cost implications - Fallback providers may have different pricing
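
For the response-normalization point, one option is a provider-agnostic response type that every provider client maps into before returning. This is a sketch: the Anthropic mapping assumes its Messages API response shape, and `self.client` / `self.model` are assumed attributes of the wrapper.

Python
from dataclasses import dataclass

@dataclass
class NormalizedLLMResponse:
    content: str
    input_tokens: int
    output_tokens: int
    provider: str
    model: str

class AnthropicClient:
    async def generate(self, messages: list, **kwargs) -> NormalizedLLMResponse:
        raw = await self.client.messages.create(
            model=self.model, max_tokens=1024, messages=messages, **kwargs
        )
        # Map the provider-specific response onto the shared structure
        return NormalizedLLMResponse(
            content=raw.content[0].text,
            input_tokens=raw.usage.input_tokens,
            output_tokens=raw.usage.output_tokens,
            provider="anthropic",
            model=self.model,
        )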

Semantic Caching

LLM API calls are expensive and slow. Many production workloads have significant query overlap where similar questions should return similar answers. Semantic caching uses embeddings to identify similar queries and return cached responses, reducing costs by 30-70% in high-traffic scenarios [4].

Python
import numpy as np
from redis import Redis

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.redis = Redis()
        self.threshold = similarity_threshold
        self.embedding_client = EmbeddingClient()

    async def get_or_generate(self, query: str, generate_fn):
        # Generate embedding for the query
        query_embedding = await self.embedding_client.embed(query)

        # Check for semantically similar cached queries
        cached = await self.find_similar(query_embedding)
        if cached:
            self.metrics.increment("cache_hit")
            return cached["response"]

        # Generate new response
        response = await generate_fn(query)

        # Cache with embedding for future similarity matching
        await self.store(query, query_embedding, response)
        return response

    async def find_similar(self, embedding: list) -> dict | None:
        # Use vector similarity search (Redis VSS, Pinecone, etc.)
        results = await self.redis.ft_search(
            embedding, k=1, score_threshold=self.threshold
        )
        return results[0] if results else None

For simpler use cases, exact-match caching with hashed prompts can also provide significant benefits:

Python
import hashlib
import json

def get_cache_key(messages: list, model: str) -> str:
    content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()

Streaming Responses

For user-facing applications, waiting 5-10 seconds for a complete response creates poor user experience. Streaming delivers tokens as they're generated, reducing perceived latency from seconds to milliseconds. Users see immediate feedback while the full response generates.

Python
async def stream_response(messages: list):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True
    )
    full_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            full_response += token
            yield token  # Send to client immediately

    # Log complete response for audit
    await audit_logger.log({"response": full_response})

Streaming considerations:

  • Audit logging - Buffer the complete response for compliance logging
  • Validation - Can only validate after stream completes
  • Error handling - Streams can fail mid-response
  • Token counting - Track usage from stream metadata, not content length (see the sketch below)
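
For the token-counting point, here is a minimal sketch assuming the OpenAI SDK's `stream_options={"include_usage": True}` option, which appends a final chunk carrying token usage:

Python
async def stream_with_usage(messages: list):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries token counts
    )
    usage = None
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
        if chunk.usage:  # only populated on the last chunk
            usage = chunk.usage

    # Record actual token consumption instead of estimating from text length
    await audit_logger.log({
        "prompt_tokens": usage.prompt_tokens if usage else None,
        "completion_tokens": usage.completion_tokens if usage else None,
    })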

Monitoring and Observability

Traditional application metrics (CPU, memory, requests/second) are necessary but insufficient for GenAI systems. You need metrics specific to LLM behavior: token consumption (which drives cost), output quality degradation, and prompt-level performance.

Without proper observability, you won't know when prompt changes degrade accuracy, when costs spike due to inefficient prompts, or when latency impacts user experience. Every LLM request should be instrumented.

Key Metrics to Track

| Metric Category | What to Track | Why It Matters |
| --- | --- | --- |
| Performance | Latency (p50, p95, p99) | User experience |
| Quality | Validation failure rate | Model drift detection |
| Cost | Tokens per request | Budget control |
| Reliability | Circuit breaker state | Service health |
| Business | Human review queue size | Operational load |
Python
from prometheus_client import Counter, Histogram, Gauge

# Track requests, latency, and token usage
llm_requests_total = Counter('llm_requests_total', 'Total requests', ['status'])
llm_latency = Histogram('llm_latency_seconds', 'Request latency')
llm_tokens = Histogram('llm_tokens_used', 'Tokens consumed')

# Track quality and business metrics
validation_failures = Counter('validation_failures', 'Failed validations')
review_queue_size = Gauge('review_queue_size', 'Items awaiting review')

LLM-Specific Observability Tools

While Prometheus handles infrastructure metrics, LLM-specific observability tools like Langfuse, LangSmith, or Phoenix provide trace-level visibility into prompt execution. You can see exactly which prompts are slow, which produce validation failures, and how changes to prompts affect quality over time.

Python
import os
from langfuse import Langfuse

langfuse = Langfuse()  # Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY from env

async def classify_with_tracing(messages: list):
    trace = langfuse.trace(
        name="document-classification",
        metadata={
            "prompt_version": prompt_builder.version,
            "code_version": os.getenv("GIT_COMMIT_SHA", "unknown")[:8]
        }
    )
    generation = trace.generation(name="classify", input=messages)
    response = await llm_client.generate(messages)
    generation.end(
        output=response.content,
        usage={
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens
        }
    )
    return response

Alerting

Alerts should trigger on conditions that indicate degraded service or compliance risk. High latency affects user experience. Validation failures suggest model drift or prompt issues. Circuit breakers opening indicate upstream problems that need immediate attention.

Set up alerts for:

YAML
# Example Prometheus alert rules
groups:
  - name: genai_reliability
    rules:
      - alert: HighLLMLatency
        expr: histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m])) > 10
        for: 5m
        annotations:
          summary: "95th percentile LLM latency above 10s"

      - alert: HighValidationFailureRate
        expr: rate(validation_failures_total[5m]) / rate(llm_requests_total[5m]) > 0.1
        for: 10m
        annotations:
          summary: "More than 10% of outputs failing validation"

      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state{service="llm"} == 1
        for: 2m
        annotations:
          summary: "LLM circuit breaker is open"

Compliance and Auditability

In regulated industries, "the model said so" is not an acceptable explanation. Auditors, compliance teams, and regulators need to understand why a system made a particular decision. This requires complete traceability: what input was received, which prompt version was used, what the model returned, and how it was validated.


Complete Audit Trails

Audit logs serve multiple purposes: debugging production issues, regulatory compliance, and improving models with production data. The key is capturing enough information to reconstruct any decision while respecting data retention and privacy requirements.

Every request needs to be traceable:

Python
from sqlalchemy import Column, Integer, String, DateTime, Boolean
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class LLMRequestAudit(Base):
    __tablename__ = 'llm_request_audit'

    id = Column(Integer, primary_key=True)
    timestamp = Column(DateTime, nullable=False)
    request_id = Column(String, unique=True)

    # Track inputs, prompts, model, and outputs
    input_hash = Column(String)
    prompt_version = Column(String)
    model_name = Column(String)
    output_hash = Column(String)

    # Performance and compliance
    latency_ms = Column(Integer)
    tokens_used = Column(Integer)
    data_retention_until = Column(DateTime)
    pii_detected = Column(Boolean)

PII Detection and Redaction

Sending customer PII to third-party LLM APIs creates both privacy risks and compliance liabilities. In the EU, GDPR Article 32 requires organizations to implement "appropriate technical and organizational measures" to protect personal data [5]. Detecting and redacting PII before it reaches the model is essential, especially in industries like finance and healthcare where regulations like GDPR (EU), CCPA (California, US), or HIPAA (US) apply.

PII Categories and Risk Levels

Different types of PII carry different levels of risk and regulatory requirements. Understanding these categories helps you apply appropriate protections:

| PII Category | Examples | Risk Level | Regulatory Impact |
| --- | --- | --- | --- |
| Direct Identifiers | Name, SSN, Passport number | High | GDPR Art. 4(1) (EU), CCPA (California) |
| Financial Data | Credit card, bank account, IBAN | Critical | PCI DSS (global), PSD2 (EU) |
| Contact Information | Email, phone, address | Medium | GDPR (EU), CAN-SPAM (US) |
| Biometric Data | Fingerprints, facial recognition | Critical | GDPR Art. 9 (EU), BIPA (Illinois) |
| Health Information | Medical records, diagnoses | Critical | HIPAA (US), GDPR Art. 9 (EU) |
| Quasi-Identifiers | ZIP code + age + gender | Medium | Can re-identify when combined |
| Sensitive Attributes | Race, religion, political views | High | GDPR Art. 9 (EU), various state laws (US) |
Python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

class PIIHandler:
    # Organize entities by risk level
    CRITICAL_ENTITIES = ["CREDIT_CARD", "IBAN_CODE", "MEDICAL_LICENSE", "US_SSN"]
    HIGH_RISK_ENTITIES = ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"]
    MEDIUM_RISK_ENTITIES = ["LOCATION", "DATE_TIME", "IP_ADDRESS"]

    def __init__(self):
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()

    def detect_and_redact(self, text: str, risk_threshold: str = "medium") -> tuple[str, list]:
        # Select entities based on risk threshold
        entities_to_detect = []
        if risk_threshold in ["critical", "high", "medium"]:
            entities_to_detect.extend(self.CRITICAL_ENTITIES)
        if risk_threshold in ["high", "medium"]:
            entities_to_detect.extend(self.HIGH_RISK_ENTITIES)
        if risk_threshold == "medium":
            entities_to_detect.extend(self.MEDIUM_RISK_ENTITIES)

        results = self.analyzer.analyze(
            text=text,
            entities=entities_to_detect,
            language="en"
        )

        # Use different anonymization strategies based on PII type
        anonymized = self.anonymizer.anonymize(
            text=text,
            analyzer_results=results,
            operators={
                "CREDIT_CARD": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 12, "from_end": True}),
                "EMAIL_ADDRESS": OperatorConfig("hash"),
                "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"})
            }
        )
        return anonymized.text, results

Data Retention

Different types of data require different retention periods based on regulatory requirements and business needs. Under the EU's 5th Anti-Money Laundering Directive, financial institutions must retain customer due diligence data for at least five years [6]. Automated retention policies ensure compliance without manual intervention.

Compliance Reminder: Retention periods vary by jurisdiction and industry. Always consult your legal team for specific requirements. The examples below are illustrative only.

Python
from datetime import datetime, timedelta

class AuditLogger:
    # Example retention periods - adjust based on your jurisdiction and use case
    # EU AML directive: 5 years; tax records: varies by country (6-10 years)
    RETENTION_DAYS = {"financial": 1825, "standard": 730, "operational": 365}

    def log(self, audit_data: dict):
        retention = self.RETENTION_DAYS.get(audit_data.get("classification"), 730)
        audit_data['data_retention_until'] = datetime.now() + timedelta(days=retention)
        self.db.add(LLMRequestAudit(**audit_data))
        self.db.commit()

Data Residency

When using third-party LLM APIs, your data travels to their infrastructure. For regulated industries, this raises critical questions: Where is the data processed? Is it stored? Who can access it? In the EU, GDPR requires that personal data be processed with adequate protections, which may restrict transfers to certain jurisdictions [7]. Similar data localization requirements exist in other regions (e.g., China's PIPL, Russia's data localization law).

Key considerations for LLM data residency:

| Concern | Question to Answer | Mitigation |
| --- | --- | --- |
| Processing Location | Where are API servers located? | Use regional endpoints (Azure OpenAI EU, etc.) |
| Data Retention | Does the provider store prompts/responses? | Review provider data policies, opt out of training |
| Subprocessors | Who else handles the data? | Review provider's subprocessor list |
| Cross-border Transfer | Does data leave your jurisdiction? | Use providers with local presence |
Python
class DataResidencyValidator:
    ALLOWED_REGIONS = ["eu-west-1", "eu-central-1"]  # EU only

    def validate_provider(self, provider_config: dict) -> bool:
        if provider_config["region"] not in self.ALLOWED_REGIONS:
            raise DataResidencyError(
                f"Provider region {provider_config['region']} not in allowed regions"
            )
        if not provider_config.get("data_processing_agreement"):
            raise ComplianceError("DPA required for this provider")
        return True

For highly sensitive data, consider self-hosted models or on-premise solutions that keep data entirely within your infrastructure.
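
As one illustration of that option: inference servers such as vLLM expose an OpenAI-compatible endpoint, so the same client code shown earlier can be pointed at infrastructure you control. This is a sketch; the base URL, credentials, and model name below are placeholders.

Python
from openai import OpenAI

# Point the OpenAI-compatible client at a self-hosted inference server
# (e.g. vLLM's OpenAI-compatible API); prompts never leave your infrastructure.
client = OpenAI(
    base_url="https://llm.internal.example.com/v1",  # placeholder internal endpoint
    api_key="internal-gateway-token",                # or whatever your gateway expects
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model the server hosts
    messages=[{"role": "user", "content": "Classify this document: ..."}],
)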

AI Risk Classification

The EU AI Act introduces a risk-based regulatory framework that categorizes AI systems by their potential impact on fundamental rights and safety [8]. Understanding where your GenAI system falls in this classification determines your compliance obligations.

EU AI Act Risk Categories

| Risk Level | Definition | Examples | Key Requirements |
| --- | --- | --- | --- |
| Unacceptable | Prohibited systems | Social scoring, emotion recognition in workplaces, real-time biometric surveillance | Banned in EU |
| High-Risk | Significant impact on rights/safety | Employment decisions, credit scoring, law enforcement, critical infrastructure | Mandatory conformity assessment, documentation, human oversight |
| Limited Risk | Transparency concerns | Chatbots, emotion recognition (opt-in), deepfakes | Transparency obligations (disclose AI use) |
| Minimal Risk | Low impact | Spam filters, AI-enabled games, recommendation systems | No specific obligations |

GenAI systems in regulated industries typically fall under High-Risk when used for:

  • Credit decisions - Loan approvals, credit scoring (EU AI Act Annex III, Point 5(b))
  • Employment - Recruitment screening, performance evaluation (Annex III, Point 4)
  • Essential services - Healthcare diagnostics, insurance underwriting
  • Law enforcement - Risk assessment tools

High-risk systems require conformity assessments, human oversight, and comprehensive documentation before deployment.

High-Risk System Compliance Checklist

If your GenAI system is classified as high-risk, you must implement:

  1. Risk Management System

    • Identify and analyze known/foreseeable risks
    • Implement mitigation measures
    • Test with representative data
    • Monitor throughout lifecycle
  2. Data Governance

    • Training data quality and relevance
    • Bias detection in datasets
    • Data protection measures (GDPR compliance in EU, applicable privacy laws elsewhere)
  3. Technical Documentation

    • System design and architecture
    • Model cards with performance metrics
    • Intended use and limitations
    • Human oversight measures
  4. Transparency Requirements

    • Users must be informed they're interacting with AI
    • Provide clear information on system capabilities and limitations
    • Document decision-making logic
  5. Human Oversight

    • Humans can override AI decisions
    • Ability to interrupt system operation
    • Humans understand system capabilities
Python
class HighRiskAICompliance:
    """Enforce EU AI Act requirements for high-risk systems."""

    def __init__(self):
        self.risk_register = []
        self.oversight_enabled = True

    async def process_decision(self, input_data: dict, context: dict) -> dict:
        # Document that AI is being used (transparency requirement)
        self.log_ai_disclosure(context["user_id"])

        # Generate AI recommendation
        ai_decision = await self.ai_model.predict(input_data)

        # High-risk systems require human oversight
        if self.requires_human_review(ai_decision):
            ai_decision["status"] = "pending_human_review"
            ai_decision["human_override_available"] = True
            await self.queue_for_human_review(ai_decision)
        else:
            # Log decision with full traceability
            await self.audit_logger.log({
                "decision_id": ai_decision["id"],
                "input_hash": self.hash_input(input_data),
                "output": ai_decision,
                "model_version": self.ai_model.version,
                "risk_assessment": self.assess_risk(ai_decision),
                "human_override_available": True
            })

        return ai_decision

    def requires_human_review(self, decision: dict) -> bool:
        """Determine if human review is required."""
        return (
            decision.get("confidence", 0.0) < 0.85
            or decision.get("risk_score", 0.0) > 0.7
            or decision.get("contradicts_rules", False)
        )

Compliance Reminder: The EU AI Act applies to providers placing AI systems on the EU market and deployers using AI systems in the EU, regardless of where the provider is established. If you serve EU customers or have EU operations, these requirements likely apply to you.

Model Risk Management

In heavily regulated industries like banking, model risk management (MRM) frameworks apply to GenAI systems. In the US, the Federal Reserve's SR 11-7 guidance [9] establishes expectations for model development, implementation, and use. Similar frameworks exist in other jurisdictions (e.g., EBA guidelines in the EU, PRA SS1/23 in the UK). While originally written for traditional statistical models, regulators increasingly expect these principles to apply to AI/ML systems.

Key MRM requirements for GenAI:

  • Model Documentation - Document the model's purpose, design, limitations, and assumptions
  • Independent Validation - Have the model reviewed by parties not involved in development
  • Ongoing Monitoring - Track model performance and drift over time
  • Change Management - Formal processes for prompt and model changes
Python
class ModelRiskDocumentation:
    def generate_model_card(self, model_config: dict) -> dict:
        return {
            "model_name": model_config["name"],
            "version": model_config["version"],
            "intended_use": model_config["purpose"],
            "limitations": [
                "May hallucinate facts",
                "Performance varies with input length",
                "Not validated for languages other than English"
            ],
            "training_data": "Third-party foundation model - training data not disclosed",
            "evaluation_metrics": self.get_latest_evaluation_results(),
            "risk_rating": self.calculate_risk_tier(model_config),
            "approved_by": model_config.get("approval_record"),
            "next_review_date": self.calculate_review_date(model_config)
        }

Bias and Fairness

LLMs can perpetuate or amplify biases present in their training data. In regulated contexts like lending, hiring, or insurance, biased outputs can violate fair treatment regulations. In the EU, the AI Act classifies AI systems used in employment, credit, and essential services as high-risk, requiring bias testing and documentation [8]. In the US, the EEOC and CFPB have issued guidance on AI fairness in employment and lending decisions.

Python
class FairnessEvaluator:
    PROTECTED_ATTRIBUTES = ["gender", "race", "age", "disability"]
    DISPARITY_THRESHOLD = 0.2  # illustrative threshold for acceptable disparity

    async def evaluate_bias(self, model, test_cases: list) -> dict:
        results = {"overall_parity": True, "disparities": []}

        for attribute in self.PROTECTED_ATTRIBUTES:
            # Test with demographic variations
            group_results = await self.test_across_groups(model, test_cases, attribute)

            # Check for statistical parity
            disparity = self.calculate_disparity(group_results)
            if disparity > self.DISPARITY_THRESHOLD:
                results["overall_parity"] = False
                results["disparities"].append({
                    "attribute": attribute,
                    "disparity_ratio": disparity,
                    "details": group_results
                })

        return results

Bias mitigation strategies:

  • Test across demographics - Evaluate outputs for different demographic groups
  • Prompt engineering - Include fairness instructions in system prompts
  • Output filtering - Flag or block potentially discriminatory outputs
  • Human review - Route high-stakes decisions through human oversight
  • Regular audits - Periodic fairness evaluations with updated test sets

Testing Production GenAI

Testing GenAI systems is fundamentally different from testing traditional software. The LLM itself is non-deterministic, but the components around it (prompt loaders, validators, fallback handlers) are fully testable. The key is knowing what to test and how.

Testing Strategy Overview

| Test Type | What to Test | Deterministic? | Execution Speed |
| --- | --- | --- | --- |
| Unit Tests | Prompt loading, validation logic, parsers | Yes | Fast (ms) |
| Integration Tests | Component interaction with mocked LLM | Yes | Fast (ms) |
| Evaluation Pipeline | Golden datasets, LLM-as-judge, regression detection, A/B testing | No | Slow (min) |

Unit Tests for Deterministic Components

While you can't reliably unit test LLM outputs, you can test everything else: prompt construction, output parsing, validation logic, and error handling. These tests run fast and catch regressions in your infrastructure code.

Python
import pytest

def test_prompt_loader():
    loader = PromptLoader()
    messages = loader.build_messages("classifier", {"document_text": "Test doc"})
    assert "classification system" in messages[0]["content"]
    assert len(loader.load("classifier")['version']) == 8

def test_output_validator():
    with pytest.raises(ValidationError):
        OutputValidator().validate("INVALID_CATEGORY")

Integration Tests with LLM Mocking

Integration tests verify that components work together correctly without relying on actual LLM API calls. Mocking LLM responses lets you test error handling, validation logic, and business workflows deterministically.

Python
from unittest.mock import patch

import pytest

@pytest.mark.asyncio
async def test_classification_flow():
    mock_response = MockLLMResponse(content="INVOICE")
    with patch('llm_client.generate', return_value=mock_response):
        result = await classifier.classify({"text": "Invoice #12345"})
        assert result["category"] == "INVOICE"

Building a Robust Evaluation Pipeline

Production GenAI systems need continuous, multi-dimensional evaluation. Unlike traditional ML where accuracy on a test set is often sufficient, LLM outputs require evaluating multiple quality dimensions: correctness, relevance, coherence, safety, and task-specific criteria.

Golden Datasets

The foundation of any evaluation setup is a curated dataset with known correct answers. Golden datasets let you measure accuracy, detect regressions when prompts change, and compare different models or approaches systematically.

Python
import pandas as pd
from sklearn.metrics import classification_report

async def evaluate_on_golden_set(classifier, golden_set_path: str) -> dict:
    test_data = pd.read_csv(golden_set_path)

    predictions = []
    for _, row in test_data.iterrows():
        result = await classifier.classify({"text": row["text"]})
        predictions.append(result["category"])

    report = classification_report(
        test_data["expected_category"], predictions, output_dict=True
    )
    return {
        "accuracy": report["accuracy"],
        "per_class_metrics": report,
        "num_samples": len(test_data)
    }

Tips for building golden datasets:

  • Start small, grow iteratively - Begin with 50-100 high-quality examples, expand based on production edge cases
  • Include edge cases - Add examples that have caused issues in production
  • Version your datasets - Track changes as you add or modify examples (a sketch follows this list)
  • Balance across categories - Ensure sufficient coverage of all expected output types
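
One lightweight way to version a golden set, mirroring the content-hash approach used for prompts earlier (the file path is illustrative):

Python
import hashlib

def golden_set_version(path: str = "eval/golden_set.csv") -> str:
    # The hash changes whenever examples are added or edited, so evaluation
    # results can be linked to an exact dataset version in audit logs.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:8]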

Evaluation Dimensions

| Dimension | What It Measures | Evaluation Method |
| --- | --- | --- |
| Correctness | Factual accuracy | Ground truth comparison, fact-checking |
| Relevance | Response addresses the query | LLM-as-judge, semantic similarity |
| Coherence | Logical flow, readability | LLM-as-judge, human review |
| Safety | No harmful/toxic content | Guardrail checks, toxicity classifiers |
| Consistency | Same input → similar output | Multiple runs, variance analysis |
| Latency | Response time | Percentile tracking (p50, p95, p99) |

LLM-as-Judge for Subjective Metrics

For dimensions like relevance and coherence, human evaluation doesn't scale. LLM-as-judge uses a separate model to evaluate outputs against defined criteria. This approach, while not perfect, correlates well with human judgments when prompts are carefully designed [10].

Python
import json

class LLMEvaluator:
    # Double braces keep the JSON example literal when .format() is applied
    EVALUATION_PROMPT = """
    Evaluate the following response on a scale of 1-5 for each criterion.

    Query: {query}
    Response: {response}

    Criteria:
    - Relevance: Does the response address the query?
    - Accuracy: Is the information factually correct?
    - Completeness: Does it cover all aspects of the query?

    Return JSON: {{"relevance": int, "accuracy": int, "completeness": int, "reasoning": str}}
    """

    async def evaluate(self, query: str, response: str) -> dict:
        result = await self.judge_model.generate(
            self.EVALUATION_PROMPT.format(query=query, response=response)
        )
        return json.loads(result)

Tip: Use a different model family for evaluation than for generation to avoid self-preference bias. If you generate with GPT-4o, consider evaluating with Claude, or vice versa.

Continuous Evaluation Pipeline

Evaluation shouldn't be a one-time activity. Set up automated pipelines that run on every prompt change, model update, or on a regular schedule to catch regressions early.

Python
from datetime import datetime

class EvaluationPipeline:
    def __init__(self):
        self.metrics_store = MetricsStore()
        self.alert_threshold = 0.05  # 5% regression threshold

    async def run_evaluation(self, prompt_version: str) -> dict:
        # Load evaluation dataset
        eval_data = self.load_eval_dataset()
        results = {
            "prompt_version": prompt_version,
            "timestamp": datetime.now().isoformat(),
            "metrics": {}
        }

        # Run predictions
        predictions = await self.batch_predict(eval_data)

        # Calculate metrics across dimensions
        results["metrics"]["accuracy"] = self.calculate_accuracy(predictions, eval_data)
        results["metrics"]["avg_latency_ms"] = self.calculate_latency(predictions)
        results["metrics"]["llm_judge_scores"] = await self.run_llm_evaluation(predictions)

        # Compare against baseline
        baseline = self.metrics_store.get_baseline()
        regressions = self.detect_regressions(results["metrics"], baseline)

        if regressions:
            await self.alert_team(f"Regression detected: {regressions}")
            results["status"] = "regression_detected"
        else:
            results["status"] = "passed"

        self.metrics_store.save(results)
        return results

    def detect_regressions(self, current: dict, baseline: dict) -> list:
        regressions = []
        for metric, value in current.items():
            if isinstance(value, (int, float)):
                baseline_value = baseline.get(metric, value)
                if value < baseline_value * (1 - self.alert_threshold):
                    regressions.append(f"{metric}: {baseline_value:.2f} → {value:.2f}")
        return regressions

A/B Testing Prompts

When iterating on prompts, A/B testing lets you compare variants on real traffic before fully rolling out changes. This catches issues that evaluation datasets miss and provides statistical confidence in improvements.

Python
import hashlib

class PromptABTest:
    def __init__(self, control_prompt: str, variant_prompt: str, traffic_split: float = 0.1):
        self.control = control_prompt
        self.variant = variant_prompt
        self.variant_traffic = traffic_split  # 10% to variant

    async def get_prompt(self, request_id: str) -> tuple[str, str]:
        # Deterministic assignment based on request_id
        bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
        if bucket < self.variant_traffic * 100:
            return self.variant, "variant"
        return self.control, "control"

    async def analyze_results(self) -> dict:
        control_metrics = await self.metrics_store.get_metrics(group="control")
        variant_metrics = await self.metrics_store.get_metrics(group="variant")
        return {
            "control": control_metrics,
            "variant": variant_metrics,
            "improvement": self.calculate_significance(control_metrics, variant_metrics)
        }

Key principles for robust evaluation:

  • Evaluate before deploying - Block deployments that fail evaluation thresholds (see the gate sketch after this list)
  • Track trends over time - Single-point metrics miss gradual degradation
  • Combine automated and human review - Use humans for edge cases and calibration
  • Version everything - Link evaluation results to specific prompt and model versions
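
As a sketch of the first principle, a CI deployment gate could wrap the EvaluationPipeline above and fail the build on regression; the exit-code convention and command-line argument are assumptions:

Python
import asyncio
import sys

async def deployment_gate(prompt_version: str) -> int:
    # Run the evaluation suite for the candidate prompt version and return a
    # non-zero exit code so CI blocks the deployment when a regression is detected.
    pipeline = EvaluationPipeline()
    results = await pipeline.run_evaluation(prompt_version)
    return 0 if results["status"] == "passed" else 1

if __name__ == "__main__":
    sys.exit(asyncio.run(deployment_gate(sys.argv[1])))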

Common Pitfalls

Key Insight: The following pitfalls are based on real production deployments. Avoiding these can save months of rework and significant cost.

Over-reliance on Model Performance

Problem: Focusing only on accuracy while ignoring latency, cost, and reliability.

I've seen teams celebrate 95% accuracy while ignoring that their p95 latency is 15 seconds or that token costs are 10x their budget. In production, a system that's 90% accurate, costs 100 USD/month, and responds in 2 seconds is often better than one that's 95% accurate, costs 2,000 USD/month, and takes 8 seconds.

Solution: Track composite metrics that matter to the business:

Python
# Business-level SLA
sla_score = (
    0.4 * accuracy +                         # Correctness matters
    0.3 * (1 - p95_latency / max_latency) +  # Speed matters
    0.2 * availability +                     # Uptime matters
    0.1 * (1 - cost / budget)                # Cost matters
)

Ignoring Prompt Injection Risks

Problem: Users can manipulate outputs through carefully crafted inputs.

Prompt injection attacks, where users craft inputs to manipulate LLM behavior, are an active area of security research. The OWASP Top 10 for LLM Applications lists prompt injection as the #1 risk [11]. Unlike SQL injection, there's no perfect defense - prompts and user inputs exist in the same context space.

Security Note: There is no foolproof defense against prompt injection. String pattern matching and input sanitization are easily bypassed. Focus on architectural defenses and output validation.

Solution: Defense in depth approach - no single mitigation is sufficient, but layering defenses reduces risk (a minimal sketch follows the list below):

Note: Prompt injection is an evolving threat. Research from Simon Willison and others demonstrates that string pattern matching is insufficient [12]. Focus on:

  • Proper role separation (system vs user messages)
  • Output validation and structured outputs
  • Monitoring for anomalous behavior
  • Human review for high-stakes decisions
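
A minimal sketch of the first two layers (role separation plus structured output validation); `llm_client` and `queue_for_human_review` are assumed helpers, and this reduces risk rather than eliminating it:

Python
from pydantic import BaseModel, ConfigDict, ValidationError

class SafeClassification(BaseModel):
    model_config = ConfigDict(extra="forbid")  # reject unexpected fields
    category: str
    confidence: float

ALLOWED_CATEGORIES = {"INVOICE", "CONTRACT", "REPORT", "OTHER"}

async def classify_untrusted(document_text: str):
    # Instructions live only in the system role; untrusted text is passed
    # as user content and never concatenated into the system prompt.
    messages = [
        {"role": "system", "content": "Classify the document. Output JSON only."},
        {"role": "user", "content": document_text},
    ]
    response = await llm_client.generate(messages)

    # Strict schema validation: if the model was steered into returning
    # prose, instructions, or extra fields, this fails and we escalate.
    try:
        result = SafeClassification.model_validate_json(response.content)
    except ValidationError:
        return await queue_for_human_review(document_text)

    if result.category not in ALLOWED_CATEGORIES:
        return await queue_for_human_review(document_text)
    return result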

No Feedback Loop

Problem: No mechanism to improve the system based on production data.

Your model's accuracy on a test set is interesting. Its accuracy on real user queries is what matters. Without collecting feedback from production use (whether explicit thumbs up/down or implicit signals like user corrections), you're flying blind. Production data reveals edge cases your test set missed.

Solution: Implement human feedback collection:

Python
class FeedbackCollector:
    async def collect(self, request_id: str, feedback: dict):
        await self.db.feedbacks.insert_one({
            "request_id": request_id,
            "rating": feedback["rating"],
            "correct": feedback.get("correct")
        })
        if await self.calculate_recent_accuracy() < 0.85:
            await self.trigger_retraining()

Underestimating Costs

Problem: Token usage spirals out of control in production.

What costs 5 dollars in development can cost 5,000 dollars in production when you're processing thousands of requests per day. Long prompts, verbose outputs, and unnecessary API calls add up quickly. Cost monitoring needs to be first-class, not an afterthought.

Solution: Implement cost tracking and budgets:

Python
class CostController:
    async def check_budget(self, user_id: str):
        usage = await self.get_monthly_usage(user_id)
        budget = await self.get_user_budget(user_id)

        if usage >= budget * 0.9:
            await self.send_alert(user_id, "approaching_budget")
        if usage >= budget:
            raise BudgetExceededError(f"Budget exceeded: {usage}/{budget}")

Conclusion

Building production GenAI systems in regulated industries requires more than prompt engineering. It demands a comprehensive approach across architecture, reliability, observability, compliance, and testing.

Key Takeaways

| Priority | Focus Area | Why It Matters |
| --- | --- | --- |
| Architecture | Separation of concerns, structured outputs, YAML prompts | Maintainability at scale |
| Reliability | Circuit breakers, multi-provider, caching, streaming | Service availability |
| Observability | Metrics, tracing, alerts | Detect issues early |
| Compliance | Audit logs, PII handling, data residency, bias testing | Legal requirements |
| Risk Management | Model documentation, validation, change control | Regulatory expectations |
| Testing & Evaluation | Unit tests, integration, LLM-as-judge, A/B testing, regression detection | Confidence in changes |
| Cost Control | Token tracking, caching, budgets | Financial sustainability |

The excitement of GenAI capabilities should never overshadow the fundamentals of production systems: reliability, observability, and security.

If you're building GenAI systems in regulated environments, focus on making the boring stuff excellent. The AI will only be as valuable as the infrastructure supporting it.

References
  1. OpenAI. (2025). "Structured Outputs." https://platform.openai.com/docs/guides/structured-outputs

  2. Instructor. (2025). "Structured outputs powered by LLMs." https://github.com/jxnl/instructor

  3. Nygard, M. (2018). "Release It! Design and Deploy Production-Ready Software" (2nd ed.). Pragmatic Bookshelf.

  4. Zilliz. (2025). "Semantic Cache: A Guide to Cache Optimization for LLMs." https://zilliz.com/learn/semantic-cache

  5. European Parliament and Council. (2016). "General Data Protection Regulation (GDPR) - Article 32: Security of processing." https://gdpr-info.eu/art-32-gdpr/

  6. European Parliament and Council. (2018). "Directive (EU) 2018/843 - Fifth Anti-Money Laundering Directive." https://eur-lex.europa.eu/eli/dir/2018/843/oj

  7. European Parliament and Council. (2016). "General Data Protection Regulation (GDPR) - Chapter V: Transfers of personal data to third countries." https://gdpr-info.eu/chapter-5/

  8. European Parliament and Council. (2024). "Regulation (EU) 2024/1689 - Artificial Intelligence Act." https://eur-lex.europa.eu/eli/reg/2024/1689/oj

  9. Board of Governors of the Federal Reserve System (US). (2011). "SR 11-7: Guidance on Model Risk Management." https://www.federalreserve.gov/supervisionreg/srletters/sr1107.htm

  10. Zheng, L., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. https://arxiv.org/abs/2306.05685

  11. OWASP Foundation. (2025). "OWASP Top 10 for Large Language Model Applications." https://genai.owasp.org/

  12. Willison, S. (2023-2025). "Prompt injection: What's the worst that can happen?" and ongoing research. https://simonwillison.net/series/prompt-injection/