Building Production GenAI Systems in Regulated Industries: A Technical Guide

After spending years building and operating GenAI systems in regulated industries like fintech, I've learned that the path from proof-of-concept to production is filled with challenges that have nothing to do with model performance. This guide shares the architecture patterns, monitoring strategies, and compliance considerations that actually matter when deploying GenAI at scale.
The Production Reality Gap
Most GenAI demos are impressive. Most production deployments face these realities:
- Hallucinations happen - Even the best models produce incorrect outputs
- Latency varies - From 1s to minutes depending on load and complexity
- Costs scale non-linearly - Token usage can spike unpredictably
- Compliance is mandatory - Every decision needs an audit trail
- Security is paramount - PII, trade secrets, and sensitive data flow through prompts
In regulated environments, these aren't just engineering challenges - they're business and legal requirements.
Core Architecture Principles
Separation of Concerns
The temptation with GenAI is to throw everything into a single function: "user input goes in, LLM output comes out." This works for demos but fails in production. Each responsibility (preprocessing, prompt construction, LLM calls, validation, logging) should be isolated and testable independently.
Your GenAI application should be decomposed into clear layers:
class GenAIClassifier:
    async def classify(self, input_data, context):
        # 1. Preprocess and sanitize input
        processed = self.preprocessor.process(input_data)

        # 2. Build prompt with automatic version tracking
        prompt = self.prompt_builder.build(processed)

        # 3. Call LLM with error handling
        try:
            response = await self.llm_client.generate(prompt)
        except (TimeoutError, RateLimitError) as e:
            return self.fallback_handler.handle(input_data, e)

        # 4. Validate output
        validated = self.validator.validate(response)

        # 5. Log for audit
        self.audit_logger.log({
            "prompt_version": self.prompt_builder.version,
            "output": validated,
            "latency_ms": response.latency
        })
        return validated

Prompt Management as Code
In my experience, prompts change more frequently than code. Product managers want to tweak wording. Compliance teams need to add disclaimers. Testing reveals edge cases that require prompt adjustments. Hardcoding prompts into Python strings creates friction and makes collaboration difficult.
Store prompts in YAML files for easier management and collaboration:
# prompts/classifier.yaml
name: document_classifier
description: Classification prompt for financial documents
system_message: |
  You are a classification system for financial documents.
  Your task is to categorize documents into exactly one of these categories:
  - INVOICE
  - CONTRACT
  - REPORT
  - OTHER
  Requirements:
  - Output ONLY the category name
  - If uncertain, output OTHER
  - Do not include explanations
user_template: |
  Document: {document_text}
parameters:
  temperature: 0.1
  max_tokens: 50

A lightweight loader can then read these files and derive a prompt version from the content hash:

import hashlib
import yaml

class PromptLoader:
    def load(self, prompt_name: str):
        with open(f"prompts/{prompt_name}.yaml") as f:
            content = f.read()
        config = yaml.safe_load(content)
        # Auto-generate version from content hash
        config['version'] = hashlib.sha256(content.encode()).hexdigest()[:8]
        return config

    def build_messages(self, prompt_name: str, variables: dict):
        config = self.load(prompt_name)
        return [
            {"role": "system", "content": config['system_message']},
            {"role": "user", "content": config['user_template'].format(**variables)}
        ]

Benefits of YAML-based prompts:
- Non-developers can review and edit prompts
- Clean diffs in version control
- Metadata and parameters alongside the prompt
- Automatic versioning from file content hash
- Easy A/B testing with multiple YAML files
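For illustration, here is how the loader above might be used at request time; the file name mirrors the classifier.yaml example and the document text is made up:

loader = PromptLoader()

# Build the chat messages for one document; the text is illustrative
messages = loader.build_messages("classifier", {"document_text": "Invoice #12345 for consulting services"})

# The content hash serves as the prompt version recorded in audit logs
prompt_version = loader.load("classifier")["version"]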
Defense in Depth
LLMs are probabilistic. Even with temperature set to 0, they can produce unexpected outputs. Schema changes, hallucinations, or adversarial inputs can all bypass simple validation. Your validation strategy should assume that every layer might fail and implement multiple independent checks.
Modern production systems implement defense in depth using guardrails - programmatic policies that enforce safety, compliance, and quality constraints on both inputs and outputs in real-time:
class DefenseInDepthValidator:
    def __init__(self):
        self.pii_detector = PIIDetector()
        self.toxicity_checker = ToxicityChecker()
        self.schema_validator = SchemaValidator(ClassificationResult)

    async def validate(self, user_input: str, llm_output: str) -> dict:
        # Layer 1: Input validation - detect and redact PII
        if self.pii_detector.contains_pii(user_input):
            user_input = self.pii_detector.redact(user_input)

        # Layer 2: Safety check - block toxic content
        if self.toxicity_checker.is_toxic(user_input):
            raise SafetyError("Input contains toxic content")

        # Layer 3: Schema validation - ensure output structure
        validated_output = self.schema_validator.validate(llm_output)

        # Layer 4: Business rules validation
        if not self.passes_business_rules(validated_output):
            raise ValidationError("Business rule validation failed")

        # Layer 5: Compliance checks
        if not self.passes_compliance_checks(validated_output):
            raise ComplianceError("Output violates compliance rules")

        return validated_output

Key guardrail categories for regulated environments:
| Layer | Purpose | Example Tools |
|---|---|---|
| Safety | Block toxic/harmful content | Guardrails AI, Llama Guard, NeMo Guardrails |
| Privacy | Prevent PII leakage | Presidio, custom validators |
| Compliance | Enforce regulatory constraints | Custom validators, domain rules |
| Quality | Ensure accuracy/relevance | FactualConsistency checks |
| Schema | Validate structure | Pydantic, JSON Schema |
Example compliance guardrail for financial services:
class FinancialComplianceGuardrail:
    """Prevent outputs that could constitute unauthorized financial advice."""

    PROHIBITED_PHRASES = [
        "you should invest",
        "guaranteed returns",
        "risk-free investment"
    ]

    def validate(self, output: str) -> tuple[bool, str]:
        for phrase in self.PROHIBITED_PHRASES:
            if phrase.lower() in output.lower():
                self.metrics.increment("compliance_violation")
                return False, f"Prohibited financial advice detected: '{phrase}'"
        return True, output

Structured Outputs
Free-form text responses are unreliable for production systems. When you need to extract specific fields or make decisions based on LLM output, parsing unstructured text is fragile and error-prone. Modern LLM APIs support JSON mode and function calling, which constrain outputs to valid schemas [1].
from pydantic import BaseModel
from openai import OpenAI

class ClassificationResult(BaseModel):
    category: str
    confidence: float
    reasoning: str

client = OpenAI()

def classify_with_schema(document_text: str) -> ClassificationResult:
    # parse() enforces the Pydantic schema via the API's structured-output mode
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Classify the document."},
            {"role": "user", "content": document_text}
        ],
        response_format=ClassificationResult
    )
    return response.choices[0].message.parsed

Benefits of structured outputs:
- Guaranteed valid JSON - No parsing failures from malformed responses
- Type safety - Pydantic validation catches schema violations
- Better prompting - The schema itself guides the model's output
- Simplified code - No regex or string parsing needed
For providers without native structured output support, libraries like Instructor [2] provide similar functionality across different LLM APIs.
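As a rough sketch of that approach, Instructor patches the provider client so a Pydantic model can be passed directly. This assumes Instructor's from_openai wrapper and reuses the ClassificationResult model from above; check the library's documentation for the current API:

import instructor
from openai import OpenAI

# Patch the OpenAI client so create() accepts a response_model
instructor_client = instructor.from_openai(OpenAI())

def classify_with_instructor(document_text: str) -> ClassificationResult:
    # Instructor validates the response against the Pydantic model
    # (and can retry when validation fails)
    return instructor_client.chat.completions.create(
        model="gpt-4o",
        response_model=ClassificationResult,
        messages=[
            {"role": "system", "content": "Classify the document."},
            {"role": "user", "content": document_text},
        ],
    )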
Building for Reliability
In production, LLM APIs will fail. Rate limits get hit during traffic spikes. Upstream services degrade. Latency increases under load. Your system needs to gracefully handle these failures without cascading to dependent services or degrading the entire application.
Circuit Breakers
When an LLM API starts failing, continuing to send requests makes things worse. Circuit breakers detect repeated failures and stop making requests temporarily, giving the upstream service time to recover while protecting your application from timeout accumulation. This pattern, popularized by Michael Nygard in "Release It!" [3], is essential for resilient distributed systems.
Prevent cascade failures when external services (LLM APIs) degrade:
from circuitbreaker import circuit

class LLMClient:
    @circuit(failure_threshold=5, recovery_timeout=60)
    async def generate(self, messages, **kwargs):
        return await self.api_client.chat.completions.create(
            messages=messages, **kwargs
        )

Graceful Degradation
When the LLM is unavailable, returning an error to the user is often the wrong choice. Depending on your use case, you might have rule-based fallbacks, simplified prompts, or human review queues. The key is ensuring your application continues to provide value even when the AI component fails.
Have meaningful fallbacks ready:
class FallbackHandler:
    def handle(self, input_data, error):
        if isinstance(error, TimeoutError):
            return self.try_simplified_prompt(input_data)
        elif isinstance(error, RateLimitError):
            return self.rule_based_classifier(input_data)
        else:
            return self.queue_for_human_review(input_data)

Multi-Provider Redundancy
Relying on a single LLM provider is a single point of failure. Provider outages, rate limits, and regional issues can take down your entire application. A multi-provider strategy ensures continuity when your primary provider has issues.
class MultiProviderLLMClient:
    def __init__(self):
        self.providers = [
            {"name": "openai", "client": OpenAIClient(), "priority": 1},        # Primary
            {"name": "anthropic", "client": AnthropicClient(), "priority": 2},  # Fallback
            {"name": "azure", "client": AzureOpenAIClient(), "priority": 3},    # Last resort
        ]

    async def generate(self, messages: list, **kwargs):
        for provider in sorted(self.providers, key=lambda x: x["priority"]):
            try:
                # Returns immediately on success - no other providers called
                return await provider["client"].generate(messages, **kwargs)
            except (RateLimitError, ServiceUnavailableError):
                # Only falls back to next provider on specific failures
                self.metrics.increment(f"provider_fallback_{provider['name']}")
                continue  # Try next provider
        raise AllProvidersFailedError("All LLM providers unavailable")

Considerations for multi-provider setups:
- Prompt compatibility - Different providers may need slightly different prompts
- Response normalization - Standardize response formats across providers
- Cost implications - Fallback providers may have different pricing
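A minimal sketch of response normalization, assuming OpenAI- and Anthropic-style response objects; the NormalizedResponse type is hypothetical:

from dataclasses import dataclass

@dataclass
class NormalizedResponse:
    content: str
    input_tokens: int
    output_tokens: int
    provider: str

def normalize_openai(resp) -> NormalizedResponse:
    # OpenAI-style chat completion: choices + usage.prompt/completion_tokens
    return NormalizedResponse(
        content=resp.choices[0].message.content,
        input_tokens=resp.usage.prompt_tokens,
        output_tokens=resp.usage.completion_tokens,
        provider="openai",
    )

def normalize_anthropic(resp) -> NormalizedResponse:
    # Anthropic Messages API: content blocks + usage.input/output_tokens
    return NormalizedResponse(
        content=resp.content[0].text,
        input_tokens=resp.usage.input_tokens,
        output_tokens=resp.usage.output_tokens,
        provider="anthropic",
    )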
Semantic Caching
LLM API calls are expensive and slow. Many production workloads have significant query overlap where similar questions should return similar answers. Semantic caching uses embeddings to identify similar queries and return cached responses, reducing costs by 30-70% in high-traffic scenarios [4].
import hashlib
import numpy as np
from redis import Redis

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.redis = Redis()
        self.threshold = similarity_threshold
        self.embedding_client = EmbeddingClient()

    async def get_or_generate(self, query: str, generate_fn):
        # Generate embedding for the query
        query_embedding = await self.embedding_client.embed(query)

        # Check for semantically similar cached queries
        cached = await self.find_similar(query_embedding)
        if cached:
            self.metrics.increment("cache_hit")
            return cached["response"]

        # Generate new response
        response = await generate_fn(query)

        # Cache with embedding for future similarity matching
        await self.store(query, query_embedding, response)
        return response

    async def find_similar(self, embedding: list) -> dict | None:
        # Use vector similarity search (Redis VSS, Pinecone, etc.)
        results = await self.redis.ft_search(
            embedding, k=1, score_threshold=self.threshold
        )
        return results[0] if results else None

For simpler use cases, exact-match caching with hashed prompts can also provide significant benefits:
import json

def get_cache_key(messages: list, model: str) -> str:
    content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()
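A usage sketch for the exact-match variant, assuming a Redis instance and the llm_client from earlier examples; the TTL and the model keyword are illustrative:

from redis import Redis

redis_client = Redis()
CACHE_TTL_SECONDS = 3600  # illustrative - tune to how quickly answers go stale

async def cached_generate(messages: list, model: str) -> str:
    key = get_cache_key(messages, model)
    cached = redis_client.get(key)
    if cached is not None:
        return cached.decode()

    response = await llm_client.generate(messages, model=model)
    # Cache only the text needed downstream, with an expiry
    redis_client.setex(key, CACHE_TTL_SECONDS, response.content)
    return response.content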
Streaming Responses

For user-facing applications, waiting 5-10 seconds for a complete response creates a poor user experience. Streaming delivers tokens as they're generated, reducing perceived latency from seconds to milliseconds. Users see immediate feedback while the full response generates.
async def stream_response(messages: list):
    # Assumes an async OpenAI client (e.g. AsyncOpenAI)
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True
    )
    full_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            full_response += token
            yield token  # Send to client immediately

    # Log complete response for audit
    await audit_logger.log({"response": full_response})

Streaming considerations:
- Audit logging - Buffer the complete response for compliance logging
- Validation - Can only validate after stream completes
- Error handling - Streams can fail mid-response
- Token counting - Track usage from stream metadata, not content length
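To address the token-counting point, OpenAI-style streaming can be asked to append a usage chunk; a sketch assuming an async client and the stream_options flag (record_token_usage is a placeholder for your metrics call; verify support for your provider and SDK version):

async def stream_with_usage(messages: list):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
        stream_options={"include_usage": True},  # ask for a final usage chunk
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
        if chunk.usage:
            # The final chunk carries token counts; record them in your metrics layer
            record_token_usage(chunk.usage.prompt_tokens, chunk.usage.completion_tokens)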
Monitoring and Observability
Traditional application metrics (CPU, memory, requests/second) are necessary but insufficient for GenAI systems. You need metrics specific to LLM behavior: token consumption (which drives cost), output quality degradation, and prompt-level performance.
Without proper observability, you won't know when prompt changes degrade accuracy, when costs spike due to inefficient prompts, or when latency impacts user experience. Every LLM request should be instrumented.
Key Metrics to Track
| Metric Category | What to Track | Why It Matters |
|---|---|---|
| Performance | Latency (p50, p95, p99) | User experience |
| Quality | Validation failure rate | Model drift detection |
| Cost | Tokens per request | Budget control |
| Reliability | Circuit breaker state | Service health |
| Business | Human review queue size | Operational load |
from prometheus_client import Counter, Histogram, Gauge
# Track requests, latency, and token usage
llm_requests_total = Counter('llm_requests_total', 'Total requests', ['status'])
llm_latency = Histogram('llm_latency_seconds', 'Request latency')
llm_tokens = Histogram('llm_tokens_used', 'Tokens consumed')
# Track quality and business metrics
validation_failures = Counter('validation_failures', 'Failed validations')
review_queue_size = Gauge('review_queue_size', 'Items awaiting review')

LLM-Specific Observability Tools
While Prometheus handles infrastructure metrics, LLM-specific observability tools like Langfuse, LangSmith, or Phoenix provide trace-level visibility into prompt execution. You can see exactly which prompts are slow, which produce validation failures, and how changes to prompts affect quality over time.
import os
from langfuse import Langfuse

langfuse = Langfuse()  # Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY from env

async def classify_with_tracing(input_data: dict):
    trace = langfuse.trace(
        name="document-classification",
        metadata={
            "prompt_version": prompt_builder.version,
            "code_version": os.getenv("GIT_COMMIT_SHA", "unknown")[:8]
        }
    )
    generation = trace.generation(name="classify", input=messages)
    response = await llm_client.generate(messages)
    generation.end(output=response.content, usage={
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens
    })
    return response

Alerting
Alerts should trigger on conditions that indicate degraded service or compliance risk. High latency affects user experience. Validation failures suggest model drift or prompt issues. Circuit breakers opening indicate upstream problems that need immediate attention.
Set up alerts for:
# Example Prometheus alert rules
groups:
  - name: genai_reliability
    rules:
      - alert: HighLLMLatency
        expr: histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m])) > 10
        for: 5m
        annotations:
          summary: "95th percentile LLM latency above 10s"
      - alert: HighValidationFailureRate
        expr: rate(validation_failures_total[5m]) / rate(llm_requests_total[5m]) > 0.1
        for: 10m
        annotations:
          summary: "More than 10% of outputs failing validation"
      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state{service="llm"} == 1
        for: 2m
        annotations:
          summary: "LLM circuit breaker is open"

Compliance and Auditability
In regulated industries, "the model said so" is not an acceptable explanation. Auditors, compliance teams, and regulators need to understand why a system made a particular decision. This requires complete traceability: what input was received, which prompt version was used, what the model returned, and how it was validated.
Complete Audit Trails
Audit logs serve multiple purposes: debugging production issues, regulatory compliance, and improving models with production data. The key is capturing enough information to reconstruct any decision while respecting data retention and privacy requirements.
Every request needs to be traceable:
from sqlalchemy import Column, Integer, String, DateTime, Boolean
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class LLMRequestAudit(Base):
    __tablename__ = 'llm_request_audit'

    id = Column(Integer, primary_key=True)
    timestamp = Column(DateTime, nullable=False)
    request_id = Column(String, unique=True)

    # Track inputs, prompts, model, and outputs
    input_hash = Column(String)
    prompt_version = Column(String)
    model_name = Column(String)
    output_hash = Column(String)

    # Performance and compliance
    latency_ms = Column(Integer)
    tokens_used = Column(Integer)
    data_retention_until = Column(DateTime)
    pii_detected = Column(Boolean)

PII Detection and Redaction
Sending customer PII to third-party LLM APIs creates both privacy risks and compliance liabilities. In the EU, GDPR Article 32 requires organizations to implement "appropriate technical and organizational measures" to protect personal data [5]. Detecting and redacting PII before it reaches the model is essential, especially in industries like finance and healthcare where regulations like GDPR (EU), CCPA (California, US), or HIPAA (US) apply.
PII Categories and Risk Levels
Different types of PII carry different levels of risk and regulatory requirements. Understanding these categories helps you apply appropriate protections:
| PII Category | Examples | Risk Level | Regulatory Impact |
|---|---|---|---|
| Direct Identifiers | Name, SSN, Passport number | High | GDPR Art. 4(1) (EU), CCPA (California) |
| Financial Data | Credit card, bank account, IBAN | Critical | PCI DSS (global), PSD2 (EU) |
| Contact Information | Email, phone, address | Medium | GDPR (EU), CAN-SPAM (US) |
| Biometric Data | Fingerprints, facial recognition | Critical | GDPR Art. 9 (EU), BIPA (Illinois) |
| Health Information | Medical records, diagnoses | Critical | HIPAA (US), GDPR Art. 9 (EU) |
| Quasi-Identifiers | ZIP code + age + gender | Medium | Can re-identify when combined |
| Sensitive Attributes | Race, religion, political views | High | GDPR Art. 9 (EU), various state laws (US) |
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

class PIIHandler:
    # Organize entities by risk level
    CRITICAL_ENTITIES = ["CREDIT_CARD", "IBAN_CODE", "MEDICAL_LICENSE", "US_SSN"]
    HIGH_RISK_ENTITIES = ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"]
    MEDIUM_RISK_ENTITIES = ["LOCATION", "DATE_TIME", "IP_ADDRESS"]

    def __init__(self):
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()

    def detect_and_redact(self, text: str, risk_threshold: str = "medium") -> tuple[str, list]:
        # Select entities based on risk threshold
        entities_to_detect = []
        if risk_threshold in ["critical", "high", "medium"]:
            entities_to_detect.extend(self.CRITICAL_ENTITIES)
        if risk_threshold in ["high", "medium"]:
            entities_to_detect.extend(self.HIGH_RISK_ENTITIES)
        if risk_threshold == "medium":
            entities_to_detect.extend(self.MEDIUM_RISK_ENTITIES)

        results = self.analyzer.analyze(
            text=text,
            entities=entities_to_detect,
            language="en"
        )

        # Use different anonymization strategies based on PII type
        anonymized = self.anonymizer.anonymize(
            text=text,
            analyzer_results=results,
            operators={
                "CREDIT_CARD": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 12, "from_end": True}),
                "EMAIL_ADDRESS": OperatorConfig("hash"),
                "PERSON": OperatorConfig("replace", {"new_value": "[PERSON]"})
            }
        )
        return anonymized.text, results

Data Retention
Different types of data require different retention periods based on regulatory requirements and business needs. Under the EU's 5th Anti-Money Laundering Directive, financial institutions must retain customer due diligence data for at least five years [6]. Automated retention policies ensure compliance without manual intervention.
Compliance Reminder: Retention periods vary by jurisdiction and industry. Always consult your legal team for specific requirements. The examples below are illustrative only.
from datetime import datetime, timedelta

class AuditLogger:
    # Example retention periods - adjust based on your jurisdiction and use case
    # EU AML directive: 5 years; Tax records: varies by country (6-10 years)
    RETENTION_DAYS = {"financial": 1825, "standard": 730, "operational": 365}

    def log(self, audit_data: dict):
        retention = self.RETENTION_DAYS.get(audit_data.get("classification"), 730)
        audit_data['data_retention_until'] = datetime.now() + timedelta(days=retention)
        self.db.add(LLMRequestAudit(**audit_data))
        self.db.commit()

Data Residency
When using third-party LLM APIs, your data travels to their infrastructure. For regulated industries, this raises critical questions: Where is the data processed? Is it stored? Who can access it? In the EU, GDPR requires that personal data be processed with adequate protections, which may restrict transfers to certain jurisdictions [7]. Similar data localization requirements exist in other regions (e.g., China's PIPL, Russia's data localization law).
Key considerations for LLM data residency:
| Concern | Question to Answer | Mitigation |
|---|---|---|
| Processing Location | Where are API servers located? | Use regional endpoints (Azure OpenAI EU, etc.) |
| Data Retention | Does the provider store prompts/responses? | Review provider data policies, opt out of training |
| Subprocessors | Who else handles the data? | Review provider's subprocessor list |
| Cross-border Transfer | Does data leave your jurisdiction? | Use providers with local presence |
class DataResidencyValidator:
    ALLOWED_REGIONS = ["eu-west-1", "eu-central-1"]  # EU only

    def validate_provider(self, provider_config: dict) -> bool:
        if provider_config["region"] not in self.ALLOWED_REGIONS:
            raise DataResidencyError(
                f"Provider region {provider_config['region']} not in allowed regions"
            )
        if not provider_config.get("data_processing_agreement"):
            raise ComplianceError("DPA required for this provider")
        return True

For highly sensitive data, consider self-hosted models or on-premise solutions that keep data entirely within your infrastructure.
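As a sketch, many self-hosted inference servers (vLLM, Ollama, and others) expose OpenAI-compatible endpoints, so the standard client can be pointed at infrastructure you control; the URL and model name below are illustrative:

from openai import OpenAI

# Point the standard client at a self-hosted, OpenAI-compatible server
# running inside your own network - prompts never leave your infrastructure
local_client = OpenAI(
    base_url="http://llm.internal.example:8000/v1",  # illustrative internal endpoint
    api_key="unused-for-internal-endpoint",
)

response = local_client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # illustrative self-hosted model
    messages=[{"role": "user", "content": "Classify the attached document."}],
)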
AI Risk Classification
The EU AI Act introduces a risk-based regulatory framework that categorizes AI systems by their potential impact on fundamental rights and safety [8]. Understanding where your GenAI system falls in this classification determines your compliance obligations.
EU AI Act Risk Categories
| Risk Level | Definition | Examples | Key Requirements |
|---|---|---|---|
| Unacceptable | Prohibited systems | Social scoring, emotion recognition in workplaces, real-time biometric surveillance | Banned in EU |
| High-Risk | Significant impact on rights/safety | Employment decisions, credit scoring, law enforcement, critical infrastructure | Mandatory conformity assessment, documentation, human oversight |
| Limited Risk | Transparency concerns | Chatbots, emotion recognition (opt-in), deepfakes | Transparency obligations (disclose AI use) |
| Minimal Risk | Low impact | Spam filters, AI-enabled games, recommendation systems | No specific obligations |
GenAI systems in regulated industries typically fall under High-Risk when used for:
- Credit decisions - Loan approvals, credit scoring (EU AI Act Annex III, Point 5(b))
- Employment - Recruitment screening, performance evaluation (Annex III, Point 4)
- Essential services - Healthcare diagnostics, insurance underwriting
- Law enforcement - Risk assessment tools
High-risk systems require conformity assessments, human oversight, and comprehensive documentation before deployment.
High-Risk System Compliance Checklist
If your GenAI system is classified as high-risk, you must implement:
- Risk Management System
  - Identify and analyze known/foreseeable risks
  - Implement mitigation measures
  - Test with representative data
  - Monitor throughout lifecycle
- Data Governance
  - Training data quality and relevance
  - Bias detection in datasets
  - Data protection measures (GDPR compliance in EU, applicable privacy laws elsewhere)
- Technical Documentation
  - System design and architecture
  - Model cards with performance metrics
  - Intended use and limitations
  - Human oversight measures
- Transparency Requirements
  - Users must be informed they're interacting with AI
  - Provide clear information on system capabilities and limitations
  - Document decision-making logic
- Human Oversight
  - Humans can override AI decisions
  - Ability to interrupt system operation
  - Humans understand system capabilities
class HighRiskAICompliance:
    """Enforce EU AI Act requirements for high-risk systems."""

    def __init__(self):
        self.risk_register = []
        self.oversight_enabled = True

    async def process_decision(self, input_data: dict, context: dict) -> dict:
        # Document that AI is being used (transparency requirement)
        self.log_ai_disclosure(context["user_id"])

        # Generate AI recommendation
        ai_decision = await self.ai_model.predict(input_data)

        # High-risk systems require human oversight
        if self.requires_human_review(ai_decision):
            ai_decision["status"] = "pending_human_review"
            ai_decision["human_override_available"] = True
            await self.queue_for_human_review(ai_decision)
        else:
            # Log decision with full traceability
            await self.audit_logger.log({
                "decision_id": ai_decision["id"],
                "input_hash": self.hash_input(input_data),
                "output": ai_decision,
                "model_version": self.ai_model.version,
                "risk_assessment": self.assess_risk(ai_decision),
                "human_override_available": True
            })
        return ai_decision

    def requires_human_review(self, decision: dict) -> bool:
        """Determine if human review is required."""
        return (
            decision.get("confidence") < 0.85 or
            decision.get("risk_score") > 0.7 or
            decision.get("contradicts_rules", False)
        )

Compliance Reminder: The EU AI Act applies to providers placing AI systems on the EU market and deployers using AI systems in the EU, regardless of where the provider is established. If you serve EU customers or have EU operations, these requirements likely apply to you.
Model Risk Management
In heavily regulated industries like banking, model risk management (MRM) frameworks apply to GenAI systems. In the US, the Federal Reserve's SR 11-7 guidance [9] establishes expectations for model development, implementation, and use. Similar frameworks exist in other jurisdictions (e.g., EBA guidelines in the EU, PRA SS1/23 in the UK). While originally written for traditional statistical models, regulators increasingly expect these principles to apply to AI/ML systems.
Key MRM requirements for GenAI:
- Model Documentation - Document the model's purpose, design, limitations, and assumptions
- Independent Validation - Have the model reviewed by parties not involved in development
- Ongoing Monitoring - Track model performance and drift over time
- Change Management - Formal processes for prompt and model changes
class ModelRiskDocumentation:
    def generate_model_card(self, model_config: dict) -> dict:
        return {
            "model_name": model_config["name"],
            "version": model_config["version"],
            "intended_use": model_config["purpose"],
            "limitations": [
                "May hallucinate facts",
                "Performance varies with input length",
                "Not validated for languages other than English"
            ],
            "training_data": "Third-party foundation model - training data not disclosed",
            "evaluation_metrics": self.get_latest_evaluation_results(),
            "risk_rating": self.calculate_risk_tier(model_config),
            "approved_by": model_config.get("approval_record"),
            "next_review_date": self.calculate_review_date(model_config)
        }

Bias and Fairness
LLMs can perpetuate or amplify biases present in their training data. In regulated contexts like lending, hiring, or insurance, biased outputs can violate fair treatment regulations. In the EU, the AI Act classifies AI systems used in employment, credit, and essential services as high-risk, requiring bias testing and documentation [8]. In the US, the EEOC and CFPB have issued guidance on AI fairness in employment and lending decisions.
class FairnessEvaluator:
    PROTECTED_ATTRIBUTES = ["gender", "race", "age", "disability"]
    DISPARITY_THRESHOLD = 0.2  # illustrative threshold - tune to your fairness policy

    async def evaluate_bias(self, model, test_cases: list) -> dict:
        results = {"overall_parity": True, "disparities": []}

        for attribute in self.PROTECTED_ATTRIBUTES:
            # Test with demographic variations
            group_results = await self.test_across_groups(model, test_cases, attribute)

            # Check for statistical parity
            disparity = self.calculate_disparity(group_results)
            if disparity > self.DISPARITY_THRESHOLD:
                results["overall_parity"] = False
                results["disparities"].append({
                    "attribute": attribute,
                    "disparity_ratio": disparity,
                    "details": group_results
                })
        return results

Bias mitigation strategies:
- Test across demographics - Evaluate outputs for different demographic groups
- Prompt engineering - Include fairness instructions in system prompts
- Output filtering - Flag or block potentially discriminatory outputs
- Human review - Route high-stakes decisions through human oversight
- Regular audits - Periodic fairness evaluations with updated test sets
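One way to implement the "test across demographics" strategy is counterfactual testing: build test cases that differ only in a demographic signal and compare outcomes. A minimal sketch; the group labels, names, and template are purely illustrative:

# Hypothetical counterfactual test-case generator for test_across_groups
NAME_VARIANTS = {
    "group_a": ["James Smith", "Emily Johnson"],
    "group_b": ["Wei Chen", "Amara Okafor"],
}

def make_counterfactuals(template: str) -> dict[str, list[str]]:
    # template example: "Loan application from {name}, income 45,000 EUR, requested 10,000 EUR"
    return {
        group: [template.format(name=name) for name in names]
        for group, names in NAME_VARIANTS.items()
    }

# Outcomes per group can then be compared with calculate_disparity()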
Testing Production GenAI
Testing GenAI systems is fundamentally different from testing traditional software. The LLM itself is non-deterministic, but the components around it (prompt loaders, validators, fallback handlers) are fully testable. The key is knowing what to test and how.
Testing Strategy Overview
| Test Type | What to Test | Deterministic? | Execution Speed |
|---|---|---|---|
| Unit Tests | Prompt loading, validation logic, parsers | Yes | Fast (ms) |
| Integration Tests | Component interaction with mocked LLM | Yes | Fast (ms) |
| Evaluation Pipeline | Golden datasets, LLM-as-judge, regression detection, A/B testing | No | Slow (min) |
Unit Tests for Deterministic Components
While you can't reliably unit test LLM outputs, you can test everything else: prompt construction, output parsing, validation logic, and error handling. These tests run fast and catch regressions in your infrastructure code.
import pytest

def test_prompt_loader():
    loader = PromptLoader()
    messages = loader.build_messages("classifier", {"document_text": "Test doc"})
    assert "classification system" in messages[0]["content"]
    assert len(loader.load("classifier")['version']) == 8

def test_output_validator():
    with pytest.raises(ValidationError):
        OutputValidator().validate("INVALID_CATEGORY")

Integration Tests with LLM Mocking
Integration tests verify that components work together correctly without relying on actual LLM API calls. Mocking LLM responses lets you test error handling, validation logic, and business workflows deterministically.
from unittest.mock import patch

@pytest.mark.asyncio
async def test_classification_flow():
    mock_response = MockLLMResponse(content="INVOICE")
    with patch('llm_client.generate', return_value=mock_response):
        result = await classifier.classify({"text": "Invoice #12345"})
        assert result["category"] == "INVOICE"

Building a Robust Evaluation Pipeline
Production GenAI systems need continuous, multi-dimensional evaluation. Unlike traditional ML where accuracy on a test set is often sufficient, LLM outputs require evaluating multiple quality dimensions: correctness, relevance, coherence, safety, and task-specific criteria.
Golden Datasets
The foundation of any evaluation setup is a curated dataset with known correct answers. Golden datasets let you measure accuracy, detect regressions when prompts change, and compare different models or approaches systematically.
import pandas as pd
from sklearn.metrics import classification_report

async def evaluate_on_golden_set(classifier, golden_set_path: str) -> dict:
    test_data = pd.read_csv(golden_set_path)
    predictions = []

    for _, row in test_data.iterrows():
        result = await classifier.classify({"text": row["text"]})
        predictions.append(result["category"])

    report = classification_report(
        test_data["expected_category"], predictions, output_dict=True
    )
    return {
        "accuracy": report["accuracy"],
        "per_class_metrics": report,
        "num_samples": len(test_data)
    }

Tips for building golden datasets:
- Start small, grow iteratively - Begin with 50-100 high-quality examples, expand based on production edge cases
- Include edge cases - Add examples that have caused issues in production
- Version your datasets - Track changes as you add or modify examples
- Balance across categories - Ensure sufficient coverage of all expected output types
Evaluation Dimensions
| Dimension | What It Measures | Evaluation Method |
|---|---|---|
| Correctness | Factual accuracy | Ground truth comparison, fact-checking |
| Relevance | Response addresses the query | LLM-as-judge, semantic similarity |
| Coherence | Logical flow, readability | LLM-as-judge, human review |
| Safety | No harmful/toxic content | Guardrail checks, toxicity classifiers |
| Consistency | Same input → similar output | Multiple runs, variance analysis |
| Latency | Response time | Percentile tracking (p50, p95, p99) |
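The consistency row can be measured directly by re-running identical inputs; a small sketch using the classifier interface from earlier examples:

from collections import Counter

async def measure_consistency(classifier, input_data: dict, runs: int = 5) -> float:
    # Re-run the same input several times and report the share of runs that
    # agree with the most common answer (1.0 = fully consistent)
    outputs = []
    for _ in range(runs):
        result = await classifier.classify(input_data)
        outputs.append(result["category"])
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs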
LLM-as-Judge for Subjective Metrics
For dimensions like relevance and coherence, human evaluation doesn't scale. LLM-as-judge uses a separate model to evaluate outputs against defined criteria. This approach, while not perfect, correlates well with human judgments when prompts are carefully designed [10].
import json

class LLMEvaluator:
    EVALUATION_PROMPT = """
    Evaluate the following response on a scale of 1-5 for each criterion.

    Query: {query}
    Response: {response}

    Criteria:
    - Relevance: Does the response address the query?
    - Accuracy: Is the information factually correct?
    - Completeness: Does it cover all aspects of the query?

    Return JSON: {{"relevance": int, "accuracy": int, "completeness": int, "reasoning": str}}
    """

    async def evaluate(self, query: str, response: str) -> dict:
        result = await self.judge_model.generate(
            self.EVALUATION_PROMPT.format(query=query, response=response)
        )
        return json.loads(result)

Tip: Use a different model family for evaluation than for generation to avoid self-preference bias. If you generate with GPT-4o, consider evaluating with Claude, or vice versa.
Continuous Evaluation Pipeline
Evaluation shouldn't be a one-time activity. Set up automated pipelines that run on every prompt change, model update, or on a regular schedule to catch regressions early.
from datetime import datetime

class EvaluationPipeline:
    def __init__(self):
        self.metrics_store = MetricsStore()
        self.alert_threshold = 0.05  # 5% regression threshold

    async def run_evaluation(self, prompt_version: str) -> dict:
        # Load evaluation dataset
        eval_data = self.load_eval_dataset()
        results = {
            "prompt_version": prompt_version,
            "timestamp": datetime.now().isoformat(),
            "metrics": {}
        }

        # Run predictions
        predictions = await self.batch_predict(eval_data)

        # Calculate metrics across dimensions
        results["metrics"]["accuracy"] = self.calculate_accuracy(predictions, eval_data)
        results["metrics"]["avg_latency_ms"] = self.calculate_latency(predictions)
        results["metrics"]["llm_judge_scores"] = await self.run_llm_evaluation(predictions)

        # Compare against baseline
        baseline = self.metrics_store.get_baseline()
        regressions = self.detect_regressions(results["metrics"], baseline)

        if regressions:
            await self.alert_team(f"Regression detected: {regressions}")
            results["status"] = "regression_detected"
        else:
            results["status"] = "passed"

        self.metrics_store.save(results)
        return results

    def detect_regressions(self, current: dict, baseline: dict) -> list:
        regressions = []
        for metric, value in current.items():
            if isinstance(value, (int, float)):
                baseline_value = baseline.get(metric, value)
                if value < baseline_value * (1 - self.alert_threshold):
                    regressions.append(f"{metric}: {baseline_value:.2f} → {value:.2f}")
        return regressions

A/B Testing Prompts
When iterating on prompts, A/B testing lets you compare variants on real traffic before fully rolling out changes. This catches issues that evaluation datasets miss and provides statistical confidence in improvements.
import hashlib

class PromptABTest:
    def __init__(self, control_prompt: str, variant_prompt: str, traffic_split: float = 0.1):
        self.control = control_prompt
        self.variant = variant_prompt
        self.variant_traffic = traffic_split  # 10% to variant

    async def get_prompt(self, request_id: str) -> tuple[str, str]:
        # Deterministic assignment based on request_id
        bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
        if bucket < self.variant_traffic * 100:
            return self.variant, "variant"
        return self.control, "control"

    async def analyze_results(self) -> dict:
        control_metrics = await self.metrics_store.get_metrics(group="control")
        variant_metrics = await self.metrics_store.get_metrics(group="variant")
        return {
            "control": control_metrics,
            "variant": variant_metrics,
            "improvement": self.calculate_significance(control_metrics, variant_metrics)
        }

Key principles for robust evaluation:
- Evaluate before deploying - Block deployments that fail evaluation thresholds
- Track trends over time - Single-point metrics miss gradual degradation
- Combine automated and human review - Use humans for edge cases and calibration
- Version everything - Link evaluation results to specific prompt and model versions
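To make "evaluate before deploying" concrete, the pipeline above can gate a CI job; a sketch in which the PROMPT_VERSION environment variable and the exit-code convention are assumptions about your CI setup:

import asyncio
import os
import sys

async def main() -> int:
    pipeline = EvaluationPipeline()
    results = await pipeline.run_evaluation(prompt_version=os.environ["PROMPT_VERSION"])
    if results["status"] != "passed":
        print(f"Evaluation gate failed: {results['status']}")
        return 1  # non-zero exit blocks the deployment
    return 0

if __name__ == "__main__":
    sys.exit(asyncio.run(main()))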
Common Pitfalls
Key Insight: The following pitfalls are based on real production deployments. Avoiding these can save months of rework and significant cost.
Over-reliance on Model Performance
Problem: Focusing only on accuracy while ignoring latency, cost, and reliability.
I've seen teams celebrate 95% accuracy while ignoring that their p95 latency is 15 seconds or that token costs are 10x their budget. In production, a system that's 90% accurate, costs 100 USD/month, and responds in 2 seconds is often better than one that's 95% accurate, costs 2,000 USD/month, and takes 8 seconds.
Solution: Track composite metrics that matter to the business:
# Business-level SLA
sla_score = (
    0.4 * accuracy +                          # Correctness matters
    0.3 * (1 - p95_latency / max_latency) +   # Speed matters
    0.2 * availability +                      # Uptime matters
    0.1 * (1 - cost / budget)                 # Cost matters
)

Ignoring Prompt Injection Risks
Problem: Users can manipulate outputs through carefully crafted inputs.
Prompt injection attacks, where users craft inputs to manipulate LLM behavior, are an active area of security research. The OWASP Top 10 for LLM Applications lists prompt injection as the #1 risk [11]. Unlike SQL injection, there's no perfect defense - prompts and user inputs exist in the same context space.
Security Note: There is no foolproof defense against prompt injection. String pattern matching and input sanitization are easily bypassed. Focus on architectural defenses and output validation.
Solution: Defense in depth approach - no single mitigation is sufficient, but layering defenses reduces risk:
Note: Prompt injection is an evolving threat. Research from Simon Willison and others demonstrates that string pattern matching is insufficient [12]. Focus on:
- Proper role separation (system vs user messages)
- Output validation and structured outputs
- Monitoring for anomalous behavior
- Human review for high-stakes decisions
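A minimal illustration of role separation: untrusted content stays in the user message and is never interpolated into the system prompt (SYSTEM_PROMPT stands in for the version-controlled prompt from earlier sections):

def build_classification_messages(untrusted_document: str) -> list[dict]:
    return [
        # Trusted, version-controlled instructions
        {"role": "system", "content": SYSTEM_PROMPT},
        # Untrusted input, passed purely as data to classify
        {"role": "user", "content": untrusted_document},
    ]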
No Feedback Loop
Problem: No mechanism to improve the system based on production data.
Your model's accuracy on a test set is interesting. Its accuracy on real user queries is what matters. Without collecting feedback from production use (whether explicit thumbs up/down or implicit signals like user corrections), you're flying blind. Production data reveals edge cases your test set missed.
Solution: Implement human feedback collection:
class FeedbackCollector:
    async def collect(self, request_id: str, feedback: dict):
        await self.db.feedbacks.insert_one({
            "request_id": request_id,
            "rating": feedback["rating"],
            "correct": feedback.get("correct")
        })
        if await self.calculate_recent_accuracy() < 0.85:
            await self.trigger_retraining()

Underestimating Costs
Problem: Token usage spirals out of control in production.
What costs 5 dollars in development can cost 5,000 dollars in production when you're processing thousands of requests per day. Long prompts, verbose outputs, and unnecessary API calls add up quickly. Cost monitoring needs to be first-class, not an afterthought.
Solution: Implement cost tracking and budgets:
class CostController:
    async def check_budget(self, user_id: str):
        usage = await self.get_monthly_usage(user_id)
        budget = await self.get_user_budget(user_id)
        if usage >= budget * 0.9:
            await self.send_alert(user_id, "approaching_budget")
        if usage >= budget:
            raise BudgetExceededError(f"Budget exceeded: {usage}/{budget}")

Conclusion
Building production GenAI systems in regulated industries requires more than prompt engineering. It demands a comprehensive approach across architecture, reliability, observability, compliance, and testing.
Key Takeaways
| Priority | Focus Area | Why It Matters |
|---|---|---|
| Architecture | Separation of concerns, structured outputs, YAML prompts | Maintainability at scale |
| Reliability | Circuit breakers, multi-provider, caching, streaming | Service availability |
| Observability | Metrics, tracing, alerts | Detect issues early |
| Compliance | Audit logs, PII handling, data residency, bias testing | Legal requirements |
| Risk Management | Model documentation, validation, change control | Regulatory expectations |
| Testing & Evaluation | Unit tests, integration, LLM-as-judge, A/B testing, regression detection | Confidence in changes |
| Cost Control | Token tracking, caching, budgets | Financial sustainability |
The excitement of GenAI capabilities should never overshadow the fundamentals of production systems: reliability, observability, and security.
If you're building GenAI systems in regulated environments, focus on making the boring stuff excellent. The AI will only be as valuable as the infrastructure supporting it.
References

[1] OpenAI. (2025). "Structured Outputs." https://platform.openai.com/docs/guides/structured-outputs
[2] Instructor. (2025). "Structured outputs powered by LLMs." https://github.com/jxnl/instructor
[3] Nygard, M. (2018). "Release It! Design and Deploy Production-Ready Software" (2nd ed.). Pragmatic Bookshelf.
[4] Zilliz. (2025). "Semantic Cache: A Guide to Cache Optimization for LLMs." https://zilliz.com/learn/semantic-cache
[5] European Parliament and Council. (2016). "General Data Protection Regulation (GDPR) - Article 32: Security of processing." https://gdpr-info.eu/art-32-gdpr/
[6] European Parliament and Council. (2018). "Directive (EU) 2018/843 - Fifth Anti-Money Laundering Directive." https://eur-lex.europa.eu/eli/dir/2018/843/oj
[7] European Parliament and Council. (2016). "General Data Protection Regulation (GDPR) - Chapter V: Transfers of personal data to third countries." https://gdpr-info.eu/chapter-5/
[8] European Parliament and Council. (2024). "Regulation (EU) 2024/1689 - Artificial Intelligence Act." https://eur-lex.europa.eu/eli/reg/2024/1689/oj
[9] Board of Governors of the Federal Reserve System (US). (2011). "SR 11-7: Guidance on Model Risk Management." https://www.federalreserve.gov/supervisionreg/srletters/sr1107.htm
[10] Zheng, L., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. https://arxiv.org/abs/2306.05685
[11] OWASP Foundation. (2025). "OWASP Top 10 for Large Language Model Applications." https://genai.owasp.org/
[12] Willison, S. (2023-2025). "Prompt injection: What's the worst that can happen?" and ongoing research. https://simonwillison.net/series/prompt-injection/