Day 20: LLM App Architecture for Production

Objectives

After this lesson, you should be able to:

Explain why a production LLM app is more than a single API call to a model.
Design an architecture that includes an API Gateway, LLM Orchestrator, Prompt Registry, Model Router, Provider Adapter, Cache, Quota, Audit Log and Observability.
Configure timeout, retry, fallback, rate limit, queue and cache for LLM workloads.
Design multi-tenancy so that cache entries, prompts and tool results do not leak between tenants, and one tenant cannot consume another tenant's quota.
Control cost with token budgets, model routing, cache, quota, dashboards and alerts.
Build a near-production FastAPI skeleton with a router, provider adapters, retry, timeout, fallback, cache, audit events and metrics metadata.
Answer clearly: can this be used in production, and if so, under what conditions?

TL;DR

A production LLM app is a distributed system with a dependency that is slow, expensive, non-deterministic and carries its own security risks. If every feature team calls the provider SDK directly, the system quickly loses control over prompt version, model version, cost, retry, data policy, audit and rollback.

A more practical pattern is to route all LLM calls through an LLM Orchestrator or LLM Gateway. This component is responsible for building versioned prompts, choosing a model, calling the provider adapter, enforcing quota, caching, timeout, retry and fallback, logging audit events and emitting observability metadata.

1. Where Day 20 Fits in Phase 3

Day 17 covered LLM fundamentals. Day 18 focused on prompt engineering. Day 19 turned output into a contract with structured output and tool calling. Day 20 assembles those pieces into a backend architecture that can be operated.

Day 17: model behavior and tokens
Day 18: prompt design
Day 19: structured output and tool boundary
Day 20: production architecture, reliability, cost, observability
Day 21: choosing between Raw SDK, LangChain, LlamaIndex, LangGraph

From a Senior Software Engineer's point of view:

LLM provider = external dependency with an SLA, rate limits, cost and a data policy
Prompt = versioned production artifact
Model = runtime dependency that needs routing, rollback and evaluation
LLM response = untrusted output that needs validation
Tool call = RPC proposed by the model; the app is what executes it

2. Architecture Overview

Minimal architecture for a production-style LLM app:

Client
  -> API Gateway / Auth
  -> LLM Orchestrator
      -> Tenant Policy / Quota
      -> Prompt Registry
      -> Model Router
      -> Cache Layer
      -> Provider Adapter(s)
          -> Hosted LLM Provider
          -> Local LLM / vLLM / Ollama
      -> Tool Services
      -> Audit Log
      -> Observability
  -> Response

Mapping to familiar backend systems:

Component	SE analogy	Main responsibility
API Gateway	Edge gateway	Auth, request size, tenant resolution, coarse rate limit
LLM Orchestrator	Application service	Coordinates prompt, cache, router, provider, retry, fallback
Prompt Registry	Config registry	Prompt version, owner, changelog, rollout, eval score
Model Router	Policy engine/load balancer	Chooses a model by task, tenant, latency, cost, quality, availability
Provider Adapter	DB/payment adapter	Normalizes each provider's SDK/API
Tool Services	Internal microservices	Provide capabilities with permission checks and audit
Cache Layer	Redis/CDN-like cache	Exact cache, tool result cache, retrieval cache, semantic cache
Audit Log	Compliance event log	Traces who called what, with which model, prompt version and tool
Observability	APM/tracing/metrics	Latency, token, cost, error, cache hit, retry, fallback

An important principle: business code should not know the details of each provider's SDK. It should call an internal interface such as LLMClient.generate() or an endpoint such as /llm/chat, while the gateway/orchestrator handles policy.

3. How Do Orchestrator and Gateway Differ?

In many teams, these two concepts are either merged or kept separate:

Approach	When it fits	Trade-off
Thin LLM Gateway only	Many services need to call LLMs through one standard adapter	Easy to reuse but may lack business context
Orchestrator inside each app	Workflow is tightly coupled to the domain, tools and user journey	Easy to optimize per domain but risks duplicated policy
Gateway + Orchestrator	Shared AI platform for many production apps	Costs design effort for contracts, versioning and ownership

Recommendation for this course: start with an LLM Orchestrator inside the backend app, but keep the provider adapter and policy clean enough to split out into a separate gateway later if several teams share it.

4. Prompt Registry: Prompts Are Artifacts

In production, prompts should not be strings scattered through the code. They need metadata, like config or an API contract.

A prompt registry should store:

prompt_id, for example support_triage.
version, for example v1.3.0.
Template text and input variables.
Owner/team responsible.
Model compatibility.
Output schema version.
Eval score on the golden set.
Changelog.
Rollout status: draft, canary, stable, deprecated.

Example metadata:

prompt_id: support_triage
version: v1.3.0
owner: support-platform
task: ticket_triage
compatible_models:
  - fast-classifier-v2
  - strong-reasoner-v1
schema_version: ticket_triage.v2
rollout: canary
eval:
  golden_set: support_tickets_2026_04
  exact_json_rate: 0.992
  priority_macro_f1: 0.87

Trace logs and cache keys should always include prompt_id, prompt_version, schema_version and model_id. Otherwise, when output changes you cannot tell whether the cause is the prompt, model, schema, data or a tool.
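The registry idea above can be sketched in a few lines. This is a minimal in-memory illustration, not the course's actual implementation: `PromptRecord`, `PromptRegistry` and the template text are hypothetical names chosen for this example, and a real registry would normally be backed by a config service or database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptRecord:
    prompt_id: str
    version: str
    template: str           # str.format placeholders, e.g. "{ticket}"
    schema_version: str
    rollout: str = "draft"  # draft | canary | stable | deprecated


class PromptRegistry:
    """In-memory registry keyed by (prompt_id, version). Hypothetical sketch."""

    def __init__(self) -> None:
        self._records: dict[tuple[str, str], PromptRecord] = {}

    def register(self, record: PromptRecord) -> None:
        key = (record.prompt_id, record.version)
        if key in self._records:
            # Published versions are immutable: a change means a new version.
            raise ValueError(f"{key} is already registered")
        self._records[key] = record

    def get(self, prompt_id: str, version: str) -> PromptRecord:
        try:
            return self._records[(prompt_id, version)]
        except KeyError:
            raise LookupError(f"unknown prompt {prompt_id}@{version}") from None

    def render(self, prompt_id: str, version: str, **variables: str) -> tuple[str, dict[str, str]]:
        """Return the rendered prompt plus metadata for traces and cache keys."""
        record = self.get(prompt_id, version)
        metadata = {
            "prompt_id": record.prompt_id,
            "prompt_version": record.version,
            "schema_version": record.schema_version,
        }
        return record.template.format(**variables), metadata


registry = PromptRegistry()
registry.register(
    PromptRecord(
        prompt_id="support_triage",
        version="v1.3.0",
        template="Classify this ticket: {ticket}",  # illustrative template
        schema_version="ticket_triage.v2",
        rollout="canary",
    )
)
prompt, meta = registry.render("support_triage", "v1.3.0", ticket="Charged twice")
```

Returning the metadata together with the rendered text makes it harder to forget the version fields when building a trace or a cache key.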

5. Model Router

The model router chooses a model based on policy, not gut feeling. Commonly used signals:

Task type: chat, extraction, classification, reasoning, summarization, code.
Tenant tier: free, pro, enterprise.
SLO: latency target, availability target.
Cost budget: cost/request, daily budget, monthly budget.
Data policy: whether the provider may process PII, in which region, with what retention.
Quality requirement: whether a strong model is needed or a small one is enough.
Context length: long or short input.
Availability: provider is failing, returning 429, or p95 is too high.

Example routing rules:

Task	Primary model	Fallback	Reason
Short classification/extraction	Small/cheap model	Strong model or another provider	Short output, clear schema, low cost
Complex reasoning	Strong model	Strong model from another provider	Quality matters more than cost
Enterprise sensitive data	Provider with a suitable data policy, or local model	Degrade mode/manual review	Privacy and compliance
High throughput FAQ	Cheap hosted model + cache	Local vLLM	Optimizes cost/latency
Long report async	Strong model via queue	Retry later/manual review	Should not block a realtime request

Fallback is not free. A fallback model can differ in format, quality, latency and safety behavior. Fallback therefore needs to be tested against its own golden set, not just for "does it return a response".

6. Provider Adapters

A provider adapter hides the differences between SDKs/APIs:

class LLMProvider(Protocol):
    name: str
    model: str

    async def generate(self, request: ProviderRequest) -> ProviderResponse:
        ...

The adapter should normalize:

Input messages or prompt.
temperature, max_output_tokens, response_format.
Timeout.
Error type: rate limit, timeout, provider unavailable, invalid request.
Token usage.
Model/provider metadata.
Streaming or non-streaming contract.

Feature teams should not each call their own provider SDK, because of the following problems:

Cost is hard to audit per tenant/team/feature.
Data policy and PII logging are hard to enforce.
Prompt/model rollback is hard.
Retry/fallback is done differently everywhere.
Observability is inconsistent.
Secrets get copied to many places.
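To make the adapter contract concrete, here is a hedged sketch of what the request/response types and one mock adapter might look like. The field names on `ProviderRequest` and `ProviderResponse`, the `ProviderError` class and the word-count "token" approximation are assumptions made for this example; a real adapter would parse token usage from the provider's response.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ProviderRequest:
    messages: list[dict[str, str]]
    max_output_tokens: int = 256
    temperature: float = 0.0


@dataclass(frozen=True)
class ProviderResponse:
    text: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int


class ProviderError(Exception):
    """Internal error type that every adapter maps provider failures onto."""

    def __init__(self, error_type: str, retryable: bool) -> None:
        super().__init__(error_type)
        self.error_type = error_type  # e.g. rate_limit, timeout, provider_unavailable
        self.retryable = retryable


class LLMProvider(Protocol):
    name: str
    model: str

    async def generate(self, request: ProviderRequest) -> ProviderResponse: ...


class MockFastProvider:
    """Mock adapter: echoes a truncated copy of the last message."""

    name = "mock-fast"
    model = "fast-extract-v1"

    def __init__(self) -> None:
        self.fail = False  # flip to simulate an outage

    async def generate(self, request: ProviderRequest) -> ProviderResponse:
        if self.fail:
            raise ProviderError("provider_unavailable", retryable=True)
        words = request.messages[-1]["content"].split()
        answer_words = words[: request.max_output_tokens]
        return ProviderResponse(
            text=" ".join(answer_words),
            provider=self.name,
            model=self.model,
            input_tokens=len(words),          # assumption: 1 word ~ 1 token
            output_tokens=len(answer_words),
        )


async def call(provider: LLMProvider, request: ProviderRequest) -> ProviderResponse:
    # Business code depends only on the protocol, never on a provider SDK.
    return await provider.generate(request)
```

Because `call` is typed against the protocol, swapping the mock for a real provider only requires a new class with the same `generate` signature and the same error mapping.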

7. Reliability: Timeout, Retry, Fallback, Circuit Breaker

An LLM dependency has its own failure modes:

429 from provider rate limits.
5xx or provider outage.
Timeout or interrupted streaming.
Output that violates the schema.
Tool call failure.
Prompt too long, so the request is rejected.
Cost spike from overly long output or too many retries.

Patterns to have in place:

Pattern	Use when	Production note
Timeout	Every LLM/tool call	Timeout should be smaller than the overall API deadline
Retry with backoff	Transient `429`, `5xx`, network error	Cap attempts, add jitter, never retry indefinitely
Fallback model/provider	Primary fails or is too slow	Fallback quality needs to be evaluated
Circuit breaker	Provider fails repeatedly	Avoids clogging the whole system
Queue	Long jobs, batch, reports	Needs deadline, max depth, retry policy
Bulkhead	Isolating important tenants/tasks	One tenant must not be able to starve another
Cancellation	Client disconnects or deadline expires	Avoids burning tokens for nothing

Retry rule: only retry operations that are safe to repeat. For tools with side effects such as sending email, creating a refund or updating a ticket, require an idempotency key and an audit log entry before retrying.
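The timeout, bounded retry and fallback patterns from the table above can be combined roughly as follows. This is a sketch under assumptions: `TransientError`, `call_with_retry` and `generate_with_fallback` are hypothetical names, and the provider calls are plain async callables rather than real SDK calls.

```python
import asyncio
import random
from typing import Awaitable, Callable


class TransientError(Exception):
    """Stands in for retryable failures such as 429 or 5xx (hypothetical)."""


async def call_with_retry(
    call: Callable[[], Awaitable[str]],
    *,
    timeout_s: float,
    max_attempts: int,
    base_delay_s: float = 0.1,
) -> tuple[str, int]:
    """Return (result, retry_count). Raises the last error when attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        try:
            # Each attempt gets its own timeout, smaller than the API deadline.
            return await asyncio.wait_for(call(), timeout=timeout_s), attempt
        except (TransientError, asyncio.TimeoutError):
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff plus jitter to avoid synchronized retries.
            delay = base_delay_s * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay))
    raise AssertionError("unreachable")


async def generate_with_fallback(
    primary: Callable[[], Awaitable[str]],
    fallback: Callable[[], Awaitable[str]],
    **retry_kwargs,
) -> tuple[str, int, bool]:
    """Return (result, retry_count, fallback_used)."""
    try:
        result, retries = await call_with_retry(primary, **retry_kwargs)
        return result, retries, False
    except (TransientError, asyncio.TimeoutError):
        result, retries = await call_with_retry(fallback, **retry_kwargs)
        return result, retries, True
```

Note that the worst case here is roughly `2 * max_attempts * timeout_s` plus backoff, so these values have to be chosen against the overall API deadline rather than in isolation.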

8. Cache: Exact, Tool Result, Retrieval, Semantic

Caching can cut latency and cost substantially, but a wrong cache design can cause data leaks.

Cache	Key	Use when	Risk
Exact prompt cache	Hash of tenant, prompt version, schema, model, normalized input	FAQ, deterministic extraction, repeated ticket triage	PII, invalidation, prompt/model drift
Tool result cache	tenant, tool name, normalized args, permission context	Order/profile lookups that rarely change	Stale data, permission drift
Retrieval cache	tenant, query, index version, ACL hash	Repeated RAG traffic	Document version drift, ACL leak
Semantic cache	tenant, embedding(query), threshold, prompt version	Public/high-traffic FAQ	Wrong context, permission-sensitive answer

Production rule: the cache key must include tenant_id, prompt_id, prompt_version, schema_version, model_id, plus the permission context if the output depends on access rights.

Do not cache raw prompts/responses containing PII without a clear policy. Options are to cache only metadata, or to cache after redaction.

9. Multi-tenancy and Quota

Tenant isolation must be end to end:

auth token
  -> tenant_id
  -> quota bucket
  -> prompt access
  -> cache namespace
  -> tool permission
  -> provider key/policy
  -> audit log partition

Common production failures:

Cache key is missing tenant_id, so tenant A receives tenant B's answer.
Tool service only checks that the user is logged in, not the tenant permission.
Raw prompts containing PII from multiple tenants are logged into one index with no access control.
A free tenant uses the enterprise model because the router does not check the tier.
A shared provider key lets one tenant use up another tenant's quota.

Quota should have several layers:

Requests/minute per tenant and per user.
Tokens/day or cost/day per tenant.
Concurrent requests per tenant.
Max input tokens and max output tokens per endpoint/task.
Budget alert before the hard limit.
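One of these layers, a per-tenant daily cost budget with an alert threshold ahead of the hard limit, might look like the sketch below. `TenantBudget`, the 80% `alert_ratio` and the three decision strings are assumptions for illustration; a production version would persist spend in a shared store and reset it daily.

```python
from dataclasses import dataclass


@dataclass
class TenantBudget:
    """Daily cost budget for one tenant (hypothetical, in-memory)."""

    daily_budget_usd: float
    alert_ratio: float = 0.8  # assumption: alert at 80% of the budget
    spent_usd: float = 0.0

    def check(self, estimated_cost_usd: float) -> str:
        """Decide before calling the provider: allow, alert, or reject."""
        projected = self.spent_usd + estimated_cost_usd
        if projected > self.daily_budget_usd:
            return "reject"  # the API layer would map this to HTTP 429
        if projected >= self.daily_budget_usd * self.alert_ratio:
            return "alert"   # still served, but a budget alert is emitted
        return "allow"

    def record(self, actual_cost_usd: float) -> None:
        """Record real spend after the provider reports token usage."""
        self.spent_usd += actual_cost_usd


# One bucket per tenant keeps one tenant's spend away from another's.
budgets = {"tenant_free": TenantBudget(1.0), "tenant_pro": TenantBudget(20.0)}
decision = budgets["tenant_free"].check(0.5)
```

Checking against an estimate before the call and recording the actual cost afterwards keeps the hard limit enforceable even when the estimate is off.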

10. Audit Log and Observability

The audit log answers the question: "Who did what, when, with which model/prompt/tool, at what cost, and what was the policy outcome?"

Minimal audit event:

{
  "trace_id": "uuid",
  "tenant_id": "tenant_a",
  "user_id_hash": "hash",
  "endpoint": "/chat",
  "task": "extract",
  "prompt_id": "support_triage",
  "prompt_version": "v1.3.0",
  "schema_version": "ticket_triage.v2",
  "provider": "provider_a",
  "model": "fast-classifier-v2",
  "input_tokens": 230,
  "output_tokens": 80,
  "estimated_cost_usd": 0.0012,
  "latency_ms": 842,
  "cache_hit": false,
  "retry_count": 1,
  "fallback_used": false,
  "tool_names": ["lookup_order"],
  "policy_decision": "allow",
  "error_code": null
}

Observability should separate metrics, logs and traces:

Metrics: p50/p95/p99 latency, error rate, timeout rate, fallback rate, cache hit rate, token/request, cost/tenant.
Logs: structured events, error details, policy decisions; no raw PII logged by default.
Traces: spans for gateway, prompt build, cache lookup, provider call, tool call, validation, response.

A production dashboard should have at least:

Latency by endpoint/task/model/provider.
Cost by tenant/team/feature/model.
Error rate by provider and error type.
Fallback and retry rate.
Cache hit rate.
Top tenants by token/cost.

11. Cost Controls

LLM cost usually grows because of long input, long output, retries, tool loops and models that are too strong for simple tasks.

Controls should be placed at several points:

Max input length and max output tokens.
Router uses a small model for simple tasks.
Exact cache for repeated requests.
Semantic cache only with ACL and a good threshold.
Per-tenant budget and alert.
Daily/monthly hard cap.
Reject or degrade when the budget runs out.
Log token usage and estimated cost for every request.
Golden set to measure quality before switching to a cheaper model.

Example policy:

Tenant tier	Model default	Daily budget	Max output tokens	Fallback when budget runs out
Free	small	1 USD	256	Return a quota error or template response
Pro	balanced	20 USD	1024	Switch to the small model
Enterprise	strong, depending on task	Contract-specific	2048+	Queue/manual review/degrade mode
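As a small worked example of two of these controls, the sketch below clamps a requested output length to the tier limit and estimates request cost from token counts. The function names are hypothetical, the enterprise cap is pinned to 2048 purely for illustration (the table says 2048+), and the single flat rate per 1k tokens is a simplification; many providers price input and output tokens differently.

```python
# Tier limits taken from the example policy table; enterprise value is an assumption.
MAX_OUTPUT_TOKENS = {"free": 256, "pro": 1024, "enterprise": 2048}


def clamp_output_tokens(tier: str, requested: int) -> int:
    """Never let a request exceed the tier's max output tokens."""
    return max(1, min(requested, MAX_OUTPUT_TOKENS[tier]))


def estimate_cost_usd(input_tokens: int, output_tokens: int, cost_per_1k_tokens_usd: float) -> float:
    """Flat-rate estimate: total tokens / 1000 * price per 1k tokens."""
    return (input_tokens + output_tokens) / 1000 * cost_per_1k_tokens_usd


# Worst-case cost of one free-tier request that asks for far too much output.
allowed = clamp_output_tokens("free", 4000)
worst_case = estimate_cost_usd(input_tokens=500, output_tokens=allowed, cost_per_1k_tokens_usd=0.0002)
```

Computing the worst case from the clamped value before the call is what lets a quota check reject a request up front instead of discovering the overspend afterwards.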

12. Performance Considerations

Total latency is typically:

auth
+ request validation
+ prompt build
+ cache lookup
+ provider queueing
+ time to first token
+ output generation
+ tool calls
+ validation/postprocess
+ logging

Points to remember:

Output tokens are a major latency driver. Generating 1000 tokens takes several times longer than generating 100.
Streaming reduces perceived latency but not total compute.
Retry/fallback can push p95/p99 up sharply even when p50 still looks good.
A tool loop multiplies the number of LLM calls.
A cache hit rate of 20-40% can reduce cost noticeably for FAQ workloads.
A queue protects the realtime API but needs a deadline, max depth and backpressure.
The provider adapter must expose token usage so cost can be computed more accurately than an estimate.

Sample latency budget:

Stage	Budget v1
Auth/API validation	20ms
Tenant policy/quota	10ms
Prompt build/cache lookup	30ms
LLM first response	800-2000ms
Tool call	100-500ms
Postprocess/validation	20ms
Audit log async enqueue	5-20ms
p95 target non-streaming	3-5s

13. Trade-offs

Option	Use when	Avoid when	Production note
Direct raw SDK	POC, small script, single team	Many teams, many providers, audit required	Fast but weak governance
LLM Gateway	Many apps/teams calling LLMs	One-day prototype	More platform work but less risk
Single provider	SLO is acceptable, small team, simplicity needed	High availability/vendor hedge needed	Less ops, easier to optimize
Multi-provider	Need fallback, cost routing, negotiation leverage	Output consistency is critical	Each provider needs eval
Sync request	Short output, SLA < 5s	Long jobs, multi-step agent	Simpler API/UX
Async queue	Batch, report, long workflow	Realtime chat that needs token streaming	Needs job state and retry policy
Exact cache	Repeated, deterministic requests	Input with PII/highly dynamic	Safer than semantic cache
Semantic cache	High-traffic FAQ	Permission-sensitive answer	Needs ACL, threshold and eval
Local model	Privacy, cost at scale, predictable workload	Low traffic, no GPU ops	Needs serving stack and model ops

14. FastAPI Skeleton in This Lesson

This folder contains day20_orchestrator.py, which illustrates an orchestrator with:

Pydantic request/response schema.
In-memory prompt registry.
Model router by task and tenant tier.
Provider adapter protocol with mock providers.
Timeout, retry with backoff and fallback.
Exact cache with tenant namespace.
Per-tenant quota.
In-memory audit events.
Simple metrics endpoint.

Run locally:

cd lessions/day-20-llm-app-architecture-production
pip install fastapi uvicorn pydantic
uvicorn day20_orchestrator:app --reload --port 8000

Call the API:

curl -s http://127.0.0.1:8000/chat \
  -H 'content-type: application/json' \
  -d '{
    "tenant_id": "tenant_pro",
    "user_id": "user_123",
    "task": "extract",
    "message": "Khách bị tính phí hai lần sau khi nâng cấp gói.",
    "prompt_id": "support_triage",
    "prompt_version": "v1"
  }'

The goal of the skeleton is not to call a real model but to make the production boundaries explicit. When you replace the mock provider with OpenAI, Anthropic, Gemini, local vLLM or an internal provider, the orchestrator policy stays the same.

15. Can It Be Used in Production?

Yes, this architecture can be used in production if the following conditions are met:

All LLM calls go through the gateway/orchestrator or an internal interface with uniform policy.
Prompts, models, schemas and tools all have a version, owner, changelog, eval and rollback.
There is a timeout, retry limit, fallback policy, and a circuit breaker or degrade mode.
There are quotas and budgets per tenant/user/team.
Cache keys include tenant, prompt version, schema version, model id and permission context.
Raw PII is not logged by default; redaction, retention and access control are clearly defined.
Tool execution has an allowlist, auth, least privilege, and idempotency for side effects.
Observability measures latency, token, cost, retry, fallback, cache hit and error rate.
Changes to prompt/model/provider must pass the golden set and a canary before wide rollout.

Do not call it production-ready if it is only an endpoint that calls the provider SDK directly, with no timeout, no audit, no quota, no prompt version and no idea of cost/request.

16. End-of-Lesson Checklist

Document

1. Architecture Decision Template

Use this template when designing or reviewing an LLM feature before it goes to production.

# LLM Architecture Decision

## Context

- Feature:
- Owner:
- Users:
- Tenant tiers:
- Data sensitivity:
- Expected traffic:
- p95 latency target:
- Monthly cost budget:

## Task profile

- Task type: chat / extraction / classification / reasoning / RAG / agent
- Input size:
- Output size:
- Requires tool calling:
- Requires structured output:
- Realtime or async:

## Prompt and schema

- Prompt ID:
- Prompt version:
- Schema version:
- Golden set:
- Eval metric:
- Rollback prompt version:

## Model routing

- Primary model:
- Fallback model:
- Local or hosted:
- Routing signals:
- Tenant restrictions:
- Data policy:

## Reliability

- API deadline:
- Provider timeout:
- Max retry attempts:
- Backoff:
- Fallback condition:
- Circuit breaker:
- Queue/deadline:

## Multi-tenancy

- Tenant source:
- Cache namespace:
- Quota policy:
- Tool permission:
- Audit partition:

## Cost controls

- Max input tokens:
- Max output tokens:
- Budget per tenant:
- Alert threshold:
- Degrade behavior:

## Observability

- Metrics:
- Logs:
- Traces:
- Dashboard:
- Alerts:

## Production decision

- Can be used in production:
- Required conditions:
- Known risks:
- Rollback plan:
- Final decision:

2. Component Responsibility Matrix

Component	Must do	Must not do
API Gateway	Auth, request size, tenant resolution, coarse rate limit	Build prompts or call providers directly when business policy is complex
Orchestrator	Enforce prompt/model/cache/quota/retry/fallback policy	Ignore tenant context or log raw PII by default
Prompt Registry	Version, owner, changelog, eval metadata	Store anonymous prompts that cannot be rolled back
Model Router	Choose model by task, tier, SLO, cost, policy	Hardcode the model in each endpoint
Provider Adapter	Normalize SDK, error, timeout, usage	Expose provider-specific detail to the business layer
Cache	Namespace by tenant, version, permission	Cache sensitive responses without ACL
Audit Log	Record traceability metadata and policy decisions	Serve as a replacement for metrics or traces
Observability	Measure latency, token, cost, retry, fallback, cache hit	Log only the response text and call it enough

3. Prompt Registry Checklist

Has a stable prompt_id.
Has a version, semantic or incremental.
Has an owner/team.
Has expected input variables.
Has an output schema version if structured output is used.
Has compatible models.
Has an eval score on the golden set.
Has a short changelog.
Has a rollout status.
Has a rollback version.
Cache keys and trace logs include prompt metadata.

4. Model Router Policy Example

models:
  fast_extractor:
    provider: mock_a
    model_id: fast-extract-v1
    max_output_tokens: 512
    cost_per_1k_tokens_usd: 0.0002
  strong_reasoner:
    provider: mock_b
    model_id: strong-reason-v1
    max_output_tokens: 2048
    cost_per_1k_tokens_usd: 0.0030
  fallback_balanced:
    provider: mock_c
    model_id: fallback-balanced-v1
    max_output_tokens: 1024
    cost_per_1k_tokens_usd: 0.0010

routing:
  extract:
    primary: fast_extractor
    fallback: fallback_balanced
  reasoning:
    primary: strong_reasoner
    fallback: fallback_balanced
  chat:
    primary: fallback_balanced
    fallback: fast_extractor

tenant_tiers:
  free:
    allowed_models: [fast_extractor]
    daily_budget_usd: 1
  pro:
    allowed_models: [fast_extractor, fallback_balanced]
    daily_budget_usd: 20
  enterprise:
    allowed_models: [fast_extractor, fallback_balanced, strong_reasoner]
    daily_budget_usd: 500

In real production, this policy usually lives in a config service or a database with an audit trail, rather than being hardcoded ad hoc.

5. Reliability Defaults

Setting	Suggested default	Reason
API deadline realtime	5-10s	Avoids requests hanging too long
Provider timeout	2-6s	Smaller than the API deadline to leave room for fallback
Retry attempts	1-2	More retries raise p95 and cost
Backoff	100-500ms + jitter	Reduces thundering herd
Max output tokens	Per task	Blocks cost spikes and latency spikes
Queue deadline	Per business SLA	Jobs past the deadline should fail/degrade
Circuit open threshold	5-10 consecutive errors	Stops calls to a provider that keeps failing

Retry should apply to transient errors. Do not blindly retry validation errors, policy blocks or tool side effects that lack idempotency.
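The "circuit open threshold" row can be illustrated with a minimal consecutive-failure circuit breaker. This is a simplified sketch: the class name, the `reset_after_s` cool-down and the injectable `clock` are assumptions, and it is not safe for concurrent use without extra locking.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows a probe after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cool-down has passed, then let a probe through.
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the circuit again

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed probe
```

The orchestrator would call `allow_request()` before each provider call and skip straight to the fallback when it returns False, so a failing provider stops consuming timeout budget.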

6. Cache Key Reference

The exact prompt cache key should carry the full context:

sha256(
  tenant_id
  + user_permission_hash
  + prompt_id
  + prompt_version
  + schema_version
  + model_id
  + task
  + normalized_input
)

Avoid:

sha256(user_message)

That key can leak across tenants and can return results for the wrong prompt version, model or permission.
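A runnable version of the recommended key might look like the sketch below. The function name `exact_cache_key` and the normalization steps (Unicode NFC plus whitespace collapsing) are assumptions for this example; the right normalization depends on the task. The fields are serialized as a JSON list rather than concatenated, so that adjacent values cannot blur into each other ("ab" + "c" versus "a" + "bc").

```python
import hashlib
import json
import unicodedata


def exact_cache_key(
    *,
    tenant_id: str,
    user_permission_hash: str,
    prompt_id: str,
    prompt_version: str,
    schema_version: str,
    model_id: str,
    task: str,
    user_input: str,
) -> str:
    """Build an exact-match cache key that carries the full request context."""
    # Assumed normalization: canonical Unicode form and collapsed whitespace.
    normalized_input = " ".join(unicodedata.normalize("NFC", user_input).split())
    payload = json.dumps(
        [
            tenant_id,
            user_permission_hash,
            prompt_id,
            prompt_version,
            schema_version,
            model_id,
            task,
            normalized_input,
        ],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Keyword-only arguments make it hard to omit a field by accident, which is the failure this section warns about.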

7. Audit Event Schema

{
  "event_type": "llm_request_completed",
  "trace_id": "uuid",
  "tenant_id": "tenant_pro",
  "user_id_hash": "sha256-prefix",
  "task": "extract",
  "prompt_id": "support_triage",
  "prompt_version": "v1",
  "schema_version": "ticket_triage.v1",
  "provider": "mock-fast",
  "model": "fast-extract-v1",
  "cache_hit": false,
  "retry_count": 1,
  "fallback_used": false,
  "input_tokens": 128,
  "output_tokens": 64,
  "estimated_cost_usd": 0.00004,
  "latency_ms": 421.7,
  "policy_decision": "allow",
  "error_code": null
}

The audit log should be append-only. If data must be deleted under a privacy policy, design retention and redaction explicitly from the start.
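As a hedged illustration, an event with the shape above could be built by a helper like the one below, which hashes the user id and never receives the raw message at all. `hash_user_id`, `build_audit_event` and the `salt` parameter are hypothetical; only a subset of the schema's fields is shown to keep the sketch short, and a real system would load the salt from a secret store.

```python
import hashlib
import uuid
from typing import Optional


def hash_user_id(user_id: str, salt: str) -> str:
    """Salted SHA-256 prefix, so events can be correlated without the raw id."""
    return hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()[:16]


def build_audit_event(
    *,
    tenant_id: str,
    user_id: str,
    task: str,
    prompt_id: str,
    prompt_version: str,
    provider: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
    estimated_cost_usd: float,
    latency_ms: float,
    cache_hit: bool = False,
    retry_count: int = 0,
    fallback_used: bool = False,
    error_code: Optional[str] = None,
    salt: str = "demo-salt",  # assumption: replaced by a managed secret in practice
) -> dict:
    # The raw prompt and response are deliberately not parameters of this function.
    return {
        "event_type": "llm_request_completed",
        "trace_id": str(uuid.uuid4()),
        "tenant_id": tenant_id,
        "user_id_hash": hash_user_id(user_id, salt),
        "task": task,
        "prompt_id": prompt_id,
        "prompt_version": prompt_version,
        "provider": provider,
        "model": model,
        "cache_hit": cache_hit,
        "retry_count": retry_count,
        "fallback_used": fallback_used,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "estimated_cost_usd": estimated_cost_usd,
        "latency_ms": latency_ms,
        "error_code": error_code,
    }
```

Keeping the raw text out of the function signature is a simple way to make "no raw PII by default" a property of the code rather than a convention.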

8. Metrics Checklist

llm_requests_total{tenant_tier, task, provider, model, status}.
llm_latency_ms{task, provider, model}.
llm_provider_errors_total{provider, error_type}.
llm_retries_total{provider, task}.
llm_fallbacks_total{task, from_model, to_model}.
llm_cache_hits_total{task, cache_type}.
llm_input_tokens_total{tenant_id, task, model}.
llm_output_tokens_total{tenant_id, task, model}.
llm_estimated_cost_usd_total{tenant_id, task, model}.
llm_quota_rejections_total{tenant_id, reason}.

If you use OpenTelemetry, create separate spans for prompt.build, cache.lookup, provider.generate, tool.call, output.validate and audit.write.

9. Security Checklist

Secrets live in a secret manager or the environment, never hardcoded.
Provider API keys have a scope and a rotation plan.
Raw prompts/responses are not logged by default.
PII redaction or data classification is in place.
Prompt injection is handled at the policy/tool layer, not by the prompt alone.
Tool allowlist is explicit.
Tool write operations have an idempotency key.
Tool service checks tenant permission.
Cache is namespaced by tenant.
Audit log has access control and retention.
Provider data retention policy has been reviewed.

10. Production Readiness Rubric

Level	Description	Use in production?
Level 0	Endpoint calls the provider SDK directly, no timeout/quota/audit	No
Level 1	Has timeout, schema validation, basic logging	Internal low-risk only
Level 2	Has orchestrator, prompt version, quota, safe cache, retry/fallback	Small production is possible
Level 3	Has full observability, golden set, canary, rollback, cost dashboard	Good for production
Level 4	Multi-provider/local fallback, circuit breaker, tenant budgets, incident runbook	Enterprise production

11. Review Findings for the Old Day 20 Version

Vietnamese text had no diacritics and did not meet the course's readability requirement.
File was flat, not yet split into lession.md, document.md, exercise.md.
A FastAPI skeleton existed but only inside markdown, with no directly runnable script.
The difference between orchestrator and gateway was not explained enough.
Checklists for production readiness, cost controls, tenant quota and observability metrics were incomplete.
No step-by-step exercise for learners to verify retry, timeout, fallback, cache hit, audit log and quota themselves.
The conditions for "can this be used in production" were not answered clearly enough.

Exercises

Lab Objectives

Complete this lab to end up with a production-style FastAPI skeleton for an LLM app, even though the provider is currently a mock. After the lab, you should be able to demonstrate that:

Requests go through the orchestrator instead of calling the provider directly.
Prompts are fetched by prompt_id and prompt_version.
The model router chooses a provider by task and tenant tier.
Timeout, retry and fallback are in place.
The exact cache does not leak between tenants.
Quota/rate limit is enforced per tenant.
Audit events and metrics metadata are produced.
You have a clear answer to: what must change for real production.

Environment Requirements

cd lessions/day-20-llm-app-architecture-production
pip install fastapi uvicorn pydantic

No API key is needed because the lab uses mock providers.

Exercise 1: Run the Service

uvicorn day20_orchestrator:app --reload --port 8000

Check health:

curl -s http://127.0.0.1:8000/health

Expected result:

{"status":"ok"}

Record:

Does the service start without errors?
Does the /health endpoint return status ok?
Which readiness checks would you add if you replaced the mock provider with a real one?

Exercise 2: Call the `extract` Task

curl -s http://127.0.0.1:8000/chat \
  -H 'content-type: application/json' \
  -d '{
    "tenant_id": "tenant_pro",
    "user_id": "user_123",
    "task": "extract",
    "message": "Khách bị tính phí hai lần sau khi nâng cấp gói.",
    "prompt_id": "support_triage",
    "prompt_version": "v1",
    "max_output_tokens": 256
  }'

Check that the response contains these fields:

trace_id.
answer.
provider.
model.
cache_hit.
fallback_used.
retry_count.
latency_ms.
estimated_cost_usd.
prompt_id.
prompt_version.

Questions:

Why does the response need a trace_id?
Why should prompt_version appear in the response or the trace?
Why must max_output_tokens have an upper bound?

Exercise 3: Check Cache Hit

Repeat the exact request from Exercise 2.

Expected result:

First call: cache_hit=false.
Second call: cache_hit=true.
latency_ms on the second call is significantly lower.

Try changing tenant_id to tenant_enterprise while keeping the same message.

Questions:

Why should the cache not hit across two tenants?
If the response depends on the user's permissions, what must be added to the cache key?
Why does the cache key need prompt_version and model?

Exercise 4: Check Routing by Task

Call the reasoning task:

curl -s http://127.0.0.1:8000/chat \
  -H 'content-type: application/json' \
  -d '{
    "tenant_id": "tenant_enterprise",
    "user_id": "user_456",
    "task": "reasoning",
    "message": "Hãy phân tích trade-off giữa single provider và multi-provider cho app support enterprise.",
    "prompt_id": "assistant",
    "prompt_version": "v1",
    "max_output_tokens": 512
  }'

Record:

Which model was chosen?
Why is the enterprise tenant allowed a stronger model?
If a free tenant calls the reasoning task, should the system degrade to a small model or reject? Why?

Exercise 5: Check Fallback

The skeleton has a debug endpoint to switch on failures in the mock provider:

curl -s http://127.0.0.1:8000/debug/provider/mock-fast/fail \
  -H 'content-type: application/json' \
  -d '{"fail": true}'

Call the extract task again with a new message to avoid the cache:

curl -s http://127.0.0.1:8000/chat \
  -H 'content-type: application/json' \
  -d '{
    "tenant_id": "tenant_pro",
    "user_id": "user_123",
    "task": "extract",
    "message": "Khách muốn hủy gói vì không dùng tính năng analytics.",
    "prompt_id": "support_triage",
    "prompt_version": "v1"
  }'

Expected result:

fallback_used=true.
retry_count is greater than 0.
provider is not mock-fast.

Turn the failure off:

curl -s http://127.0.0.1:8000/debug/provider/mock-fast/fail \
  -H 'content-type: application/json' \
  -d '{"fail": false}'

Questions:

How can fallback make the response differ from the primary?
Why does fallback need a golden set regression test?
When should fallback return a degraded response instead of calling another model?

Exercise 6: Check Quota

Call the metrics endpoint:

curl -s http://127.0.0.1:8000/metrics

Observe:

Total requests.
Cache hits.
Fallback count.
Estimated cost.
Quota usage per tenant.

Try lowering the quota in code or sending many requests to hit the limit. When the quota is exceeded, the API must return 429.

Questions:

Should quota be counted in requests, tokens or USD?
Why should free/pro/enterprise tenants have different quotas?
When the budget is nearly used up, should you alert or hard fail immediately?

Exercise 7: Review Audit Log

Call:

curl -s http://127.0.0.1:8000/audit

Check that each event has:

trace_id.
tenant_id.
user_id_hash.
task.
prompt_id.
model.
latency_ms.
cache_hit.
retry_count.
fallback_used.
estimated_cost_usd.
status.

Questions:

Why should the audit log not store the raw message by default?
Which fields help debug a cost spike?
Which fields help debug a provider outage?

Exercise 8: Replace the Mock Provider with a Real Provider

You do not need to do this today if you do not have an API key yet. Write the design first:

## Provider Adapter Plan

- Provider:
- SDK:
- Secret source:
- Timeout:
- Retryable errors:
- Non-retryable errors:
- Token usage field:
- Streaming support:
- Data retention policy:
- Fallback provider:
- Test cases:

Minimum conditions before calling a real provider:

API key is read from the environment or a secret manager.
Timeout is mandatory.
Raw prompts are not logged.
Token usage is parsed.
Provider errors are mapped to internal error types.
Adapter is unit tested with fake responses.

Final Deliverable

Create a short notes file, for example day20_architecture_decision.md, that answers:

Which components does your architecture include?
What metadata does the prompt registry store?
By what rules does the router choose a model?
What is the timeout/retry/fallback policy?
What are the components of the cache key?
What do quota and cost budget per tenant look like?
What does the audit log store, and what does it not store?
Which metrics does the production dashboard need?
Can it be used in production? If so, under what conditions?

Expected Answer at Senior SE Level

A good answer does not just say "there is retry and cache". It must state:

The maximum number of retries, which errors are retried, and the overall deadline.
How the cache key avoids tenant leaks.
What regression risk fallback carries.
How the model router reduces cost while preserving quality.
How the audit log supports debugging without violating privacy.
That cost controls sit in request validation, router, quota, cache and alerts.
That production readiness depends on eval, observability, security and rollback, not on whether the endpoint runs on a laptop.
