Day 48: Capstone Architecture Review + Backend/API

Objectives

After this lesson, you should have a capstone backend/API clear enough for a reviewer to understand and run:

Lock the capstone scope: Vietnamese Enterprise Knowledge Assistant.
Review the architecture from ingestion, retrieval, generation, citation, permission, and observability through to evaluation.
Standardize the repo structure so the project looks like a production-style system.
Design the API contract for document upload, ingestion, query/chat, traces, feedback, and eval.
Separate the configuration boundary; do not hard-code model, index, or token budget.
Know the readiness gate before moving on to UI/monitoring on Day 49.
Be able to answer: can this backend be used in production yet, and under what conditions?

TL;DR

Day 48 is the day the separate lessons turn into a capstone with a clear architecture. The goal is not to keep adding features, but to close the scope and build a backend/API with good boundaries: an ingestion path, a query path, citations, a permission filter, config, tracing, and an eval hook. A good portfolio demonstrates engineering decisions, not just a chatbot demo that answers a few questions.

1. Capstone Scope

Suggested name:

Vietnamese Enterprise Knowledge Assistant

Problem statement:

Internal enterprise documents are usually scattered across PDFs, Markdown, wikis, and policy files. Keyword search is weak for Vietnamese, and a raw LLM easily hallucinates or leaks data. The system needs to answer questions over Vietnamese documents with citations, permission-aware retrieval, evaluation, and monitoring.

Core features:

Upload/ingest PDF/Markdown/Text documents.
Parse documents and normalize text.
Chunk by page/section/heading.
Vietnamese/multilingual embeddings.
Vector DB: Qdrant or pgvector.
Sparse retrieval: BM25.
Hybrid search + RRF merge.
Reranking.
Chat/query API.
Citations by source/page/section/chunk.
Permission-aware retrieval.
Trace latency/token/cost.
Evaluation with a golden dataset.
Local deploy with Docker Compose.

Non-goals for the capstone:

Full enterprise SSO.
Complex multi-agent setups.
Perfect UI.
Distributed Kubernetes production.
Fine-tuning a new model.
Full document lifecycle/legal retention.

A good scope is one you can demo in 3-5 minutes and defend in an interview.

2. Overall Architecture

Frontend
  -> Backend API
      -> Auth/Tenant Context
      -> Document Service
      -> Ingestion Pipeline
          -> File Validator
          -> Parser
          -> Normalizer
          -> Chunker
          -> Metadata Enricher
          -> Embedding Client
          -> Vector DB / BM25 Index
      -> RAG Orchestrator
          -> Request Validator
          -> Query Normalizer
          -> Dense Retriever
          -> Sparse Retriever
          -> RRF Merger
          -> Reranker
          -> Context Builder
          -> LLM Gateway
          -> Citation Validator
          -> Guardrails
      -> Trace Store
      -> Feedback Store
      -> Eval Runner

Separate three paths:

Path	Purpose	Main failure mode
Indexing path	Parse, chunk, embed, index	Duplicate chunks, stale index, bad metadata
Query path	Retrieve, rerank, generate, cite	Hallucination, invalid citation, timeout
Eval path	Replay golden set, report metrics	Non-reproducible run, missing trace

3. Repo Structure

A suggested layout that is production-style but still manageable for a capstone:

enterprise-rag-assistant/
  apps/
    api/
      app/
        main.py
        config.py
        schemas.py
        routes/
        services/
      tests/
    web/
  packages/
    rag/
      chunking.py
      retrieval.py
      reranking.py
      context.py
      citations.py
    llm/
      gateway.py
      prompts/
    eval/
      runner.py
      metrics.py
    observability/
      tracing.py
  data/
    raw/
    processed/
    eval/
  scripts/
    ingest.py
    evaluate.py
  docker-compose.yml
  .env.example
  README.md

Important boundaries:

The API only receives requests, validates them, and calls services.
The RAG core does not depend on the web framework.
The LLM gateway hides the specific provider.
The eval runner can call the API or the pipeline directly.
Observability is not woven too deeply into business logic.
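
One way to picture the LLM gateway boundary is a small interface that the RAG core depends on, with provider adapters behind it. The sketch below is only an illustration: `LLMGateway`, `LLMResult`, `EchoGateway`, and `answer_question` are hypothetical names, and `EchoGateway` is a stand-in that echoes the prompt instead of calling a real provider.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class LLMResult:
    text: str
    input_tokens: int
    output_tokens: int


class LLMGateway(Protocol):
    """Hypothetical interface the RAG core depends on instead of a vendor SDK."""

    def generate(self, prompt: str, *, max_output_tokens: int) -> LLMResult: ...


class EchoGateway:
    """Stand-in provider for tests; a real adapter would call the provider here."""

    def generate(self, prompt: str, *, max_output_tokens: int) -> LLMResult:
        words = prompt.split()
        reply = " ".join(words[:max_output_tokens])
        # Word counts are a crude stand-in for real token counts.
        return LLMResult(text=reply, input_tokens=len(words), output_tokens=len(reply.split()))


def answer_question(gateway: LLMGateway, question: str, context: str) -> LLMResult:
    # The RAG core builds the prompt and never imports a provider SDK directly.
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return gateway.generate(prompt, max_output_tokens=64)
```

With this shape, swapping providers means writing another class with the same `generate` method; `packages/rag` would not need to change.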

4. Backend/API Contract

Minimum endpoints:

Method	Path	Purpose
`GET`	`/health`	Process alive
`GET`	`/ready`	Dependency/model/index ready
`POST`	`/documents/upload`	Upload file
`POST`	`/documents/ingest`	Parse/chunk/embed/index
`GET`	`/documents`	List documents/status
`POST`	`/query`	Ask RAG
`POST`	`/feedback`	User feedback tied to a trace
`GET`	`/traces/{trace_id}`	Debug trace
`POST`	`/eval/run`	Run eval
`GET`	`/eval/runs/{run_id}`	Fetch eval result

Query request:

{
  "question": "Nhân viên được nghỉ phép năm bao nhiêu ngày?",
  "tenant_id": "demo",
  "user_id": "reviewer",
  "roles": ["employee"],
  "conversation_id": "demo-session-001"
}

Query response:

{
  "answer": "Nhân viên full-time được nghỉ 12 ngày phép năm theo chính sách HR. [S1]",
  "citations": [
    {
      "source_id": "S1",
      "doc_id": "hr_policy_001",
      "title": "Chính sách nhân sự",
      "chunk_id": "hr_policy_001:v1:0007",
      "page": 4,
      "section": "Nghỉ phép năm"
    }
  ],
  "trace_id": "trace_20260510_001",
  "latency_ms": {
    "retrieve": 52,
    "rerank": 176,
    "generate": 1240,
    "total": 1530
  },
  "usage": {
    "input_tokens": 1180,
    "output_tokens": 96,
    "estimated_cost_usd": 0.0021
  }
}

5. Near-Production FastAPI Skeleton

A short example using FastAPI request/response models and Pydantic validation:

from typing import Annotated
from fastapi import FastAPI, File, HTTPException, UploadFile
from pydantic import BaseModel, Field

app = FastAPI(title="Vietnamese Enterprise Knowledge Assistant")


class QueryRequest(BaseModel):
    question: str = Field(min_length=3, max_length=2000)
    tenant_id: str = Field(min_length=1, max_length=64)
    user_id: str = Field(min_length=1, max_length=128)
    roles: list[str] = Field(default_factory=list, max_length=20)
    conversation_id: str | None = Field(default=None, max_length=128)


class Citation(BaseModel):
    source_id: str
    doc_id: str
    title: str | None = None
    chunk_id: str
    page: int | None = None
    section: str | None = None


class QueryResponse(BaseModel):
    answer: str
    citations: list[Citation]
    trace_id: str
    latency_ms: dict[str, int]
    usage: dict[str, int | float]


@app.get("/health")
def health() -> dict[str, str]:
    return {"status": "ok"}


@app.get("/ready")
def ready() -> dict[str, str]:
    # Check vector DB, embedding provider, index metadata and config.
    return {"status": "ready"}


@app.post("/documents/upload")
async def upload_document(file: Annotated[UploadFile, File()]) -> dict[str, str]:
    if file.content_type not in {"application/pdf", "text/plain", "text/markdown"}:
        raise HTTPException(status_code=415, detail="Unsupported file type")
    return {"filename": file.filename or "unknown", "status": "accepted"}


@app.post("/query", response_model=QueryResponse)
def query(request: QueryRequest) -> QueryResponse:
    # In production, call RAG service and return validated response.
    return QueryResponse(
        answer="Không đủ thông tin trong tài liệu hiện có.",
        citations=[],
        trace_id="trace_demo",
        latency_ms={"retrieve": 0, "rerank": 0, "generate": 0, "total": 0},
        usage={"input_tokens": 0, "output_tokens": 0, "estimated_cost_usd": 0.0},
    )

What production still needs:

Dependency injection for service clients.
Explicit timeout/retry for the provider.
Structured logging with PII redacted.
Request ID/trace ID middleware.
Rate limiting.
Real auth.
A consistent error format.

6. Configuration Boundary

Do not hard-code:

Model provider/model name.
Embedding model.
Reranker model.
Vector DB connection.
Chunk size/overlap.
Retrieval top-k.
Rerank top-k.
Context top-k.
Prompt version.
Index version.
Eval threshold.
Token budget.
Guardrail thresholds.

Sample Pydantic settings:

from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        extra="ignore",
    )

    app_env: str = "local"
    vector_db_url: str = "http://localhost:6333"
    llm_provider: str = "openai-compatible"
    llm_model: str = "gpt-4.1-mini"
    embedding_model: str = "text-embedding-model"
    chunk_size: int = Field(default=800, ge=200, le=3000)
    chunk_overlap: int = Field(default=120, ge=0, le=1000)
    vector_top_k: int = Field(default=50, ge=1, le=200)
    bm25_top_k: int = Field(default=50, ge=1, le=200)
    rerank_top_k: int = Field(default=20, ge=1, le=100)
    context_top_k: int = Field(default=6, ge=1, le=20)
    max_context_tokens: int = Field(default=6000, ge=500, le=32000)

.env.example should contain safe defaults and no real secrets.
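
To show how settings such as `context_top_k` and `max_context_tokens` might be consumed, here is a minimal context-builder sketch. It assumes chunks arrive already ranked, and it approximates tokens by counting whitespace-separated words, which is only a placeholder for the tokenizer of the configured model. `estimate_tokens` and `build_context` are hypothetical helper names.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: about one token per whitespace-separated word.
    # A real system would use the tokenizer of the configured model.
    return len(text.split())


def build_context(
    ranked_chunks: list[str],
    context_top_k: int,
    max_context_tokens: int,
) -> list[str]:
    """Keep chunks in rank order until either limit is reached."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        if len(selected) >= context_top_k:
            break
        cost = estimate_tokens(chunk)
        if used + cost > max_context_tokens:
            break  # stop rather than skip, so the kept chunks stay contiguous in rank
        selected.append(chunk)
        used += cost
    return selected
```

Because both limits come in as arguments, the same function can be driven by `Settings` in the API and by a different configuration in an eval run.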

7. Ingestion Pipeline

upload document
  -> validate file type/size
  -> store raw file
  -> parse
  -> normalize text
  -> chunk
  -> attach metadata
  -> embed
  -> upsert vector DB
  -> update BM25 index
  -> mark document status indexed

Minimum metadata:

{
  "tenant_id": "demo",
  "doc_id": "policy_001",
  "source_uri": "data/raw/policy.pdf",
  "title": "Chính sách nhân sự",
  "page": 3,
  "section": "Nghỉ phép năm",
  "acl_roles": ["employee"],
  "document_version": "v1",
  "index_version": "enterprise_docs_v1",
  "content_hash": "sha256:..."
}
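
The `tenant_id` and `acl_roles` fields are what make permission-aware retrieval possible. As a rough sketch, assuming chunk metadata is available as plain dictionaries shaped like the example above, a deny-by-default visibility check could look like this. `is_visible` and `filter_chunks` are hypothetical names; in a real deployment the same condition would usually be pushed into the vector DB query as a filter instead of applied in Python afterwards.

```python
def is_visible(chunk_meta: dict, tenant_id: str, user_roles: list[str]) -> bool:
    """A chunk is visible only inside its tenant and when at least one role overlaps."""
    if chunk_meta.get("tenant_id") != tenant_id:
        return False
    # Missing or empty acl_roles means nobody can see the chunk (deny by default).
    return bool(set(chunk_meta.get("acl_roles", [])) & set(user_roles))


def filter_chunks(chunks: list[dict], tenant_id: str, user_roles: list[str]) -> list[dict]:
    return [chunk for chunk in chunks if is_visible(chunk, tenant_id, user_roles)]
```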

Production concerns:

Ingestion must be idempotent.
Re-runs must not create duplicate chunks.
Documents have a status: uploaded, parsing, indexed, failed.
Store the error reason for debugging.
Do not index documents that violate the size/type policy.
ACL metadata must travel with each chunk.
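
A common way to get idempotent ingestion is to derive chunk IDs deterministically and skip chunks whose content hash has not changed. The sketch below assumes the `doc_id:version:position` ID shape seen in the sample citation (`hr_policy_001:v1:0007`) and uses a plain dictionary as a stand-in for the vector DB and BM25 index. `content_hash`, `chunk_id`, and `upsert_chunks` are hypothetical helpers.

```python
import hashlib


def content_hash(text: str) -> str:
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_id(doc_id: str, document_version: str, position: int) -> str:
    # Deterministic ID: the same document version and position always map to the same key.
    return f"{doc_id}:{document_version}:{position:04d}"


def upsert_chunks(
    index: dict[str, dict],
    doc_id: str,
    document_version: str,
    chunks: list[str],
) -> int:
    """Write chunks into an in-memory stand-in for the index; return how many changed."""
    written = 0
    for position, text in enumerate(chunks):
        key = chunk_id(doc_id, document_version, position)
        digest = content_hash(text)
        existing = index.get(key)
        if existing is not None and existing["content_hash"] == digest:
            continue  # unchanged: skip re-embedding and re-writing
        index[key] = {"text": text, "content_hash": digest}
        written += 1
    return written
```

Running the same ingestion twice then rewrites nothing the second time, and the index size stays the same.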

8. Query Pipeline

question
  -> validate request
  -> tenant/ACL context
  -> normalize query
  -> BM25 top 50
  -> vector top 50
  -> RRF merge
  -> rerank top 20-50
  -> permission/context filter
  -> context top 5-8
  -> generate answer
  -> validate schema
  -> validate citation
  -> log trace
  -> return answer + citations + trace_id

Fallback:

Empty retrieval: return "not enough information".
Reranker timeout: fall back to the hybrid rank.
Invalid citation: retry once or refuse safely.
Provider timeout: return a retryable error with the trace ID.
Eval mode: store the full redacted trace.
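
For the citation check, one possible approach is to compare the `[S1]`-style markers in the answer with the returned citations and with the chunk IDs that were actually placed in the context. The sketch below assumes that marker format and the `source_id`/`chunk_id` fields from the query response; `validate_citations` is a hypothetical helper, and it deliberately does not flag answers without markers so that the no-answer fallback still passes.

```python
import re

# Matches markers such as [S1] or [S12] in the generated answer.
MARKER = re.compile(r"\[(S\d+)\]")


def validate_citations(
    answer: str,
    citations: list[dict],
    context_chunk_ids: set[str],
) -> list[str]:
    """Return a list of problems; an empty list means the citations check out."""
    problems: list[str] = []
    cited = set(MARKER.findall(answer))
    declared = {c["source_id"]: c["chunk_id"] for c in citations}
    for source_id in sorted(cited - set(declared)):
        problems.append(f"marker {source_id} has no citation entry")
    for source_id, cid in declared.items():
        if cid not in context_chunk_ids:
            problems.append(f"{source_id} points to chunk {cid} that was not in the context")
    return problems
```

A non-empty result is the signal for the "retry once or refuse safely" fallback.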

9. Trade-Offs And Best Solution

Decision	Option A	Option B	Best solution by context
Vector DB	Qdrant	pgvector	Qdrant is quick for a vector-first demo; pgvector fits a Postgres stack
Retrieval	Dense only	Hybrid	Hybrid for Vietnamese documents plus internal terminology
Rerank	No rerank	Cross-encoder rerank	Rerank the top 20-50 if the latency budget allows
API	Simple sync	Async/background jobs	Sync upload; async ingestion for large files
Auth	Demo roles	Real SSO/JWT	The capstone uses explicit roles; production needs real auth
Eval	API-level	Pipeline-level	Both: pipeline-level for fast debugging, API-level e2e before release
Observability	Logs	Traces + metrics	Per-request traces to debug the RAG layers

10. Performance And Capacity

Measure by stage:

Upload/parse time.
Chunk count per document.
Embedding throughput.
Vector upsert latency.
BM25 retrieval latency.
Vector retrieval latency.
Rerank latency.
LLM generation latency.
Total p50/p95 latency.
Token/cost per request.
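
Per-stage latency can be collected with a small timer that fills the same `latency_ms` shape used in the query response. This is a minimal sketch using only the standard library; `StageTimer` is a hypothetical helper, and the `time.sleep` call merely simulates work.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects per-stage latency in milliseconds for one request."""

    def __init__(self) -> None:
        self.latency_ms: dict[str, int] = {}
        self._start = time.perf_counter()

    @contextmanager
    def stage(self, name: str):
        started = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failed requests still have timings.
            self.latency_ms[name] = round((time.perf_counter() - started) * 1000)

    def finish(self) -> dict[str, int]:
        self.latency_ms["total"] = round((time.perf_counter() - self._start) * 1000)
        return dict(self.latency_ms)


timer = StageTimer()
with timer.stage("retrieve"):
    time.sleep(0.01)  # simulated retrieval work
latency = timer.finish()
```

The resulting dictionary can be attached to the trace and the response; p50/p95 would then be computed over many such traces.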

A reasonable demo budget:

Stage	Target
`/health`	< 50 ms
`/ready`	< 500 ms
Retrieval	< 300 ms
Rerank	< 1000 ms
Generate	< 5000 ms
Total query p95	< 7000 ms

If latency is too high:

Reduce rerank_top_k.
Reduce context_top_k.
Cache embeddings for common queries.
Use a cheaper/faster model for low-risk queries.
Move ingestion to a background queue.

11. Readiness Gate Before Day 49

12. Can It Be Used In Production?

It can serve as a production foundation, but the capstone build should not be called fully production-ready while it lacks auth, a security review, and real operations.

Conditions for production:

Real auth/tenant/ACL, enforced before retrieval.
Async, idempotent ingestion with retry and status.
Vector DB/index with backup and migration/versioning.
Secret management through a vault or env; never commit keys.
API with rate limiting, timeouts, structured errors, and observability.
Guardrails from Day 46 are integrated.
The eval gate from Day 47 runs before release.
Monitoring from Day 49 has alerts.
Rollback exists for prompt/model/index.

For a portfolio, a reasonable goal is "production-style": the architecture and code show the right boundaries, there is a local demo with metrics/eval/guardrails, and the limitations are stated plainly.

Reference Material

1. API Endpoint Checklist

Endpoint	Request validation	Response contract	Trace?	Notes
`GET /health`	None	`{"status":"ok"}`	No	Process alive
`GET /ready`	None	dependency statuses	Optional	Check vector DB/index/provider
`POST /documents/upload`	file type/size	upload status	Yes	No raw secret in logs
`POST /documents/ingest`	`doc_id`, tenant, options	job/status	Yes	Prefer async for large docs
`GET /documents`	tenant/role	document statuses	Yes	Filter by tenant
`POST /query`	question/tenant/user/roles	answer/citations/trace	Yes	Main RAG API
`POST /feedback`	trace/rating/reason	accepted	Yes	Tie feedback to trace
`GET /traces/{trace_id}`	auth/tenant	redacted trace	Yes	Debug view
`POST /eval/run`	eval set/version	run ID	Yes	Restrict access

2. Schema Snippets

Document Status

{
  "doc_id": "hr_policy_001",
  "title": "Chính sách nhân sự",
  "tenant_id": "demo",
  "status": "indexed",
  "document_version": "v1",
  "chunk_count": 128,
  "index_version": "enterprise_docs_v1",
  "created_at": "2026-05-10T10:00:00Z",
  "updated_at": "2026-05-10T10:05:00Z"
}

Trace

{
  "trace_id": "trace_20260510_001",
  "tenant_id": "demo",
  "prompt_version": "rag_prompt_v3",
  "model": "gpt-4.1-mini",
  "embedding_model": "embedding-v1",
  "index_version": "enterprise_docs_v1",
  "retrieval": {
    "bm25_top_k": 50,
    "vector_top_k": 50,
    "rerank_top_k": 20,
    "context_top_k": 6
  },
  "latency_ms": {
    "retrieve": 52,
    "rerank": 176,
    "generate": 1240,
    "total": 1530
  },
  "usage": {
    "input_tokens": 1180,
    "output_tokens": 96,
    "estimated_cost_usd": 0.0021
  },
  "guardrails": {
    "pii_detected": false,
    "citation_valid": true,
    "policy_action": "allow"
  }
}

3. `.env.example` Template

APP_ENV=local
API_PORT=8000

VECTOR_DB_URL=http://localhost:6333
VECTOR_COLLECTION=enterprise_docs

LLM_PROVIDER=openai-compatible
LLM_MODEL=gpt-4.1-mini
LLM_API_KEY=change-this-in-local-env

EMBEDDING_MODEL=text-embedding-model
RERANKER_MODEL=cross-encoder-model

CHUNK_SIZE=800
CHUNK_OVERLAP=120
BM25_TOP_K=50
VECTOR_TOP_K=50
RERANK_TOP_K=20
CONTEXT_TOP_K=6
MAX_CONTEXT_TOKENS=6000

PROMPT_VERSION=rag_prompt_v1
INDEX_VERSION=enterprise_docs_v1

4. Architecture Review Questions

Is ingestion idempotent?
Does chunk metadata include tenant_id, acl_roles, doc_id, page, section, and document_version?
Does the permission filter run before or after retrieval?
If the vector DB is down, what does /ready return?
If the reranker times out, how does the query pipeline fall back?
Does the citation validator check by chunk_id or only by text?
Does the trace include prompt/model/index versions?
Does the eval runner call the API or the pipeline directly?
Is there a rollback path for prompt/model/index?
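
For the `/ready` question, one reasonable answer is to return HTTP 503 with a per-dependency breakdown whenever any probe fails. The sketch below is framework-independent and uses hypothetical names (`check_readiness`, and probes passed in as plain callables); real probes would ping the vector DB, the embedding provider, and the index metadata.

```python
from typing import Callable


def check_readiness(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Run each dependency probe; return (HTTP status code, response body)."""
    results: dict[str, str] = {}
    for name, probe in checks.items():
        try:
            results[name] = "ok" if probe() else "failed"
        except Exception as exc:  # a probe that raises counts as a failed dependency
            results[name] = f"error: {type(exc).__name__}"
    ready = all(value == "ok" for value in results.values())
    body = {"status": "ready" if ready else "not_ready", "dependencies": results}
    return (200 if ready else 503), body


def vector_db_down() -> bool:
    # Simulated outage for illustration only.
    raise ConnectionError("vector DB unreachable")


status_code, body = check_readiness({"vector_db": vector_db_down, "index": lambda: True})
```

Keeping `/health` independent of these probes lets an orchestrator tell "process alive" apart from "able to serve queries".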

5. Common Architecture Mistakes

The API route contains all the RAG logic.
Retrieval does not filter by tenant/role.
Index/chunking/prompt are not versioned.
No distinction between the ingestion path and the query path.
No /ready, only /health.
No trace ID in the response.
Citations are not validated.
Model/top-k/token budget are hard-coded.
No no-answer fallback.
The demo uses real secrets in .env or in the video.

6. Definition Of Done For Day 48

Clear folders/documentation for the capstone backend.
An API contract complete enough for the Day 49 frontend.
A config boundary and .env.example.
An architecture diagram or text diagram.
Ingestion/query/eval paths.
A readiness checklist before the UI.
Limitations and production conditions.

Exercises

Objectives

You will turn the capstone from an idea into a backend/API contract that can be built and reviewed.

Deliverables:

Architecture diagram as text or an image.
Repo structure.
.env.example.
FastAPI skeleton or an equivalent backend.
API contract for ingestion/query/trace/eval.
Readiness checklist.

Exercise 1: Lock The Scope

Write docs/scope.md:

# Scope

## Problem

## Users

## Core Features

## Non-Goals

## Demo Flow

## Risks

## Success Criteria

Success criteria must be measurable, for example:

The demo query returns an answer with a citation.
The no-answer case does not hallucinate.
The 30-case eval set runs.
The trace shows latency/token/cost.

Exercise 2: Draw The Architecture

Create docs/architecture.md with:

Frontend
  -> Backend API
      -> Auth/Tenant Context
      -> Ingestion Pipeline
      -> RAG Orchestrator
      -> Trace Store
      -> Eval Runner

After the diagram, explain the three paths:

Indexing path.
Query path.
Eval path.

Exercise 3: Create API Schemas

Create apps/api/app/schemas.py:

from pydantic import BaseModel, Field


class QueryRequest(BaseModel):
    question: str = Field(min_length=3, max_length=2000)
    tenant_id: str = Field(min_length=1, max_length=64)
    user_id: str = Field(min_length=1, max_length=128)
    roles: list[str] = Field(default_factory=list, max_length=20)
    conversation_id: str | None = Field(default=None, max_length=128)


class Citation(BaseModel):
    source_id: str
    doc_id: str
    title: str | None = None
    chunk_id: str
    page: int | None = None
    section: str | None = None


class QueryResponse(BaseModel):
    answer: str
    citations: list[Citation]
    trace_id: str
    latency_ms: dict[str, int]
    usage: dict[str, int | float]

Exercise 4: Create The Backend Skeleton

Create apps/api/app/main.py:

from fastapi import FastAPI, HTTPException
from .schemas import QueryRequest, QueryResponse

app = FastAPI(title="Vietnamese Enterprise Knowledge Assistant")


@app.get("/health")
def health() -> dict[str, str]:
    return {"status": "ok"}


@app.get("/ready")
def ready() -> dict[str, str]:
    return {"status": "ready"}


@app.post("/query", response_model=QueryResponse)
def query(request: QueryRequest) -> QueryResponse:
    if not request.roles:
        raise HTTPException(status_code=403, detail="Missing roles")
    return QueryResponse(
        answer="Không đủ thông tin trong tài liệu hiện có.",
        citations=[],
        trace_id="trace_demo",
        latency_ms={"retrieve": 0, "rerank": 0, "generate": 0, "total": 0},
        usage={"input_tokens": 0, "output_tokens": 0, "estimated_cost_usd": 0.0},
    )

Run locally:

uvicorn apps.api.app.main:app --reload --port 8000

Exercise 5: Write `.env.example`

Required keys:

VECTOR_DB_URL.
LLM_PROVIDER.
LLM_MODEL.
LLM_API_KEY.
EMBEDDING_MODEL.
RERANKER_MODEL.
CHUNK_SIZE.
CHUNK_OVERLAP.
BM25_TOP_K.
VECTOR_TOP_K.
RERANK_TOP_K.
CONTEXT_TOP_K.
MAX_CONTEXT_TOKENS.
PROMPT_VERSION.
INDEX_VERSION.

Do not commit the real .env.

Exercise 6: Write The Readiness Gate

Create docs/day48_readiness.md:

# Day 48 Readiness

- [ ] Architecture diagram exists.
- [ ] API contract documented.
- [ ] `/health` works.
- [ ] `/ready` checks dependencies.
- [ ] `/query` returns answer/citations/trace_id.
- [ ] Ingestion design documented.
- [ ] Config boundary documented.
- [ ] Trace schema documented.
- [ ] Known limitations documented.

Submission Checklist

docs/scope.md.
docs/architecture.md.
Schemas for query/citation/trace.
A backend skeleton that runs.
.env.example.
API contract for the Day 49 UI.
Readiness checklist and limitations.
