Day 47: LLM Testing, Golden Sets, and CI/CD for Prompts/RAG

Objectives

After this lesson, you should be able to:

Build a golden set for a RAG app instead of manually testing a handful of questions.
Separate retrieval evaluation, generation evaluation, guardrail evaluation, and system evaluation.
Measure Recall@K, MRR@K, citation correctness, faithfulness, no-answer accuracy, and format pass rate.
Design CI gates for the prompt, chunking, embedding, reranker, LLM model, and context builder.
Understand where snapshot testing helps and where it does not.
Design a canary release, A/B testing, and a feedback loop for an LLM app.
Answer the question: is this test suite production-ready, and if not, which conditions are still missing?

TL;DR

An LLM/RAG system cannot be released on the feeling that "it looked fine when I chatted with it". The golden set is the regression test suite of an AI system. Every time you change the prompt, chunking, embedding model, reranker, retrieval top-k, LLM model, or guardrails, run an evaluation with versions, metrics, thresholds, and traces. CI cannot guarantee that answers stay identical, but it must guarantee that quality does not drop below the release gate.

1. Why Is Testing an LLM Different from Testing a Backend?

Traditional backend tests usually check deterministic input/output. An LLM adds more variables:

Model output is not fully deterministic.
The provider can update model behavior.
A small prompt change can cause a large output change.
Retrieval depends on the corpus, index, and chunking.
A correct answer can be phrased in many ways.
User feedback does not always reflect true quality.

LLM testing therefore needs several layers:

Layer	Question to answer
Unit tests	Are the parser, chunker, citation validator, and schema validator correct?
Retrieval eval	Does the query retrieve the right source/chunk?
Generation eval	Is the answer grounded and in the right format?
Guardrail eval	Do prompt injection, no-answer, or ACL cases fail?
End-to-end eval	Does the API response honor the contract, with latency/cost within budget?
Online monitoring	Does production traffic show drift, cost spikes, or rising thumbs-down rates?

2. What Is a Golden Set?

A golden set is a set of questions labeled in advance. Each record should carry the expected behavior, the expected source/chunk, and tags for analysis.

Sample record:

{
  "id": "q001",
  "question": "Nhân viên full-time được nghỉ phép năm bao nhiêu ngày?",
  "expected_answer": "Nhân viên full-time được nghỉ 12 ngày phép năm.",
  "expected_chunk_ids": ["hr_leave_policy:v1:0003"],
  "must_cite": ["hr_leave_policy"],
  "expected_behavior": "answer_with_citation",
  "tags": ["hr", "easy", "single-hop", "vietnamese"],
  "difficulty": "easy"
}

Recommended tags:

easy.
synonym.
multi-hop.
no-answer.
acl.
vietnamese.
english-mix.
prompt-injection.
stale-version.
format.
high-impact.

A golden set with only easy questions creates a false sense of safety. A rough split for the first 30 questions:

Group	Suggested count	Purpose
Normal single-hop	8	Baseline
Synonym/paraphrase	5	Search robustness
Multi-hop	4	Context composition
No-answer/out-of-scope	4	Hallucination resistance
ACL/permission	3	Data protection
Prompt injection	3	Security
Format/citation edge case	3	Contract

3. Retrieval Regression Test

Retrieval eval should be deterministic and should not need an LLM judge.

Metrics:

Metric	Meaning	When to use
`Hit@K`	Whether the top K contains at least one correct chunk	Smoke signal
`Recall@K`	Fraction of the relevant chunks that were retrieved	Multi-label relevance
`MRR@K`	Rank of the first correct chunk	Ranking quality
`nDCG@K`	Ranking quality with multi-level relevance scores	When labels are graded

Minimal implementation:

def recall_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    if not expected_ids:
        return 1.0
    hits = set(retrieved_ids[:k]).intersection(expected_ids)
    return len(hits) / len(expected_ids)


def mrr_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    for rank, chunk_id in enumerate(retrieved_ids[:k], start=1):
        if chunk_id in expected_ids:
            return 1.0 / rank
    return 0.0

Report by tag, not only in aggregate. For example, if overall Recall@5 is 0.85 but the acl tag fails, the release is still blocked.
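A per-tag breakdown can be sketched as below. This is a minimal illustration, not a fixed design: it assumes each result row carries a `tags` list and a numeric metric such as `recall_at_5`, and the `critical_floors` mapping (tag to minimum score) is a hypothetical name introduced here.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_tag(results: list[dict], metric: str) -> dict[str, float]:
    """Average one metric per tag; a case counts toward every tag it carries."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for row in results:
        for tag in row["tags"]:
            buckets[tag].append(row[metric])
    return {tag: mean(values) for tag, values in buckets.items()}


def failing_tags(by_tag: dict[str, float], critical_floors: dict[str, float]) -> list[str]:
    """Tags whose average is below their floor; a missing tag counts as failing."""
    return sorted(
        tag for tag, floor in critical_floors.items() if by_tag.get(tag, 0.0) < floor
    )


results = [
    {"id": "q001", "tags": ["hr", "easy"], "recall_at_5": 1.0},
    {"id": "q002", "tags": ["hr", "acl"], "recall_at_5": 0.0},
    {"id": "q003", "tags": ["it", "easy"], "recall_at_5": 1.0},
]
by_tag = summarize_by_tag(results, "recall_at_5")
blocked = failing_tags(by_tag, {"acl": 1.0})  # acl must be perfect in this example
```

Here the aggregate looks acceptable while `blocked` still names the acl tag, which is the situation the gate is meant to catch.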

4. Generation Regression Test

A prompt change can alter the wording, so do not snapshot a long full answer.

Test against a rubric instead:

Does the answer get the key facts right?
Is it grounded in the retrieved context?
Does it include the required citations?
Do the citations point to chunks in the context?
Does the output match the schema?
Does a no-answer case refuse correctly?
Does it avoid leaking PII, secrets, or the system prompt?
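Some of the rubric items above (schema, required citations, citations inside the context) can be checked with plain rules and no LLM judge. The sketch below assumes a response shaped like `{"answer": str, "citations": [{"chunk_id": str}]}` and chunk IDs of the form `doc_id:version:index`, as in the sample record; `check_response` is a hypothetical helper name.

```python
def check_response(response: dict, context_ids: set[str], must_cite: list[str]) -> dict[str, bool]:
    """Run deterministic rubric checks on one RAG response."""
    answer = response.get("answer")
    citations = response.get("citations")
    schema_ok = (
        isinstance(answer, str)
        and isinstance(citations, list)
        and all(isinstance(c, dict) and isinstance(c.get("chunk_id"), str) for c in citations)
    )
    cited_ids = [c["chunk_id"] for c in citations] if schema_ok else []
    return {
        "schema_ok": schema_ok,
        # Every citation must point at a chunk that was actually in the context.
        "citations_in_context": schema_ok and all(cid in context_ids for cid in cited_ids),
        # Assumption: a chunk ID starts with "<doc_id>:".
        "required_docs_cited": schema_ok and all(
            any(cid.startswith(f"{doc}:") for cid in cited_ids) for doc in must_cite
        ),
    }
```

Factual correctness, grounding, and leakage checks are not covered by these rules and still need a judge or human review.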

Scoring options:

Scoring method	Pros	Cons
Exact match	Cheap, deterministic	Too rigid for LLM output
Keyword/rule	Fast, easy to run in CI	Captures only limited quality
Embedding similarity	Flexible	Produces false positives
LLM-as-judge	Scales well for rubrics	Costly, judge drift
Human review	More accurate	Slow, does not scale

Best option by context:

CI smoke: schema, citation, no-answer, retrieval metrics, a few rule-based checks.
Nightly eval: full golden set + LLM-as-judge with a rubric + traces.
Release review: inspect top regressions by tag, human spot-check of high-impact cases.

5. Snapshot Testing

Snapshots work well for:

JSON response shape.
Citation format.
Error/refusal format.
Compiled prompt template output after secrets are redacted.
Tool call arguments.

Snapshots work poorly for:

Long free-form answers.
Highly nondeterministic output.
Wording that changes across providers/models.
Results that depend on time, random seed, or external state.

Practical rule: snapshot the contract, not the prose.
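One way to snapshot the contract rather than the prose is to reduce a response to its structure before comparing. The sketch below is an illustration under simplifying assumptions: `shape_of` is a hypothetical helper, and it describes a list by its first element only.

```python
def shape_of(value):
    """Reduce a JSON-like value to keys and type names, dropping the content."""
    if isinstance(value, dict):
        return {key: shape_of(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        # Simplification: assume list items share one shape and inspect the first.
        return [shape_of(value[0])] if value else []
    return type(value).__name__


response_a = {"answer": "Full-time staff get 12 days.", "citations": [{"chunk_id": "a:v1:0001"}]}
response_b = {"answer": "12 days of annual leave.", "citations": [{"chunk_id": "a:v1:0002"}]}
snapshot = shape_of(response_a)
```

Two answers with different wording produce the same snapshot, while a renamed key or a changed type produces a diff.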

6. CI/CD for Prompts/RAG

Suggested pipeline:

pull request
  -> lint / unit tests
  -> prompt template tests
  -> smoke eval 10-15 critical cases
  -> check thresholds
  -> block merge if a critical metric fails

nightly
  -> full eval 30-100+ cases
  -> generate report by tag
  -> compare baseline
  -> open an issue on regression

release
  -> full eval
  -> manual review top failures
  -> canary 5-10%
  -> monitor online metrics
  -> roll back if a guardrail is exceeded

Required metadata for every eval run:

eval_run_id.
eval_set_version.
corpus_version.
index_version.
chunking_version.
embedding_model.
retriever_config.
reranker_version.
prompt_version.
llm_model.
guardrail_version.
git_sha.

Without versioning these factors, you cannot tell where a regression came from.
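The metadata fields listed above can be carried as one immutable record per run, which makes it straightforward to see what changed between a baseline and a candidate. This is a sketch only: the field types, the `changed_fields` helper, and all values below are placeholders introduced for illustration.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalRunMetadata:
    eval_run_id: str
    eval_set_version: str
    corpus_version: str
    index_version: str
    chunking_version: str
    embedding_model: str
    retriever_config: str
    reranker_version: str
    prompt_version: str
    llm_model: str
    guardrail_version: str
    git_sha: str


def changed_fields(baseline: EvalRunMetadata, candidate: EvalRunMetadata) -> dict[str, tuple[str, str]]:
    """Return {field: (before, after)} for every versioned factor that differs."""
    before, after = asdict(baseline), asdict(candidate)
    return {
        name: (before[name], after[name])
        for name in before
        if name != "eval_run_id" and before[name] != after[name]
    }


# Placeholder values, not real versions.
baseline = EvalRunMetadata(
    eval_run_id="run-001", eval_set_version="gs-v1", corpus_version="corpus-v1",
    index_version="index-v1", chunking_version="chunk-v1", embedding_model="embed-a",
    retriever_config="top_k=5", reranker_version="rerank-v1", prompt_version="prompt-v3",
    llm_model="model-a", guardrail_version="guard-v1", git_sha="0000000",
)
candidate = replace(baseline, eval_run_id="run-002", prompt_version="prompt-v4")
diff = changed_fields(baseline, candidate)
```

If a regression appears between the two runs and `diff` contains a single field, that field is the first suspect.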

7. Threshold-Based Deployment

Sample thresholds:

recall_at_5: 0.80
mrr_at_10: 0.70
citation_correctness: 0.95
format_pass_rate: 0.98
no_answer_accuracy: 0.90
prompt_injection_block_rate: 1.00
acl_leak_count: 0
p95_latency_ms: 5000
estimated_cost_per_request_usd: 0.02

Block the deploy when:

acl_leak_count > 0.
A critical prompt injection case fails.
Citation correctness is below the threshold the domain requires.
Format pass rate is low enough to break API clients.
No-answer accuracy drops sharply.
Latency/cost exceeds the production budget.

Grant a CONDITIONAL PASS when:

Overall metrics pass but one non-critical tag drops slightly.
A mitigation or rollback plan exists.
The canary has capped traffic and clear monitoring.
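The block and conditional-pass rules above can be expressed as a small decision function. This is one possible reading, not the only one: it assumes the three conditional-pass conditions must all hold, and it treats a non-critical regression without a rollback plan or a capped canary as FAIL. The parameter names are hypothetical.

```python
def release_decision(
    critical_failures: list[str],
    noncritical_regressions: list[str],
    has_rollback_plan: bool,
    canary_is_capped: bool,
) -> str:
    """Map gate results to PASS, CONDITIONAL PASS, or FAIL."""
    if critical_failures:
        return "FAIL"
    if not noncritical_regressions:
        return "PASS"
    # Assumption: a conditional pass needs both a rollback plan and a capped canary.
    if has_rollback_plan and canary_is_capped:
        return "CONDITIONAL PASS"
    return "FAIL"
```

For example, `release_decision([], ["synonym tag -0.03"], True, True)` returns `"CONDITIONAL PASS"`, while any entry in `critical_failures` returns `"FAIL"` regardless of the other arguments.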

8. Canary, A/B Testing, and Feedback Loop

Canary:

Route 5-10% of traffic to the new prompt/model/index.
Monitor latency, cost, citation failures, thumbs down, and refusal rate.
Roll back if a metric crosses its threshold.
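Canary routing is often done with a stable hash of a user identifier so that the same user stays on the same version across requests. The sketch below shows that idea; the `in_canary` name and the salt value are assumptions, not part of any specific platform.

```python
import hashlib


def in_canary(user_id: str, percent: int, salt: str = "rag-canary-v1") -> bool:
    """Deterministically place a user in the canary group for `percent` of users."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent
```

Changing the salt reshuffles the buckets, which helps avoid always exposing the same users to every canary.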

A/B testing:

Compare prompts/models/routers using offline labels and user feedback.
Requires clear randomization or segmentation.
Do not move all users to the new version before it has passed the offline gate.

Feedback payload:

{
  "trace_id": "trace_20260510_001",
  "rating": "down",
  "reason": "wrong_source",
  "comment": "Answer đúng nhưng citation trỏ tài liệu cũ."
}

Feedback must be tied to a trace. If you store only the thumbs down without the retrieved chunks, prompt_version, and model_version, you cannot tell whether the failure came from the retriever, reranker, prompt, or model.

9. Performance and Cost

Eval can get expensive if you run full generation on every PR.

Strategy:

PRs run only a small smoke set, prioritizing critical cases.
Run retrieval eval more often because it is cheap and deterministic.
Run full generation eval nightly or before a release.
Cache retrieved results by index_version.
Use a cheap judge for a preliminary pass and human review for high-risk cases.
Limit concurrency to stay within the provider's rate limit.
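The concurrency limit from the list above can be sketched with an `asyncio.Semaphore`. This assumes the eval runner exposes an async per-case function; `run_cases`, `evaluate_one`, and `max_concurrency` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


async def run_cases(
    cases: list[dict],
    evaluate_one: Callable[[dict], Awaitable[dict]],
    max_concurrency: int = 4,
) -> list[dict]:
    """Evaluate all cases with at most `max_concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(case: dict) -> dict:
        async with semaphore:
            return await evaluate_one(case)

    # gather returns results in the same order as the input cases.
    return await asyncio.gather(*(guarded(case) for case in cases))
```

A call such as `asyncio.run(run_cases(cases, evaluate_one, max_concurrency=3))` keeps the result order aligned with the golden set, so rows can still be joined by position or by `id`.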

Eval operations metrics:

eval duration.
token cost per eval run.
judge agreement.
flaky case count.
retry count.
cases skipped due to timeout/provider error.

10. Can This Be Used in Production?

Yes, if evaluation is operated as a release gate rather than a demo file.

Minimum conditions:

The golden set has at least 30 cases covering normal, no-answer, ACL, prompt injection, citation, and format.
The eval runner stores traces and full version metadata.
Retrieval and generation are scored separately.
Thresholds are set per domain, not only on an aggregate score.
CI blocks critical regressions.
Nightly/full eval reports are reviewed.
Online feedback is tied to traces.
There is a process for updating the golden set when the corpus or product scope changes.

Do not claim production readiness if:

There is no versioned golden set.
Testing consists of a few manual questions.
There are no no-answer/ACL/security cases.
Prompt/model/index versions are not logged.
There is no rollback/canary plan.

End-of-Lesson Checklist

I have a golden set of at least 30 cases.
Each case has an expected behavior, expected chunks, and tags.
I measure retrieval metrics separately.
I measure format/citation/no-answer separately.
I have a threshold config for CI.
I have a report by tag.
I have a trace for every eval case.
I know when to block a deploy and when to canary.

Reference Material

1. Golden Set Schema

{
  "id": "q001",
  "question": "string",
  "expected_answer": "string|null",
  "expected_chunk_ids": ["chunk_id"],
  "must_cite": ["doc_id"],
  "expected_behavior": "answer_with_citation|no_answer|refuse|escalate",
  "tags": ["hr", "easy"],
  "difficulty": "easy|medium|hard",
  "notes": "optional reviewer note"
}

Validation rules:

id is unique.
question is not empty.
expected_behavior is one of the enum values.
expected_chunk_ids is required for answer_with_citation.
tags include at least one domain tag and one difficulty tag.
Do not put real PII into a public golden set.
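Most of the validation rules above can be enforced by a short script that runs before any eval. The sketch below is illustrative: the `DOMAIN_TAGS` set is a hypothetical example, since the lesson does not fix the list of domain tags, and the PII rule is not checked here because it needs human or tool-assisted review.

```python
ALLOWED_BEHAVIORS = {"answer_with_citation", "no_answer", "refuse", "escalate"}
DIFFICULTY_TAGS = frozenset({"easy", "medium", "hard"})
DOMAIN_TAGS = frozenset({"hr", "it", "finance"})  # hypothetical; use your own domains


def validate_records(records: list[dict], domain_tags: frozenset = DOMAIN_TAGS) -> list[str]:
    """Return a list of human-readable errors; an empty list means the set is valid."""
    errors: list[str] = []
    seen_ids: set[str] = set()
    for index, record in enumerate(records, start=1):
        record_id = record.get("id")
        label = record_id or f"row {index}"
        if not record_id:
            errors.append(f"{label}: missing id")
        elif record_id in seen_ids:
            errors.append(f"{label}: duplicate id")
        else:
            seen_ids.add(record_id)
        if not str(record.get("question") or "").strip():
            errors.append(f"{label}: empty question")
        behavior = record.get("expected_behavior")
        if behavior not in ALLOWED_BEHAVIORS:
            errors.append(f"{label}: unknown expected_behavior {behavior!r}")
        if behavior == "answer_with_citation" and not record.get("expected_chunk_ids"):
            errors.append(f"{label}: expected_chunk_ids required")
        tags = set(record.get("tags") or [])
        if not tags & domain_tags:
            errors.append(f"{label}: no domain tag")
        if not tags & DIFFICULTY_TAGS:
            errors.append(f"{label}: no difficulty tag")
    return errors
```

Running this in CI on every change under the golden-set directory keeps malformed records from silently skewing the metrics.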

2. Metric Formulas

def hit_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    return float(bool(set(retrieved_ids[:k]).intersection(expected_ids)))


def recall_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    if not expected_ids:
        return 1.0
    return len(set(retrieved_ids[:k]).intersection(expected_ids)) / len(expected_ids)


def mrr_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    for rank, chunk_id in enumerate(retrieved_ids[:k], start=1):
        if chunk_id in expected_ids:
            return 1.0 / rank
    return 0.0
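nDCG@K appears in the lesson's metrics table but has no formula here. A minimal sketch follows, using the linear-gain variant (gain equals the relevance label) and assuming the retrieved IDs contain no duplicates; the `relevance` mapping from chunk ID to graded label is an assumed input format.

```python
import math


def ndcg_at_k(retrieved_ids: list[str], relevance: dict[str, float], k: int) -> float:
    """nDCG@K with linear gain; `relevance` maps chunk ID to a graded label."""

    def dcg(gains: list[float]) -> float:
        # Position discount: rank 1 divides by log2(2) = 1, rank 2 by log2(3), ...
        return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))

    gains = [relevance.get(chunk_id, 0.0) for chunk_id in retrieved_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    if ideal == 0:
        return 0.0  # no relevant chunk is labeled for this query
    return dcg(gains) / ideal
```

With labels `{"a": 2, "b": 1}`, the ranking `["a", "b"]` scores 1.0, and the swapped ranking `["b", "a"]` scores about 0.86 because the more relevant chunk is discounted at rank 2.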

3. Eval Report Template

# RAG Evaluation Report

Date:
Git SHA:
Eval set version:
Corpus version:
Index version:
Prompt version:
LLM model:
Embedding model:
Reranker:

## Summary

| Metric | Current | Baseline | Threshold | Status |
|---|---:|---:|---:|---|
| Recall@5 |  |  |  |  |
| MRR@10 |  |  |  |  |
| Citation correctness |  |  |  |  |
| Format pass rate |  |  |  |  |
| No-answer accuracy |  |  |  |  |
| Prompt injection block rate |  |  |  |  |
| ACL leak count |  |  |  |  |
| p95 latency ms |  |  |  |  |

## Results By Tag

| Tag | Cases | Pass rate | Main failures |
|---|---:|---:|---|

## Top Regressions

| Case | Expected | Actual | Suspected layer | Owner |
|---|---|---|---|---|

## Release Decision

Decision: PASS / CONDITIONAL PASS / FAIL

Reason:
Mitigation:
Rollback plan:

4. Sample Threshold Config

critical:
  acl_leak_count:
    max: 0
  prompt_injection_block_rate:
    min: 1.0
  format_pass_rate:
    min: 0.98

quality:
  recall_at_5:
    min: 0.80
  mrr_at_10:
    min: 0.70
  citation_correctness:
    min: 0.95
  no_answer_accuracy:
    min: 0.90

operations:
  p95_latency_ms:
    max: 5000
  estimated_cost_per_request_usd:
    max: 0.02
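A checker for this grouped `min`/`max` layout can be sketched as follows. To stay dependency-free, the config is written as a Python dict that mirrors the YAML above; loading the actual file would need a YAML parser, which is not shown. `check_grouped` is a hypothetical name.

```python
THRESHOLDS = {
    "critical": {
        "acl_leak_count": {"max": 0},
        "prompt_injection_block_rate": {"min": 1.0},
        "format_pass_rate": {"min": 0.98},
    },
    "quality": {
        "recall_at_5": {"min": 0.80},
        "mrr_at_10": {"min": 0.70},
        "citation_correctness": {"min": 0.95},
        "no_answer_accuracy": {"min": 0.90},
    },
    "operations": {
        "p95_latency_ms": {"max": 5000},
        "estimated_cost_per_request_usd": {"max": 0.02},
    },
}


def check_grouped(summary: dict, thresholds: dict) -> dict[str, list[str]]:
    """Return {group: [failure messages]}; groups with no failure are omitted."""
    failures: dict[str, list[str]] = {}
    for group, metrics in thresholds.items():
        for metric, bounds in metrics.items():
            current = summary.get(metric)
            if current is None:
                problem = f"{metric} is missing"
            elif "min" in bounds and current < bounds["min"]:
                problem = f"{metric}={current} < min {bounds['min']}"
            elif "max" in bounds and current > bounds["max"]:
                problem = f"{metric}={current} > max {bounds['max']}"
            else:
                continue
            failures.setdefault(group, []).append(problem)
    return failures
```

Grouping the result lets CI hard-fail on `critical` while routing `quality` or `operations` failures to a conditional-pass review.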

5. CI Gate Strategy

Change type	Required eval
Minor prompt wording	Smoke generation + schema/citation
Chunking strategy	Full retrieval eval + generation sample
Embedding model	Full retrieval eval
Reranker	Retrieval/ranking eval + latency
Guardrail policy	Security/no-answer/ACL eval
LLM provider/model	Full generation eval + cost/latency
Corpus update	Targeted eval for affected docs
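The table above can be encoded as a lookup so that CI selects eval suites from the kinds of change in a PR. The keys and suite names below are hypothetical labels for the rows of the table, and running everything on an unknown change type is an assumed conservative default.

```python
REQUIRED_EVALS: dict[str, set[str]] = {
    "prompt": {"smoke_generation", "schema_citation"},
    "chunking": {"full_retrieval", "generation_sample"},
    "embedding": {"full_retrieval"},
    "reranker": {"retrieval_ranking", "latency"},
    "guardrail": {"security", "no_answer", "acl"},
    "llm_model": {"full_generation", "cost_latency"},
    "corpus": {"targeted_affected_docs"},
}

ALL_EVALS: set[str] = set().union(*REQUIRED_EVALS.values())


def required_evals(change_types: list[str]) -> set[str]:
    """Union of the suites required by each change type in a PR."""
    selected: set[str] = set()
    for change in change_types:
        if change not in REQUIRED_EVALS:
            return set(ALL_EVALS)  # unknown change: run everything
        selected |= REQUIRED_EVALS[change]
    return selected
```

How a PR is classified into change types (for example by changed paths) is left open here.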

6. Failure Triage

Symptom	Likely layer	Debug evidence
Expected chunk missing from top K	Retriever/chunking/embedding	retrieved IDs, scores
Correct chunk in top K but answer is wrong	Prompt/generator/context builder	prompt trace, final context
Invalid citation	Generator/citation parser	citations vs context IDs
No-answer case still gets an answer	Policy/prompt/guardrail	policy decision, context score
ACL case leaks	Retrieval filter/auth	tenant/roles/query filter
Format fails	Prompt/schema/model	raw output, validation error
Latency increases	Reranker/LLM/retry	stage latency
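A first-pass triage can be automated by mapping per-case signals to the layers in the table above. The sketch below is a heuristic, not a diagnosis: the field names (`acl_leak`, `answered`, `answer_correct`) are hypothetical, the order of the checks is an assumption, and latency triage is left out.

```python
def suspected_layer(case: dict) -> str:
    """Guess the most likely failing layer for one eval case."""
    if case.get("acl_leak"):
        return "Retrieval filter/auth"
    if not case.get("format_pass", True):
        return "Prompt/schema/model"
    if case.get("expected_behavior") == "no_answer" and case.get("answered"):
        return "Policy/prompt/guardrail"
    if case.get("recall_at_5", 1.0) == 0.0:
        return "Retriever/chunking/embedding"
    if case.get("citation_correctness", 1.0) < 1.0:
        return "Generator/citation parser"
    if not case.get("answer_correct", True):
        return "Prompt/generator/context builder"
    return "No failure detected"
```

The returned label can pre-fill the "Suspected layer" column of the eval report, with a human confirming it against the listed debug evidence.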

7. Anti-Patterns

Measuring only answer quality, not retrieval.
Tuning the prompt directly on the test set and then reporting a high score.
Snapshotting full free-form answers.
Not versioning the corpus/index/prompt/model.
Having no negative cases.
Having no trace for each eval row.
Using LLM-as-judge without calibration.
Letting CI get so slow that the team skips it.

Exercises

Objectives

You will build an evaluation suite that runs locally or in CI for the capstone RAG app.

Expected deliverables:

data/eval/golden_set.jsonl with at least 30 questions.
eval_thresholds.yaml.
An eval runner that produces a JSON/Markdown report.
A CI gate that fails when a metric is below its threshold.
A clear release decision: PASS, CONDITIONAL PASS, or FAIL.

Exercise 1: Build a 30-Case Golden Set

Minimum distribution:

Group	Cases
Normal single-hop	8
Synonym/paraphrase	5
Multi-hop	4
No-answer/out-of-scope	4
ACL/permission	3
Prompt injection	3
Format/citation edge case	3

Record template:

{
  "id": "q001",
  "question": "Nhân viên full-time được nghỉ phép năm bao nhiêu ngày?",
  "expected_answer": "Nhân viên full-time được nghỉ 12 ngày phép năm.",
  "expected_chunk_ids": ["hr_leave_policy:v1:0003"],
  "must_cite": ["hr_leave_policy"],
  "expected_behavior": "answer_with_citation",
  "tags": ["hr", "easy", "single-hop", "vietnamese"],
  "difficulty": "easy"
}

Exercise 2: Write the Metric Functions

Create eval/metrics.py:

def recall_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    if not expected_ids:
        return 1.0
    return len(set(retrieved_ids[:k]).intersection(expected_ids)) / len(expected_ids)


def mrr_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    for rank, chunk_id in enumerate(retrieved_ids[:k], start=1):
        if chunk_id in expected_ids:
            return 1.0 / rank
    return 0.0


def citation_correctness(cited_chunk_ids: list[str], allowed_context_ids: set[str]) -> float:
    if not cited_chunk_ids:
        return 0.0
    valid_count = sum(chunk_id in allowed_context_ids for chunk_id in cited_chunk_ids)
    return valid_count / len(cited_chunk_ids)

Test the metrics on small inputs before calling the real RAG pipeline.
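A small hand-checked test might look like the sketch below. The three functions are copied from eval/metrics.py only so the snippet runs on its own; in the project you would import them instead. The chunk IDs are made-up test data.

```python
def recall_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    if not expected_ids:
        return 1.0
    return len(set(retrieved_ids[:k]).intersection(expected_ids)) / len(expected_ids)


def mrr_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    for rank, chunk_id in enumerate(retrieved_ids[:k], start=1):
        if chunk_id in expected_ids:
            return 1.0 / rank
    return 0.0


def citation_correctness(cited_chunk_ids: list[str], allowed_context_ids: set[str]) -> float:
    if not cited_chunk_ids:
        return 0.0
    valid_count = sum(chunk_id in allowed_context_ids for chunk_id in cited_chunk_ids)
    return valid_count / len(cited_chunk_ids)


def test_metrics_on_small_inputs() -> None:
    retrieved = ["c3", "c1", "c9"]
    expected = {"c1", "c2"}
    assert recall_at_k(retrieved, expected, 1) == 0.0  # top 1 is c3, not relevant
    assert recall_at_k(retrieved, expected, 3) == 0.5  # found c1, missed c2
    assert mrr_at_k(retrieved, expected, 3) == 0.5     # first hit at rank 2
    assert mrr_at_k(retrieved, expected, 1) == 0.0
    assert citation_correctness(["c1", "cX"], {"c1", "c3"}) == 0.5
    assert citation_correctness([], {"c1"}) == 0.0
    # Edge case worth knowing: no expected chunks counts as perfect recall.
    assert recall_at_k([], set(), 5) == 1.0


test_metrics_on_small_inputs()
```

The last assertion matters for no-answer cases: they have no expected chunks, so they score 1.0 and can inflate an average recall if they are not reported separately.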

Exercise 3: Eval Runner Skeleton

Create scripts/evaluate.py:

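import json
from pathlib import Path
from statistics import mean

# Metric helpers from Exercise 2. This assumes the project root is on sys.path
# (for example via PYTHONPATH=.); adjust the import to your layout.
from eval.metrics import citation_correctness, mrr_at_k, recall_at_k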


def load_jsonl(path: Path) -> list[dict]:
    rows = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def evaluate_case(case: dict, rag_client) -> dict:
    response = rag_client.query(case["question"], roles=case.get("roles", ["employee"]))
    retrieved_ids = [chunk["chunk_id"] for chunk in response["trace"]["retrieved_chunks"]]
    cited_ids = [c["chunk_id"] for c in response.get("citations", [])]
    expected_ids = set(case.get("expected_chunk_ids", []))

    return {
        "id": case["id"],
        "tags": case["tags"],
        "recall_at_5": recall_at_k(retrieved_ids, expected_ids, 5),
        "mrr_at_10": mrr_at_k(retrieved_ids, expected_ids, 10),
        "format_pass": isinstance(response.get("answer"), str) and isinstance(response.get("citations"), list),
        "citation_correctness": citation_correctness(cited_ids, set(retrieved_ids)),
        "latency_ms": response["trace"]["latency_ms"]["total"],
    }


def summarize(results: list[dict]) -> dict:
    if not results:
        raise ValueError("cannot summarize an empty result list")
    latencies = sorted(r["latency_ms"] for r in results)
    # Nearest-rank p95: index ceil(0.95 * n) - 1, computed with integers.
    p95_index = (95 * len(latencies) + 99) // 100 - 1
    return {
        "case_count": len(results),
        "recall_at_5": mean(r["recall_at_5"] for r in results),
        "mrr_at_10": mean(r["mrr_at_10"] for r in results),
        "format_pass_rate": mean(float(r["format_pass"]) for r in results),
        "citation_correctness": mean(r["citation_correctness"] for r in results),
        "p95_latency_ms": latencies[p95_index],
    }

Wire rag_client to your capstone API. If you do not have a backend yet, mock rag_client so you can test the metrics first.
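A mock client only needs to return the response shape that evaluate_case reads: `answer`, `citations`, and `trace` with `retrieved_chunks` and `latency_ms.total`. The sketch below is one way to do that; the `MockRagClient` class, its canned-answer format, and the fixed latency are all assumptions for local testing.

```python
class MockRagClient:
    """Returns canned responses keyed by question; no network, no model."""

    def __init__(self, canned: dict[str, dict]):
        # canned maps question -> {"answer": str, "chunk_ids": [str, ...]}
        self._canned = canned

    def query(self, question: str, roles: list[str]) -> dict:
        hit = self._canned.get(question)
        chunk_ids = hit["chunk_ids"] if hit else []
        answer = hit["answer"] if hit else "I could not find this in the documents."
        return {
            "answer": answer,
            # Cite only the top chunk, as a simple stand-in for real citations.
            "citations": [{"chunk_id": cid} for cid in chunk_ids[:1]],
            "trace": {
                "retrieved_chunks": [{"chunk_id": cid} for cid in chunk_ids],
                "latency_ms": {"total": 12},  # fixed placeholder latency
            },
        }


client = MockRagClient({
    "How many leave days?": {
        "answer": "12 days.",
        "chunk_ids": ["hr_leave_policy:v1:0003", "hr_leave_policy:v1:0004"],
    },
})
response = client.query("How many leave days?", roles=["employee"])
```

Because the mock is deterministic, any change in the computed metrics comes from the eval code itself, which is what you want to verify at this stage.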

Exercise 4: Threshold Gate

Create eval_thresholds.yaml:

recall_at_5: 0.80
mrr_at_10: 0.70
citation_correctness: 0.95
format_pass_rate: 0.98
no_answer_accuracy: 0.90
prompt_injection_block_rate: 1.00
acl_leak_count: 0
p95_latency_ms: 5000

Gate logic:

# Metrics where lower is better: the threshold is an upper bound.
# All other metrics are treated as "higher is better" with a lower bound.
MAX_BOUND_SUFFIXES = ("_count", "_ms", "_usd")


def check_thresholds(summary: dict, thresholds: dict) -> list[str]:
    failures = []
    for metric, threshold in thresholds.items():
        current = summary.get(metric)
        if current is None:
            failures.append(f"Missing metric: {metric}")
            continue
        if metric.endswith(MAX_BOUND_SUFFIXES):
            if current > threshold:
                failures.append(f"{metric}={current} > {threshold}")
        elif current < threshold:
            failures.append(f"{metric}={current:.3f} < {threshold:.3f}")
    return failures
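The thresholds in this exercise include no_answer_accuracy, prompt_injection_block_rate, and acl_leak_count, which the summarize skeleton from Exercise 3 does not produce, so the gate would report them as missing. A sketch of how they could be derived follows; the per-case fields `refused`, `blocked`, and `leaked` are hypothetical and depend on how your runner records outcomes.

```python
from typing import Iterable, Optional


def _rate(flags: Iterable[bool]) -> Optional[float]:
    """Share of True flags, or None when there are no cases to measure."""
    values = list(flags)
    return sum(values) / len(values) if values else None


def safety_metrics(results: list[dict]) -> dict:
    no_answer = [r for r in results if r.get("expected_behavior") == "no_answer"]
    injection = [r for r in results if "prompt-injection" in r["tags"]]
    acl = [r for r in results if "acl" in r["tags"]]
    return {
        "no_answer_accuracy": _rate(bool(r.get("refused")) for r in no_answer),
        "prompt_injection_block_rate": _rate(bool(r.get("blocked")) for r in injection),
        "acl_leak_count": sum(1 for r in acl if r.get("leaked")),
    }
```

Returning None for an empty group is deliberate: a gate that treats None as a missing metric then fails when the golden set has no cases of that kind, instead of passing by default.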

Exercise 5: GitHub Actions or Equivalent CI

Pseudo workflow:

name: rag-eval-smoke

on:
  pull_request:
    paths:
      - "prompts/**"
      - "packages/rag/**"
      - "data/eval/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-dev.txt
      - run: python scripts/evaluate.py --golden-set data/eval/golden_set_smoke.jsonl --thresholds eval_thresholds.yaml
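The workflow passes `--golden-set` and `--thresholds`, and the step fails the job only if the script exits with a non-zero code. The sketch below shows the argument parsing, a parser for the flat `key: number` threshold file from Exercise 4 that avoids a YAML dependency, and the exit-code mapping. All three helper names are hypothetical, and the flat parser does not handle nested YAML.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the RAG eval and apply the threshold gate.")
    parser.add_argument("--golden-set", type=Path, required=True)
    parser.add_argument("--thresholds", type=Path, required=True)
    return parser


def parse_flat_thresholds(text: str) -> dict[str, float]:
    """Parse lines like 'recall_at_5: 0.80'; blank lines and # comments are skipped."""
    thresholds: dict[str, float] = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        thresholds[key.strip()] = float(value)
    return thresholds


def exit_code(failures: list[str]) -> int:
    """0 lets the CI step pass; 1 fails it and blocks the merge."""
    for failure in failures:
        print(f"GATE FAIL: {failure}")
    return 1 if failures else 0
```

In the script's entry point, the result would typically be passed to `sys.exit(...)` after loading the golden set, running the cases, and checking the thresholds.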

Exercise 6: Write the Eval Report

Create evaluation_report.md with:

Summary metrics.
Results by tag.
Top 5 regressions.
Top 5 latency/cost cases.
Release decision.
Known limitations.
Next actions.

Submission Checklist

The golden set has all 30 cases and all required tags.
The eval runner runs against a mock or the real API.
Retrieval and generation metrics are reported separately.
The threshold gate fails the process when a metric is below its threshold.
The report is broken down by tag, not only aggregated.
Trace metadata is present: prompt/model/index/eval set version.
There is a decision: PASS, CONDITIONAL PASS, or FAIL.