Day 38: Advanced RAG Patterns Production

1. Where does this lesson fit in Production RAG?

From Day 31 to Day 37, you built the foundational blocks:

Documents
  -> parse
  -> chunk
  -> embed
  -> vector DB / sparse index

User query
  -> hybrid search
  -> rerank
  -> build context
  -> answer with citations

Day 38 answers one question: when baseline hybrid search + reranking still fails, which pattern should you add, where should you add it, and how do you prove it is worth the cost?

The key point: advanced RAG is a set of tools for fixing specific failures, not a feature list to switch on wholesale. A good production pipeline usually starts with:

query normalization
  -> hybrid search, dense + BM25
  -> RRF merge
  -> reranking
  -> context building with citation
  -> generation
  -> evaluation and trace

Only then add patterns selectively:

Query rewriting for queries that are short, misspelled, missing chat context, or that use synonyms or acronyms.
Contextual retrieval for chunks that have lost their section/title/table context.
Multi-query when the corpus uses many different wordings and Recall@K is still low.
Decomposition or multi-hop when the question genuinely needs multiple steps.
Corrective RAG or agentic RAG when the system needs to detect weak retrieval on its own and retry.
GraphRAG when the question is about entities, relations or a global summary, not an ordinary FAQ.

2. Failure taxonomy before choosing a pattern

Do not pick a pattern by gut feeling. Classify the failures on your golden set:

| Failure type | Symptom | Suitable pattern |
|---|---|---|
| Query too short | "429 là sao" ("what is 429"), "policy refund?" | Query rewriting, multi-query |
| Query misspelled or typed without diacritics | "nghi phep nam dc bao nhieu ngay" (unaccented, abbreviated Vietnamese for "how many days of annual leave do I get") | Query normalization, query rewriting |
| Query uses a synonym or acronym | "SLA", "PTO", "churn", "chargeback" | Query rewriting, hybrid search, glossary |
| Query needs chat history | "nó có áp dụng cho gói Enterprise không?" ("does it apply to the Enterprise plan?") | Conversational query rewriting |
| Query needs a comparison | "Pro khác Enterprise về refund thế nào?" ("how do Pro and Enterprise differ on refunds?") | Query decomposition, multi-hop RAG |
| Chunk lacks context | Chunk only says "thời hạn là 7 ngày" ("the time limit is 7 days") | Contextual retrieval, parent-child retrieval |
| Retrieval returns near-miss chunks | The top chunks do not contain the answer | Reranking, corrective RAG, better qrels |
| Entity relationships needed | "A liên quan B qua dự án nào?" ("through which project is A related to B?") | GraphRAG or graph-assisted retrieval |
| Overview question over a large corpus | "Các chủ đề chính trong tài liệu này là gì?" ("what are the main topics in this document set?") | GraphRAG community summary |

If the failure lies in ACL, document freshness, citation or badly broken chunking, an advanced prompt usually cannot rescue it. Fix the data, schema and index lifecycle first.

3. Production baseline before advanced patterns

A minimal trustworthy baseline:

1. Normalize the query
2. Dense retrieval, top 50
3. BM25/sparse retrieval, top 50
4. Merge with Reciprocal Rank Fusion
5. Apply tenant/ACL/deleted/index_version filters inside the retriever, so they take effect during steps 2-3 rather than after the merge
6. Rerank the top 50 down to the top 5-10
7. Build context with citation metadata
8. Generate the answer from retrieved evidence only
9. Log trace and metrics

Before adding any pattern, you need:

A golden set of 30-100 questions with expected documents or expected answers.
Query tags: short, keyword, synonym, multi_hop, comparison, table, policy, security_sensitive.
Metrics: Recall@5, Recall@10, MRR@10, nDCG@10, context precision, citation correctness, p50/p95 latency, token cost.
A baseline report for before/after comparison.

Without a baseline, every advanced pattern is just a hunch.
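
As a minimal sketch of how two of the metrics above could be computed for one golden-set item: the function names and the `retrieved`/`relevant` inputs are illustrative assumptions, not the API of any particular eval library.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the expected documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mrr_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Reciprocal rank of the first expected document within the top-k, else 0."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# One golden-set item: the ranked ids returned by the pipeline and the expected ids.
retrieved = ["d7", "d2", "d9", "d4"]
relevant = {"d2", "d4"}
print(recall_at_k(retrieved, relevant, k=3))  # 0.5 (only d2 is in the top 3)
print(mrr_at_k(retrieved, relevant, k=3))     # 0.5 (first hit at rank 2)
```

Averaging these per query tag, rather than only over the whole golden set, gives the before/after numbers a baseline report needs.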

4. Query rewriting

Query rewriting turns the user's question into a clearer query for retrieval, but it must not change the intent.

Examples:

Original: "429 là sao?" ("what is 429?")
Rewritten: "HTTP 429 Too Many Requests rate limit API request per minute"

Original: "nó có áp dụng cho enterprise không?" ("does it apply to enterprise?")
Chat context: the user is asking about the refund policy
Rewritten: "Refund policy có áp dụng cho gói Enterprise không?" ("does the refund policy apply to the Enterprise plan?")

Common forms of rewriting:

| Form | Purpose | Example |
|---|---|---|
| Normalization | Fix missing diacritics, typos, casing | "nghi phep" -> "nghỉ phép" |
| Expansion | Add synonyms/acronyms | "PTO" -> "paid time off, nghỉ phép" |
| Conversational rewrite | Fill in chat context | "nó" ("it") -> "refund policy gói Pro" |
| Domain rewrite | Use the corpus's terminology | "bị chặn request" ("request gets blocked") -> "rate limit HTTP 429" |

Production rules:

Always search both the original query and the rewritten query when the query is at risk of drift.
The rewriter only produces a retrieval query; it does not answer the user.
The output must follow an explicit schema, for example JSON.
Cap the length and the number of terms.
Log the original, the rewrite, the rewrite reason and the risk flags.
Do not let malicious instructions from the user reach the rewriter's system prompt.

Sample prompt contract:

You rewrite the user question for document retrieval.
Do not answer the question.
Preserve the original intent.
Use Vietnamese with necessary English technical terms.
Return JSON only:
{
  "rewritten_query": "...",
  "reason": "...",
  "risk_flags": ["ambiguous" | "prompt_injection" | "exact_lookup" | "none"]
}

When not to rewrite:

The query is an error code, SKU, order id, invoice id or an exact legal phrase.
The user asks for a verbatim quote of a clause.
The rewriter lacks the context to disambiguate.

For exact lookup, keeping the query unchanged is an important signal for BM25.
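
One way to enforce the "do not rewrite exact lookups" rule is a cheap guard in front of the rewriter. The sketch below is illustrative only: the regular expressions are hypothetical identifier shapes and would need tuning to the codes, SKUs and ids in your own corpus.

```python
import re
from typing import Optional

# Hypothetical identifier shapes; adjust them to the codes used in your corpus.
EXACT_LOOKUP_PATTERNS = [
    re.compile(r"\b[A-Z]{2,}-\d+\b"),  # codes such as ERR-1042
    re.compile(r"\b\d{6,}\b"),         # long numeric ids (order id, invoice id)
    re.compile(r'"[^"]{8,}"'),         # a quoted exact phrase
]


def is_exact_lookup(query: str) -> bool:
    """True when the query contains a token that a rewrite could damage."""
    return any(pattern.search(query) for pattern in EXACT_LOOKUP_PATTERNS)


def queries_for_retrieval(query: str, rewritten: Optional[str]) -> list[str]:
    """Always keep the original; add the rewrite only when it is safe."""
    if not rewritten or is_exact_lookup(query):
        return [query]
    return [query, rewritten]


print(queries_for_retrieval("ERR-1042 nghĩa là gì?", "ERR 1042 error meaning"))
print(queries_for_retrieval("429 là sao?", "HTTP 429 Too Many Requests rate limit"))
```

The first call keeps only the original query because it contains a code; the second searches both, since a bare "429" is short enough to benefit from expansion.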

5. Multi-query retrieval

Multi-query generates several variants of the same intent, retrieves for each variant, then merges the results.

User query
  -> q1 original
  -> q2 rewritten
  -> q3 synonym variant
  -> q4 domain terminology variant
  -> retrieve each query
  -> RRF merge
  -> rerank

This pattern helps when the corpus is written in many different ways:

Documents mix Vietnamese and English.
The same concept has many synonyms.
Documents are written by many teams with inconsistent wording.
The user's query is short but the intent is still clear.

Risks:

More retrieval calls.
More noise if variants drift away from the intent.
Higher latency and context pollution.
Hard to debug unless each variant is logged.

Reciprocal Rank Fusion (RRF) is usually a simple and robust way to merge: each chunk's score is the sum of 1 / (k + rank) over every result list in which it appears.

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    text: str
    source_uri: str
    rank: int
    score: float
    retriever: str
    query_variant: str
    metadata: dict


def rrf_merge(result_lists: list[list[Candidate]], k: int = 60) -> list[Candidate]:
    scores: dict[str, float] = defaultdict(float)
    best: dict[str, Candidate] = {}

    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            scores[item.chunk_id] += 1.0 / (k + rank)
            if item.chunk_id not in best or rank < best[item.chunk_id].rank:
                best[item.chunk_id] = item

    merged = sorted(best.values(), key=lambda item: scores[item.chunk_id], reverse=True)
    return [
        Candidate(
            chunk_id=item.chunk_id,
            text=item.text,
            source_uri=item.source_uri,
            rank=index,
            score=scores[item.chunk_id],
            retriever="rrf",
            query_variant=item.query_variant,
            metadata=item.metadata,
        )
        for index, item in enumerate(merged, start=1)
    ]

Production guardrails:

Limit to 2-4 variants.
Dedupe by normalized query.
If the query contains an ID/code, do not generate variants that drop the ID/code.
Rerank after merging; do not feed all chunks straight into the LLM.
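
The first three guardrails can be applied mechanically to whatever the variant generator returns. This is a minimal sketch under assumptions: `ID_PATTERN` is a hypothetical shape for ids/codes, and `filter_variants` is an illustrative helper name.

```python
import re

# Hypothetical shape for ids and codes (e.g. ERR-1042, 429); adapt to your data.
ID_PATTERN = re.compile(r"\b(?:[A-Z]{2,}-\d+|\d{3,})\b")


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so near-identical variants compare equal."""
    return " ".join(query.strip().lower().split())


def filter_variants(original: str, variants: list[str], max_variants: int = 3) -> list[str]:
    """Drop duplicates and variants that lose an id/code found in the original."""
    required_ids = set(ID_PATTERN.findall(original))
    seen = {normalize(original)}
    kept: list[str] = []
    for variant in variants:
        key = normalize(variant)
        if not key or key in seen:
            continue  # empty, or a duplicate of the original or an earlier variant
        if not required_ids <= set(ID_PATTERN.findall(variant)):
            continue  # the variant dropped an id/code
        seen.add(key)
        kept.append(variant.strip())
        if len(kept) >= max_variants:
            break
    return kept


variants = [
    "http 429 là sao?",                       # duplicate of the original
    "HTTP 429 Too Many Requests rate limit",  # kept
    "API bị chặn request",                    # dropped: lost the code 429
    "giới hạn request 429",                   # kept
]
print(filter_variants("HTTP 429 là sao?", variants))
```

The original query is searched separately, so the function returns only the extra variants that survive the checks.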

6. HyDE

HyDE, short for Hypothetical Document Embeddings, generates a hypothetical passage that could answer the question, embeds that passage, then uses the embedding to retrieve real documents.

query
  -> generate hypothetical answer/document
  -> embed hypothetical document
  -> vector search
  -> retrieve real chunks
  -> answer using real chunks only

HyDE helps when:

The query is too short or does not match the style of the corpus.
The corpus is written as policy/documentation while users ask very casually.
The dense embedding of a short question carries too little signal.

But HyDE carries significant risks:

The hypothetical document can hallucinate wrong terms.
It can pull retrieval toward the wrong area of knowledge.
It invites confusion if the team treats HyDE output as evidence.

Hard rule: HyDE output is never a citation and never a source of truth. It is only a query artifact.

Do not use HyDE by default for:

Legal/compliance exact wording.
Medical/finance high-risk answers.
Queries that need a specific clause quoted.
A small corpus where retrieval already works well.
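
The HyDE flow and its "query artifact only" rule can be sketched as follows. The three callables (`generate`, `embed`, `vector_search`) are hypothetical stand-ins for your LLM client, embedding model and vector store; none of them refers to a real library API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    source_uri: str


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],                            # hypothetical LLM call
    embed: Callable[[str], list[float]],                       # hypothetical embedding call
    vector_search: Callable[[list[float], int], list[Chunk]],  # hypothetical vector store
    top_k: int = 20,
) -> tuple[list[Chunk], dict]:
    """Retrieve real chunks using the embedding of a hypothetical passage."""
    hypothetical = generate(query)
    chunks = vector_search(embed(hypothetical), top_k)
    # The hypothetical text is recorded for debugging only. It is never
    # returned as a chunk, so it cannot end up in context or citations.
    trace = {"query": query, "hyde_text": hypothetical, "hyde_used": True}
    return chunks, trace
```

Returning the generated passage only inside the trace keeps it available for debugging while making it structurally impossible to cite.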

7. Step-back prompting

Step-back prompting generates a more general question to find background context, then combines it with direct retrieval.

Example:

Original: "Có được refund sau 10 ngày không?" ("can I get a refund after 10 days?")
Step-back: "Refund policy điều kiện hoàn tiền và thời hạn áp dụng" ("refund policy, refund conditions and applicable time limit")

Safe pipeline:

retrieve(original query)
retrieve(step-back query)
merge
rerank
answer with exact evidence

Do not drop direct retrieval, because the step-back query can lose details such as plan name, date, jurisdiction or version.

Step-back is a good fit for:

Policy concepts.
Incident troubleshooting, going from a symptom to a runbook category.
Questions that need background before the detail can be answered.

It is a poor fit for:

Lookup by code.
Questions that need specific figures.
Questions containing an exact quote.

8. Query decomposition and multi-hop RAG

Query decomposition splits a complex question into several subqueries. Multi-hop RAG retrieves and combines evidence across several steps.

Example:

Question: "Chính sách refund gói Pro khác Enterprise thế nào?" ("how does the refund policy differ between Pro and Enterprise?")

Subquery 1: "Refund policy for Pro plan"
Subquery 2: "Refund policy for Enterprise plan"
Synthesis: compare conditions, time window, exceptions and citations

How they differ:

Decomposition is the act of splitting the question.
Multi-hop RAG is a pipeline that uses the result of one hop to decide the next hop, or that combines several hops.

Production requirements:

Each subquery has its own trace.
Each claim in the final answer maps back to the source of the corresponding subquery.
Cap the number of subqueries, usually 2-5.
Fall back to asking the user when the decomposition is ambiguous.
Do not use multi-hop for simple FAQs.

Diagram:

user question
  -> classify as multi-hop?
  -> decompose into subqueries
  -> retrieve + rerank per subquery
  -> evidence table
  -> synthesize answer
  -> verify citations cover each claim

In production, many systems only need deterministic decomposition for comparison queries:

"A khác B về X thế nào?" ("how does A differ from B on X?")
  -> retrieve X for A
  -> retrieve X for B
  -> compare

No agent loop is needed when the question pattern is stable.
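
A deterministic decomposer for this one template can be a single regular expression. The sketch below is an assumption-laden illustration: it matches only the literal Vietnamese template "A khác B về X thế nào?", and `decompose_comparison` is a hypothetical helper name. Any other phrasing returns `None`, so the caller can fall back to direct retrieval or an LLM-based decomposer.

```python
import re
import unicodedata
from typing import Optional

# Matches only the literal template "A khác B về X thế nào?".
COMPARISON = re.compile(
    r"^(?P<a>.+?)\s+khác\s+(?P<b>.+?)\s+về\s+(?P<x>.+?)\s+thế\s+nào\s*\??$",
    re.IGNORECASE,
)


def decompose_comparison(question: str) -> Optional[list[str]]:
    """Return one subquery per compared entity, or None if the template does not match."""
    # NFC normalization so composed and decomposed diacritics compare equal.
    text = unicodedata.normalize("NFC", question.strip())
    match = COMPARISON.match(text)
    if match is None:
        return None
    a, b, x = match.group("a"), match.group("b"), match.group("x")
    return [f"{x} {a}", f"{x} {b}"]


print(decompose_comparison("Pro khác Enterprise về refund thế nào?"))
print(decompose_comparison("429 là sao?"))
```

Each returned subquery keeps its entity explicit, which makes it straightforward to map every claim in the comparison back to the subquery that produced its evidence.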

9. Contextual retrieval

Contextual retrieval adds context to each chunk at indexing time so the chunk stands on its own at search time.

The problem:

Chunk text: "Thời hạn là 7 ngày kể từ ngày mua." ("The time limit is 7 days from the date of purchase.")

This chunk does not say what the 7 days apply to. If you embed the bare chunk, retrieval is very weak.

Contextual chunk:

Document: Refund Policy 2026
Section: Gói Pro > Điều kiện hoàn tiền (Pro plan > Refund conditions)
Summary: Quy định hoàn tiền cho khách hàng gói Pro. (Refund rules for Pro plan customers.)
Text: Thời hạn là 7 ngày kể từ ngày mua.

Near-production indexing code:

from dataclasses import dataclass


@dataclass(frozen=True)
class RawChunk:
    document_id: str
    chunk_id: str
    title: str
    section_path: list[str]
    page_start: int | None
    text: str
    metadata: dict


def build_contextual_text(chunk: RawChunk) -> str:
    section = " > ".join(chunk.section_path) if chunk.section_path else "Unknown section"
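    # Compare against None explicitly so that a valid page number 0 is not reported as unknown.
    page = f"Page: {chunk.page_start}" if chunk.page_start is not None else "Page: unknown"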
    return "\n".join(
        [
            f"Document: {chunk.title}",
            f"Section: {section}",
            page,
            "Purpose: Retrieval context only. Use original source text for citation.",
            f"Text: {chunk.text}",
        ]
    )


def build_vector_record(chunk: RawChunk, embedding: list[float], index_version: str) -> dict:
    return {
        "id": f"{chunk.document_id}:{chunk.chunk_id}:{index_version}",
        "text": chunk.text,
        "contextual_text": build_contextual_text(chunk),
        "embedding": embedding,
        "metadata": {
            **chunk.metadata,
            "document_id": chunk.document_id,
            "chunk_id": chunk.chunk_id,
            "title": chunk.title,
            "section_path": chunk.section_path,
            "page_start": chunk.page_start,
            "index_version": index_version,
            "contextual_strategy": "title_section_page_v1",
        },
    }

Important notes:

Embed contextual_text, but when displaying a citation, point to text and the real source document.
If the context is LLM-generated, version the prompt and reindex whenever the prompt changes.
Wrong context can break retrieval at scale.
Long context increases embedding cost and index size.

Contextual retrieval is often the first pattern worth trying after hybrid + rerank, because the improvement is built offline at indexing time and adds no LLM call to the online path.

10. Corrective RAG

Corrective RAG checks the quality of the retrieved context before answering. If the context is weak, the system rewrites and retrieves again, widens the search, or asks the user.

retrieve
  -> grade context quality
  -> if good: answer
  -> if weak: rewrite + retrieve again
  -> if still weak: ask clarification or answer "not enough information"

The grader can be:

Rule-based: no chunk reaches the score threshold, the source belongs to the wrong tenant, or the citation is missing.
Model-based: an LLM judges whether the question is answerable from the context.
Hybrid: rules first, with the LLM reserved for hard cases.

Production controls:

Max retry is usually 1.
A total timeout for the retrieval path.
Log the reason: low_recall, low_rerank_score, conflicting_sources, no_citation.
Never let the corrective loop run unbounded.

Corrective RAG pays off when user experience matters more than absolute latency, for example in an internal assistant, customer support or an analyst workflow. Under a tight API latency budget, a rule-based fallback alone may be enough.
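
The bounded retry loop can be sketched with the retriever, grader and rewriter passed in as plain callables. All three are hypothetical placeholders, and the total timeout listed among the production controls is left out to keep the example short.

```python
from typing import Callable, Optional


def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], list[dict]],
    grade: Callable[[str, list[dict]], Optional[str]],  # None means "good", else a reason
    rewrite: Callable[[str, str], str],                 # (original query, reason) -> new query
    max_retries: int = 1,
) -> dict:
    """Retrieve, grade, and retry at most max_retries times before giving up."""
    attempts: list[dict] = []
    current = query
    for attempt in range(max_retries + 1):
        chunks = retrieve(current)
        reason = grade(query, chunks)  # always grade against the user's original question
        attempts.append({"query": current, "reason": reason})
        if reason is None:
            return {"status": "answer", "chunks": chunks, "attempts": attempts}
        if attempt < max_retries:
            current = rewrite(query, reason)
    # Still weak after the last allowed attempt: ask for clarification or decline.
    return {"status": "insufficient_context", "chunks": [], "attempts": attempts}
```

Because the loop is a `for` over a fixed range, it cannot run unbounded, and the `attempts` list doubles as the logged reason trail.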

11. Agentic RAG

Agentic RAG lets the LLM decide to call retrieval tools multiple times, choosing among tools such as vector search, SQL search, internal web search, graph search or code search.

Use it when the task genuinely needs:

Multi-step reasoning.
Tool selection that depends on the situation.
Combining multiple data sources.
Planning and verifying each step.

Do not use agentic RAG for simple FAQs, because:

Latency is hard to predict.
Worst-case cost is high.
Debugging is complex.
It loops easily when the stop condition is weak.
The security surface is larger.

Mandatory checklist:

Tool allowlist.
Max steps.
Timeout.
Cost budget.
Trace for every tool call.
Tenant/ACL filter at the tool layer.
Eval regression per scenario.
Human-readable execution trace for debugging.

A pragmatic option: instead of a free-form agent, use an orchestrator with an explicit state machine:

classify_query
  -> direct_retrieval | comparison_retrieval | troubleshooting_retrieval
  -> fixed steps
  -> answer

A state machine is easier to test than an open-ended agent loop.
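
A fixed-route orchestrator of this kind can be very small. In the sketch below the keyword hints and step names are hypothetical; a real classifier might be a small model or a lookup on query tags, but the shape (classify once, then run a fixed list of steps) stays the same.

```python
from enum import Enum


class Route(Enum):
    DIRECT = "direct_retrieval"
    COMPARISON = "comparison_retrieval"
    TROUBLESHOOTING = "troubleshooting_retrieval"


# Hypothetical keyword rules, for illustration only.
COMPARISON_HINTS = ("khác", "so sánh", " vs ", "compare")
TROUBLESHOOTING_HINTS = ("lỗi", "error", "timeout", "failed")

# Each route maps to a fixed, finite list of steps: there is no open-ended loop.
STEPS = {
    Route.DIRECT: ["retrieve", "rerank", "answer"],
    Route.COMPARISON: ["decompose", "retrieve_each", "rerank_each", "compare", "answer"],
    Route.TROUBLESHOOTING: ["retrieve_runbook", "rerank", "answer"],
}


def classify_query(query: str) -> Route:
    text = f" {query.lower()} "
    if any(hint in text for hint in COMPARISON_HINTS):
        return Route.COMPARISON
    if any(hint in text for hint in TROUBLESHOOTING_HINTS):
        return Route.TROUBLESHOOTING
    return Route.DIRECT


def plan(query: str) -> list[str]:
    """Return the fixed step list for the route this query falls into."""
    return STEPS[classify_query(query)]


print(plan("Pro khác Enterprise về refund thế nào?"))
```

Every route can be unit-tested by asserting on the returned step list, which is much harder to do with a free-form agent.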

12. GraphRAG overview

GraphRAG builds a graph from the corpus:

documents
  -> extract entities and relations
  -> build graph
  -> community detection / summaries
  -> graph search + vector search
  -> answer

GraphRAG fits when the question involves:

Entity relationships: "Project A liên quan team B qua incident nào?" ("through which incident is Project A related to team B?")
Global summary: "Các chủ đề chính trong tập tài liệu này là gì?" ("what are the main topics in this document set?")
Community-level analysis: "Nhóm rủi ro lớn nhất trong contract corpus là gì?" ("what is the largest risk group in the contract corpus?")
A long corpus with many cross-references.

Do not use GraphRAG when:

The problem is a simple FAQ or policy lookup.
The corpus is small.
Documents change constantly but there is no graph update pipeline yet.
The team has no evaluation for entity extraction/relation extraction yet.

Trade-offs:

Building the index costs more.
Update/delete is more complex.
Graph extraction has its own failure modes.
You must evaluate graph quality, not just answer quality.
You may need to store community summaries and provenance as well.

GraphRAG is only an overview in Day 38. For the Day 40 project, add GraphRAG only if the golden set contains many entity/global questions that vector + BM25 + rerank cannot handle.

13. Designing a near-production pipeline

The example below illustrates orchestration for the retrieval path. It is not a complete framework, but it shows the important boundaries: policy, trace, tenant/ACL, query variants, merge, rerank and fallback. It reuses Candidate and rrf_merge from section 5.

from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class RetrievalPolicy:
    max_query_variants: int = 3
    dense_top_k: int = 50
    sparse_top_k: int = 50
    final_top_k: int = 8
    enable_rewrite: bool = True
    enable_multi_query: bool = False
    timeout_ms: int = 2500


@dataclass(frozen=True)
class SearchRequest:
    query: str
    tenant_id: str
    acl_roles: tuple[str, ...]
    chat_summary: str | None = None
    index_version: str = "active"


@dataclass
class SearchTrace:
    original_query: str
    rewritten_query: str | None = None
    query_variants: list[str] = field(default_factory=list)
    retrieved_count: int = 0
    reranked_count: int = 0
    fallback_used: str | None = None
    latency_ms: int = 0
    warnings: list[str] = field(default_factory=list)


class QueryRewriter(Protocol):
    def rewrite(self, request: SearchRequest) -> str | None:
        ...


class Retriever(Protocol):
    def search(
        self,
        query: str,
        tenant_id: str,
        acl_roles: tuple[str, ...],
        index_version: str,
        top_k: int,
    ) -> list[Candidate]:
        ...


class Reranker(Protocol):
    def rerank(self, query: str, candidates: list[Candidate], top_k: int) -> list[Candidate]:
        ...


def dedupe_queries(queries: list[str], max_count: int) -> list[str]:
    seen: set[str] = set()
    output: list[str] = []
    for query in queries:
        normalized = " ".join(query.strip().lower().split())
        if not normalized or normalized in seen:
            continue
        seen.add(normalized)
        output.append(query.strip())
        if len(output) >= max_count:
            break
    return output


class AdvancedRagRetriever:
    def __init__(
        self,
        dense: Retriever,
        sparse: Retriever,
        reranker: Reranker,
        rewriter: QueryRewriter | None,
        policy: RetrievalPolicy,
    ) -> None:
        self.dense = dense
        self.sparse = sparse
        self.reranker = reranker
        self.rewriter = rewriter
        self.policy = policy

    def retrieve(self, request: SearchRequest) -> tuple[list[Candidate], SearchTrace]:
        started = time.monotonic()
        trace = SearchTrace(original_query=request.query)

        queries = [request.query]
        if self.policy.enable_rewrite and self.rewriter:
            rewritten = self.rewriter.rewrite(request)
            if rewritten and rewritten.strip() != request.query.strip():
                trace.rewritten_query = rewritten
                queries.append(rewritten)

        query_variants = dedupe_queries(queries, self.policy.max_query_variants)
        trace.query_variants = query_variants

        result_lists: list[list[Candidate]] = []
        for query in query_variants:
            elapsed_ms = int((time.monotonic() - started) * 1000)
            if elapsed_ms > self.policy.timeout_ms:
                trace.fallback_used = "timeout_before_all_variants"
                break

            result_lists.append(
                self.dense.search(
                    query=query,
                    tenant_id=request.tenant_id,
                    acl_roles=request.acl_roles,
                    index_version=request.index_version,
                    top_k=self.policy.dense_top_k,
                )
            )
            result_lists.append(
                self.sparse.search(
                    query=query,
                    tenant_id=request.tenant_id,
                    acl_roles=request.acl_roles,
                    index_version=request.index_version,
                    top_k=self.policy.sparse_top_k,
                )
            )

        merged = rrf_merge(result_lists)
        trace.retrieved_count = len(merged)

        if not merged:
            trace.warnings.append("no_retrieval_result")
            trace.latency_ms = int((time.monotonic() - started) * 1000)
            return [], trace

        reranked = self.reranker.rerank(
            query=request.query,
            candidates=merged,
            top_k=self.policy.final_top_k,
        )
        trace.reranked_count = len(reranked)
        trace.latency_ms = int((time.monotonic() - started) * 1000)
        return reranked, trace

In real code, you also need:

A circuit breaker for the LLM rewriter.
A cache keyed by tenant_id, acl_hash, index_version and normalized query.
Structured logging and distributed tracing.
Redaction for logs that contain sensitive queries.
An eval job that runs before the feature flag is turned on.

14. Performance and cost trade-offs

| Pattern | Online LLM call? | Latency increase | Cost increase | Main risk | Notes |
|---|---|---|---|---|---|
| Query rewriting | Yes | Low-medium | Low | Intent drift | Cacheable |
| Multi-query | Usually | Medium-high | Medium | Noise, many retrieval calls | Limit variants |
| HyDE | Yes | Medium-high | Medium | Hallucinated retrieval anchor | Never use as evidence |
| Step-back | Yes | Medium | Low-medium | Lost detail | Search direct + step-back |
| Decomposition | Yes | High | Medium-high | Wrong subquery | Trace each subquery |
| Contextual retrieval | Offline or at indexing time | Little online increase | Higher index cost | Wrong/stale context | Well worth trying |
| Corrective RAG | Possibly | Medium-high | Medium | Retry loop | Max retry |
| Agentic RAG | Yes | Hard to predict | High | Loop, tool misuse | Use only under control |
| GraphRAG | Offline + online, depending on design | Medium-high | High | Stale graph / wrong relation | Use for entity/global queries |

Rules of thumb:

If p95 latency under 2 seconds is mandatory, avoid multiple online LLM calls on the retrieval path.
If the corpus has chunks that lost context, prioritize contextual retrieval because its cost sits at indexing time.
If the query set has many synonyms/acronyms, use rewrite + original search before enabling multi-query.
If multi-hop questions are less than 5-10% of traffic, route them separately instead of sending every query through decomposition.

15. Evaluation gate

Every new pattern must pass a decision gate:

| Pipeline | Recall@5 | MRR@10 | Context precision | Citation accuracy | p95 latency | Cost/query | Decision |
|---|---:|---:|---:|---:|---:|---:|---|
| baseline hybrid + rerank | | | | | | | |
| + query rewrite | | | | | | | |
| + contextual retrieval | | | | | | | |
| + multi-query | | | | | | | |

Do not look only at the aggregate. Report per tag:

short
synonym
acronym
comparison
multi_hop
exact_lookup
policy
table
security_sensitive

A pattern is kept when:

It clearly improves the target failure group.
It does not significantly degrade query groups that already work well.
p95 latency and cost stay within budget.
It does not break citation or permissions.
It has enough trace to debug.
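
The first three keep conditions can be turned into an automated check over per-tag results. Everything numeric below (minimum gain, allowed regression, latency and cost budgets) is a placeholder assumption to be replaced by your own thresholds; the citation, permission and trace conditions still need separate checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentMetrics:
    tag: str
    baseline_recall_at_5: float
    candidate_recall_at_5: float


def decide(
    segments: list[SegmentMetrics],
    target_tag: str,
    candidate_p95_ms: float,
    candidate_cost: float,
    *,
    min_gain: float = 0.03,        # placeholder: required gain on the target tag
    max_regression: float = 0.01,  # placeholder: tolerated drop on other tags
    p95_budget_ms: float = 2000.0,
    cost_budget: float = 0.01,
) -> tuple[bool, list[str]]:
    """Return (keep, reasons); an empty reasons list means every check passed."""
    reasons: list[str] = []
    for segment in segments:
        delta = segment.candidate_recall_at_5 - segment.baseline_recall_at_5
        if segment.tag == target_tag and delta < min_gain:
            reasons.append(f"no_clear_gain:{segment.tag}")
        if segment.tag != target_tag and delta < -max_regression:
            reasons.append(f"regression:{segment.tag}")
    if candidate_p95_ms > p95_budget_ms:
        reasons.append("p95_over_budget")
    if candidate_cost > cost_budget:
        reasons.append("cost_over_budget")
    return (not reasons), reasons
```

For example, a rewrite that lifts Recall@5 on `short` from 0.60 to 0.72 but drops `exact_lookup` from 0.95 to 0.90 is rejected with the reason `regression:exact_lookup`, even though the aggregate may look better.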

16. Best practices

Start from hybrid search + reranker.
Do not use agentic RAG to paper over a weak retriever.
Always keep the original query in the retrieval set.
Version the prompts for rewrite, HyDE, decomposition and contextual enrichment.
Never let generated text become a source citation.
Rerank after merging multi-query results.
Have a feature flag to turn each pattern on or off.
Fall back to the baseline when the LLM rewriter times out.
Trace all variants and retrieved chunks.
Evaluate per category, not just the average score.

17. Can this be used in production?

Yes, but not by turning on every pattern.

It is production-ready when all of these hold:

Baseline hybrid + rerank runs reliably and has a golden set.
Each advanced pattern is tied to a specific failure in the golden set.
There is a before/after report on quality, latency and cost.
Tenant/ACL filtering is enforced at the retriever layer.
Citations point to real sources, never to a rewritten query or HyDE text.
There are timeouts, retry limits, caching and fallbacks.
Every retrieval step is traced.
Prompt versions and index versions are explicit.
There is post-rollout monitoring: no-answer rate, citation error, retrieval latency, cost/query, user feedback.

It is not production-ready if:

There is no eval yet.
You cannot tell which pattern fixes which failure.
The agent loop has no max steps.
Query rewrite can change intent without being detected.
Contextual chunks have no version and no reindex path.
The GraphRAG index has no update/delete strategy.

18. End-of-lesson checklist

You can explain how query rewriting, multi-query, HyDE and step-back differ.
You know where query decomposition differs from multi-hop RAG.
You know how contextual retrieval fixes chunks that lost context.
You know when not to use HyDE or agentic RAG.
You can design a trace for the original query, rewritten query, variants, retrieved chunks and reranked chunks.
You have a decision report before keeping an advanced pattern.
You can state the conditions for production readiness.

19. Review questions

Why should you search both the original query and the rewritten query?
When does multi-query retrieval raise Recall@K but lower context precision?
Why must HyDE output not be used as evidence?
How does step-back prompting differ from query rewriting?
What trace must query decomposition store so that the final answer has correct citations?
Why is contextual retrieval usually worth trying before agentic RAG?
When is GraphRAG worth the investment?
If p95 latency triples but Recall@5 rises by only 1%, what do you decide?

Reference material

1. Quick mental model

Advanced RAG is an optimization layer on top of the baseline, not the baseline itself.

Baseline:
query -> hybrid search -> RRF -> rerank -> context -> answer + citation

Advanced:
query understanding
  -> better retrieval queries
  -> better indexed chunks
  -> optional multi-step retrieval
  -> quality check
  -> answer with trace

Default principles:

Fix chunking, metadata, ACL and hybrid search first.
Add query rewriting if user queries lack clarity.
Add contextual retrieval if chunks lose context.
Add multi-query if synonyms/acronyms keep Recall low.
Add decomposition or an agentic flow only for queries that need multiple steps.

2. Decision matrix

| Pattern | Solves well | Avoid when | Production default |
|---|---|---|---|
| Query rewriting | Short queries, typos, missing diacritics, chat history | Exact ID, exact quote, legal wording | Try early |
| Multi-query retrieval | Synonyms, acronyms, varied wording | Tight SLA, small corpus, exact queries | Conditional |
| HyDE | Very short queries, user style differs from corpus | High-risk exact answers | Not by default |
| Step-back prompting | General conceptual context needed | Code/SKU/order lookup | Conditional |
| Query decomposition | Comparisons, multiple conditions | Simple FAQ | Separate route |
| Multi-hop RAG | Evidence needed from multiple documents | Single-step questions | Separate route |
| Contextual retrieval | Chunks missing title/section/table context | Corpus already clean, chunks self-contained | Try early |
| Corrective RAG | Context often weak or missing | Very tight latency | Conditional |
| Agentic RAG | Complex tool choice/multi-step | Simple Q&A | Not by default |
| GraphRAG | Entity relations/global summary | FAQ/policy lookup | Only with a clear need |

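3. Query routing cheat sheet

| Query tag | Example | Suggested route |
|---|---|---|
| `exact_lookup` | "ERR-1042 nghĩa là gì?" ("what does ERR-1042 mean?") | Original query + BM25 + rerank |
| `short` | "429 là sao?" ("what is 429?") | Original + rewrite |
| `synonym` | "nghỉ phép có lương" ("paid leave") | Rewrite + hybrid |
| `acronym` | "PTO policy" | Rewrite with glossary + hybrid |
| `conversation` | "nó áp dụng cho Enterprise không?" ("does it apply to Enterprise?") | Conversational rewrite |
| `comparison` | "Pro khác Enterprise thế nào?" ("how does Pro differ from Enterprise?") | Decomposition into 2 subqueries |
| `multi_hop` | "Ai approve policy ảnh hưởng incident X?" ("who approved the policy that affected incident X?") | Multi-hop or agentic route |
| `global` | "Các theme chính của corpus là gì?" ("what are the main themes of the corpus?") | GraphRAG or offline summary |
| `security_sensitive` | "Lương của team finance?" ("the finance team's salaries?") | Strict ACL, no broad rewrite |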

4. Prompt contract: query rewriting

System:
You rewrite user questions for retrieval over an internal knowledge base.
Do not answer the question.
Preserve intent, constraints, names, IDs, dates and plan names.
If the query is exact lookup, keep it unchanged.
Return JSON only.

Input:
- User query: {query}
- Chat summary: {chat_summary}
- Domain glossary: {glossary}

Output schema:
{
  "rewritten_query": "string",
  "should_search_original": true,
  "reason": "string",
  "risk_flags": ["none" | "ambiguous" | "exact_lookup" | "prompt_injection"]
}

Validation:

Reject the output if it does not parse as JSON.
Reject it if the rewritten query exceeds the length limit, for example 300 characters.
Reject it if an important identifier from the original query is missing.
If risk_flags contains prompt_injection, search only the original query or ask the user to clarify.
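
These four checks can run as plain code on the rewriter's raw output before anything reaches the retriever. This is a sketch under assumptions: `ID_PATTERN` is a hypothetical identifier shape, and returning `None` is used here to mean "search the original query only".

```python
import json
import re
from typing import Optional

MAX_REWRITE_CHARS = 300
# Hypothetical identifier shape (e.g. ERR-1042, 429); adapt to your data.
ID_PATTERN = re.compile(r"\b(?:[A-Z]{2,}-\d+|\d{3,})\b")


def validate_rewrite(original: str, raw_output: str) -> Optional[str]:
    """Return the rewritten query if it passes every check, else None."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # not valid JSON
    if not isinstance(data, dict):
        return None
    rewritten = data.get("rewritten_query")
    flags = data.get("risk_flags", [])
    if not isinstance(rewritten, str) or not rewritten.strip():
        return None
    if not isinstance(flags, list) or "prompt_injection" in flags:
        return None  # flagged: fall back to the original query
    if len(rewritten) > MAX_REWRITE_CHARS:
        return None  # over the length limit
    if set(ID_PATTERN.findall(original)) - set(ID_PATTERN.findall(rewritten)):
        return None  # an identifier from the original query was lost
    return rewritten.strip()
```

A rejected rewrite is not an error path for the user: the pipeline simply continues with the original query and records the rejection in the trace.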

5. Prompt contract: multi-query

System:
Generate retrieval query variants with the same intent.
Do not introduce new facts.
Keep IDs, dates, product names and constraints unchanged.
Return JSON only.

Output schema:
{
  "variants": [
    {"query": "string", "purpose": "synonym|acronym|domain_term|vietnamese_english_mix"}
  ]
}

Guardrails:

At most 3 variants.
Dedupe on normalized text.
Drop any variant that does not keep the constraints.
Rerank after the RRF merge.

6. Prompt contract: HyDE

System:
Write a hypothetical internal documentation paragraph that could answer the query.
This paragraph is used only to improve retrieval embedding.
Do not include citations.
Do not invent product names, dates or legal clauses.
Return plain text, maximum 120 words.

Runbook:

Embed the HyDE text for retrieval.
Never show the HyDE text to the user.
Never use the HyDE text as a citation.
For high-risk domains, disable HyDE or route through review.

7. Prompt contract: step-back

System:
Create one broader conceptual retrieval query.
Preserve the domain and main constraint.
Do not remove product names, jurisdiction, dates or policy version if present.
Return JSON only.

Output:
{"step_back_query": "string", "reason": "string"}

Rules:

Retrieve with both the original and the step-back query.
When the final answer needs a specific number, prefer evidence from the original query.

8. Prompt contract: decomposition

System:
Decompose the user question into minimal retrieval subqueries.
Use decomposition only if the answer requires comparing or combining multiple facts.
Return JSON only.

Output:
{
  "requires_decomposition": true,
  "subqueries": [
    {"id": "q1", "query": "string", "expected_evidence": "string"}
  ],
  "synthesis_instruction": "string"
}

Validation:

At most 5 subqueries.
Each subquery must stay close to the original query.
If the question has two entities A/B, the subqueries must keep A/B explicit.
The final answer must have an evidence map per subquery.

9. Grader for corrective RAG

Rule-based signals:

retrieved_count == 0
top_rerank_score < threshold
Top chunks come from a source older than the active version.
Citation lacks source_uri or page.
The query asks for a comparison but evidence exists for only one side.
The query contains tenant/user-sensitive terms but the results lack ACL metadata.

The LLM grader should return only this schema:

{
  "answerable": true,
  "missing_evidence": [],
  "conflicting_sources": false,
  "recommended_action": "answer|rewrite_and_retry|ask_clarification|refuse"
}
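
Several of the rule-based signals can be checked before any LLM grader is called. The sketch below assumes each chunk is a dict with `text`, `rerank_score`, `index_version` and `source_uri` keys; those field names, the reason strings and the substring-based entity check are illustrative simplifications.

```python
from typing import Optional


def grade_context(
    chunks: list[dict],
    *,
    min_rerank_score: float,
    active_index_version: str,
    required_entities: tuple[str, ...] = (),
) -> Optional[str]:
    """Return None when the context looks usable, else the first failing reason."""
    if not chunks:
        return "no_retrieval_result"
    if max(chunk["rerank_score"] for chunk in chunks) < min_rerank_score:
        return "low_rerank_score"
    if any(chunk["index_version"] != active_index_version for chunk in chunks):
        return "stale_index_version"
    if any(not chunk.get("source_uri") for chunk in chunks):
        return "no_citation"
    # For comparison queries, every compared entity needs some evidence.
    evidence = " ".join(chunk["text"].lower() for chunk in chunks)
    if any(entity.lower() not in evidence for entity in required_entities):
        return "one_sided_comparison"
    return None
```

Only requests that pass these rules but still look doubtful need to reach the LLM grader, which keeps the extra latency and cost confined to the hard cases.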

10. Observability fields

Log a structured trace for every request:

| Field | Purpose |
|---|---|
| `request_id` | End-to-end trace |
| `tenant_id` | Multi-tenancy debugging; needs a redaction policy |
| `user_role_hash` | Avoid logging raw sensitive roles unless needed |
| `original_query` | Debug intent |
| `rewritten_query` | Debug rewrite |
| `query_variants` | Debug multi-query |
| `retriever_top_k` | Reproduce retrieval |
| `retrieved_chunk_ids` | Reproduce context |
| `reranked_chunk_ids` | Debug reranker |
| `citations` | Check answer grounding |
| `index_version` | Debug stale index |
| `prompt_versions` | Debug behavior drift |
| `latency_breakdown_ms` | Performance |
| `llm_calls` | Cost |
| `fallback_used` | Reliability |
| `eval_tags` | Report per category |

Do not log raw confidential document text until a redaction and retention policy is in place.

11. Sample performance budget

| Stage | Suggested p95 budget | Notes |
|---|---|---|
| Query rewrite | 200-600 ms | Cache popular queries |
| Dense retrieval | 50-250 ms | Depends on the vector DB and filters |
| BM25/sparse retrieval | 30-200 ms | Can run in parallel with dense |
| RRF merge | < 20 ms | Local CPU |
| Rerank top 50 | 200-900 ms | Cross-encoders can be expensive |
| Context build | < 50 ms | Dedupe, trim, citation |
| Corrective retry | +300-1500 ms | Only when needed |
| Generation | 500-3000 ms | Depends on the model and output length |

If the p95 target is 2 seconds, the retrieval path should have no more than 1 online LLM call before generation, unless it runs async or uses a very fast model.

12. Quick cost estimation

cost_per_query =
  rewrite_llm_cost
  + multi_query_generation_cost
  + retrieval_calls * retrieval_unit_cost
  + rerank_cost
  + generation_cost

Example routes:

| Route | LLM calls before answer | Retrieval calls | When to use |
|---|---|---|---|
| Baseline | 0 | 2 (dense + sparse) | Default |
| Rewrite | 1 | 4 (original/rewrite x dense/sparse) | Short/ambiguous queries |
| Multi-query, 3 variants | 1 | 8 (4 queries x 2 retrievers) | Heavy synonym use |
| Decomposition, 3 subqueries | 1+ | 6+ (dense/sparse per subquery) | Comparison/multi-hop |
| Corrective retry | 1-2+ | x2 in the worst case | Weak context |

Cost grows linearly with the number of variants/subqueries unless caching and routing are in place.
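
The formula and the route table can be checked with a few lines of arithmetic. This sketch simplifies the formula by folding the rewrite and multi-query generation costs into a single "LLM calls before answer" term, and the unit costs in the example are arbitrary placeholders, not real prices.

```python
def retrieval_calls(query_count: int, retrievers: int = 2) -> int:
    """Each query variant or subquery hits every retriever once (dense + sparse by default)."""
    return query_count * retrievers


def cost_per_query(
    *,
    query_count: int,
    llm_calls_before_answer: int,
    llm_call_cost: float,
    retrieval_unit_cost: float,
    rerank_cost: float,
    generation_cost: float,
    retrievers: int = 2,
) -> float:
    """Simplified per-query cost: pre-answer LLM calls + retrieval + rerank + generation."""
    return (
        llm_calls_before_answer * llm_call_cost
        + retrieval_calls(query_count, retrievers) * retrieval_unit_cost
        + rerank_cost
        + generation_cost
    )


# Arbitrary cost units, chosen only to make the comparison easy to read.
units = dict(llm_call_cost=1.0, retrieval_unit_cost=0.5, rerank_cost=2.0, generation_cost=10.0)
print(cost_per_query(query_count=1, llm_calls_before_answer=0, **units))  # baseline: 13.0
print(cost_per_query(query_count=4, llm_calls_before_answer=1, **units))  # multi-query: 17.0
```

With these placeholder units, the multi-query route (original plus 3 variants, 8 retrieval calls) costs 17.0 against 13.0 for the baseline, and each further variant adds one more pair of retrieval calls.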

13. Security notes

The prompt is not a security boundary. ACLs must be enforced at the retriever/database layer.
Query rewriting must never add a tenant, role, or permission taken from user input.
The cache key must include tenant_id, acl_hash, index_version, and the normalized query.
Never cache retrieval results across tenants.
Generated queries, HyDE text, and graph summaries are all derived artifacts, not the source of truth.
For right-to-delete requests, delete or invalidate the related contextual chunks, graph nodes, and summaries.
Agentic tools need an allowlist and a separate permission check in each tool.
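One possible way to build the cache key described above. The helper names and the choice of SHA-256 are assumptions. The tenant and roles are expected to come from the authenticated session, never from the query text.

```python
import hashlib
import unicodedata


def normalize_query(query: str) -> str:
    # NFC-normalize, lowercase, and collapse whitespace so equivalent queries share a key.
    text = unicodedata.normalize("NFC", query).lower()
    return " ".join(text.split())


def acl_hash(roles: list[str]) -> str:
    # Sort roles so the hash does not depend on role order.
    return hashlib.sha256("|".join(sorted(roles)).encode("utf-8")).hexdigest()[:16]


def retrieval_cache_key(tenant_id: str, roles: list[str], index_version: str, query: str) -> str:
    parts = [tenant_id, acl_hash(roles), index_version, normalize_query(query)]
    # Join with a unit separator so field boundaries stay unambiguous, then hash.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


key_a = retrieval_cache_key("company_a", ["support", "sales"], "day38-v1", "Refund  policy?")
key_a_same = retrieval_cache_key("company_a", ["sales", "support"], "day38-v1", "refund policy?")
key_b = retrieval_cache_key("company_b", ["support", "sales"], "day38-v1", "refund policy?")
print(key_a == key_a_same, key_a == key_b)
```

Because tenant_id and index_version are part of the key, a cached result cannot be served to another tenant or survive a reindex.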

14. Decision report template

# Advanced RAG Decision Report

## Change
- Pattern:
- Prompt/index version:
- Target query category:
- Rollout flag:

## Baseline problem
- Failing examples:
- Root cause:
- Current metrics:

## Before/after metrics
| Segment | Pipeline | Recall@5 | MRR@10 | Context precision | Citation accuracy | p95 latency | Cost/query |
|---|---|---:|---:|---:|---:|---:|---:|
| short | baseline | | | | | | |
| short | proposed | | | | | | |

## Risks
- Intent drift:
- Context pollution:
- Security/ACL:
- Cost/latency:
- Ops complexity:

## Decision
- Keep / rollback / limited rollout:
- Reason:
- Monitoring:
- Owner:

15. Rollout plan

Run offline eval on the golden set.
Shadow mode: log the proposed retrieval but do not use it to answer.
Compare traces against the baseline.
Internal canary on 5-10% of traffic.
Monitor no-answer rate, feedback, p95 latency, cost/query, and citation errors.
Roll out behind a feature flag.
Keep rollback to a single config change, with no redeploy needed.
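The shadow, canary, and rollback steps can be driven by one flag entry. The sketch below is an assumption about how such routing might look: the `FLAGS` structure, the flag name, and the hashing scheme are all hypothetical.

```python
import hashlib

# Hypothetical flag config; in practice this would live in a config service.
FLAGS = {"query_rewrite": {"enabled": True, "canary_percent": 10, "shadow": False}}


def bucket(user_id: str, flag: str) -> int:
    # Stable 0-99 bucket per (flag, user) so a user stays in the same cohort.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, flag: str, flags: dict = FLAGS) -> str:
    cfg = flags.get(flag, {})
    if not cfg.get("enabled", False):
        return "baseline"  # rollback: set "enabled" to False, no redeploy
    if cfg.get("shadow", False):
        return "shadow"  # run and log the proposed path, answer from baseline
    if bucket(user_id, flag) < cfg.get("canary_percent", 0):
        return "proposed"
    return "baseline"


print(route("user-123", "query_rewrite"))
```

Hashing the flag name together with the user ID keeps cohorts independent across flags, so the same users are not always the canary group.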

16. Debug runbook

When an answer is wrong, ask in this order:

Did the rewrite change the query's intent?
Was the original query also searched?
Did the retriever filter on the correct tenant/ACL/index_version?
Did dense or BM25 find the expected document?
Did the RRF merge push the expected document too far down?
Did the reranker wrongly drop the expected chunk?
Did the context builder cut off the evidence?
Did the generator ignore citations or hallucinate?
Did the new pattern cause a regression in another query tag?
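Part of this runbook can be automated when the trace stores chunk IDs per stage. The sketch below walks the stages in order and reports the first one where no expected chunk survives. The trace keys (`dense_ids`, `bm25_ids`, `merged_ids`, `reranked_ids`, `context_ids`) are hypothetical names; it covers only the retrieval-side questions, not rewrite intent or ACL checks.

```python
def locate_failure(trace: dict, expected: set[str], rerank_input: int = 50, k: int = 5) -> str:
    """Return the first stage that lost every expected chunk."""
    if not expected & (set(trace["dense_ids"]) | set(trace["bm25_ids"])):
        return "retrieval"  # neither dense nor BM25 surfaced an expected chunk
    if not expected & set(trace["merged_ids"][:rerank_input]):
        return "rrf_merge"  # merged rank too low to reach the reranker
    if not expected & set(trace["reranked_ids"][:k]):
        return "reranker"  # reranker dropped or demoted the expected chunk
    if not expected & set(trace["context_ids"]):
        return "context_builder"  # evidence was trimmed out of the final context
    return "generation"  # evidence reached the LLM; inspect the answer and citations


example_trace = {
    "dense_ids": ["refund_enterprise_001"],
    "bm25_ids": ["refund_pro_001"],
    "merged_ids": ["refund_enterprise_001", "refund_pro_001"],
    "reranked_ids": ["refund_enterprise_001"],
    "context_ids": ["refund_enterprise_001"],
}
print(locate_failure(example_trace, {"refund_pro_001"}))
```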

17. Production readiness checklist

18. Short production readiness answer

An advanced pattern is production-ready if it was chosen based on a measured failure, has before/after eval, has traces, has a fallback, keeps citations pointing at real sources, and does not break tenant/ACL isolation. Do not productionize advanced RAG by switching on query rewriting, multi-query, HyDE, agentic loops, and GraphRAG all at once without evidence that the quality gain outweighs the added cost, latency, and risk.

Exercise

Goal

After this exercise you will have a mini report showing which patterns are worth keeping in your RAG pipeline. The focus is on the 3 patterns closest to production:

Query rewriting.
Multi-query retrieval with RRF merge.
Contextual retrieval.

HyDE, step-back, decomposition, corrective RAG, and GraphRAG are extensions if time allows.

Suggested duration: 120-180 minutes.

1. Prerequisites

You need a baseline from Day 36/37:

user query
  -> dense retrieval top_k
  -> BM25 retrieval top_k
  -> RRF merge
  -> rerank top_n
  -> final contexts

If you do not have full code yet, you can do the exercise at notebook/script level with a small corpus and manual scoring.

Minimum requirements:

Python 3.10+.
An embedding model, or a stable (deterministic) mock embedding.
A BM25 implementation, for example rank-bm25, or a simple hand-written sparse search.
A reranker, or a mock reranker based on expected keywords if you are only focusing on orchestration.

2. Create a mini corpus

Create 12-20 chunks that simulate an enterprise knowledge base. Each chunk needs metadata:

CORPUS = [
    {
        "chunk_id": "refund_pro_001",
        "document_id": "refund_policy_2026",
        "title": "Refund Policy 2026",
        "section_path": ["Plans", "Pro"],
        "text": "Khách hàng gói Pro được hoàn tiền trong vòng 7 ngày kể từ ngày mua nếu chưa vượt quá 100 API calls.",
        "metadata": {
            "tenant_id": "company_a",
            "acl_roles": ["support", "sales"],
            "source_uri": "kb://refund_policy_2026#pro",
            "index_version": "day38-v1",
        },
    },
    {
        "chunk_id": "refund_enterprise_001",
        "document_id": "refund_policy_2026",
        "title": "Refund Policy 2026",
        "section_path": ["Plans", "Enterprise"],
        "text": "Gói Enterprise không áp dụng hoàn tiền tự động. Mọi yêu cầu refund cần được account manager phê duyệt.",
        "metadata": {
            "tenant_id": "company_a",
            "acl_roles": ["support", "sales"],
            "source_uri": "kb://refund_policy_2026#enterprise",
            "index_version": "day38-v1",
        },
    },
    {
        "chunk_id": "rate_limit_429_001",
        "document_id": "api_error_guide",
        "title": "API Error Guide",
        "section_path": ["HTTP errors", "429"],
        "text": "HTTP 429 Too Many Requests xảy ra khi client vượt quá rate limit theo phút hoặc theo ngày.",
        "metadata": {
            "tenant_id": "company_a",
            "acl_roles": ["developer", "support"],
            "source_uri": "kb://api_error_guide#429",
            "index_version": "day38-v1",
        },
    },
]

Add at least:

3 billing/payment chunks with synonyms such as "invoice", "hóa đơn", "thanh toán".
3 leave/PTO chunks to test acronyms.
3 near-duplicate chunks that differ by plan/version to test reranking.
2 chunks from another tenant to test that the filter does not leak.
2 chunks with short text that lacks context, for example "Thời hạn là 7 ngày" ("The deadline is 7 days"), to test contextual retrieval.
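To check that the other-tenant chunks never leak, apply the tenant, role, and index filter before any ranking. A minimal sketch follows, assuming the metadata layout of the corpus above; the function name and the tiny sample corpus are hypothetical.

```python
def visible_chunks(corpus: list[dict], tenant_id: str, user_roles: set[str], index_version: str) -> list[dict]:
    """Keep only chunks the caller may see; run this before dense/BM25 scoring."""
    return [
        chunk
        for chunk in corpus
        if chunk["metadata"]["tenant_id"] == tenant_id
        and chunk["metadata"]["index_version"] == index_version
        and user_roles & set(chunk["metadata"]["acl_roles"])
    ]


# Tiny sample corpus with only the fields the filter needs.
SAMPLE = [
    {"chunk_id": "refund_pro_001",
     "metadata": {"tenant_id": "company_a", "acl_roles": ["support", "sales"], "index_version": "day38-v1"}},
    {"chunk_id": "rate_limit_429_001",
     "metadata": {"tenant_id": "company_a", "acl_roles": ["developer", "support"], "index_version": "day38-v1"}},
    {"chunk_id": "other_tenant_001",
     "metadata": {"tenant_id": "company_b", "acl_roles": ["support"], "index_version": "day38-v1"}},
]

allowed = visible_chunks(SAMPLE, "company_a", {"sales"}, "day38-v1")
print([chunk["chunk_id"] for chunk in allowed])
```

In a real vector DB the same conditions would be pushed down as metadata filters instead of being applied in Python.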

3. Create a golden set

Create a file or a Python list:

GOLDEN_SET = [
    {
        "query": "429 là sao?",
        "tags": ["short", "acronym"],
        "expected_chunk_ids": ["rate_limit_429_001"],
    },
    {
        "query": "gói Pro refund khác Enterprise thế nào?",
        "tags": ["comparison", "multi_hop"],
        "expected_chunk_ids": ["refund_pro_001", "refund_enterprise_001"],
    },
    {
        "query": "PTO của nhân viên full-time là gì?",
        "tags": ["acronym", "synonym"],
        "expected_chunk_ids": ["leave_policy_001"],
    },
]

You need at least 20 queries:

5 short queries.
5 synonym/acronym queries.
4 comparison queries.
3 exact lookup queries.
3 queries that target chunks lacking context.
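Before running any eval, it helps to check that every expected chunk ID actually exists in the corpus and to count queries per tag. The sketch below is one way to do that; the function name and return shape are assumptions.

```python
from collections import Counter


def validate_golden_set(golden_set: list[dict], corpus: list[dict]) -> dict:
    """Report query count, tag distribution, and expected IDs missing from the corpus."""
    known_ids = {chunk["chunk_id"] for chunk in corpus}
    expected_ids = {cid for row in golden_set for cid in row["expected_chunk_ids"]}
    tag_counts = Counter(tag for row in golden_set for tag in row["tags"])
    return {
        "n_queries": len(golden_set),
        "tag_counts": dict(tag_counts),
        "missing_chunk_ids": sorted(expected_ids - known_ids),
    }


sample_corpus = [{"chunk_id": "rate_limit_429_001"}]
sample_golden = [
    {"query": "429 là sao?", "tags": ["short", "acronym"], "expected_chunk_ids": ["rate_limit_429_001"]},
    {"query": "PTO của nhân viên full-time là gì?", "tags": ["acronym", "synonym"],
     "expected_chunk_ids": ["leave_policy_001"]},
]
report = validate_golden_set(sample_golden, sample_corpus)
print(report)
```

A non-empty `missing_chunk_ids` means some queries can never score a hit, which would silently drag Recall@K down for reasons unrelated to retrieval quality.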

4. Implement metrics

def recall_at_k(results: list[str], expected: set[str], k: int) -> float:
    if not expected:
        return 0.0
    return len(set(results[:k]) & expected) / len(expected)


def reciprocal_rank(results: list[str], expected: set[str]) -> float:
    for index, chunk_id in enumerate(results, start=1):
        if chunk_id in expected:
            return 1.0 / index
    return 0.0


def evaluate(run_results: list[dict], k: int = 5) -> dict:
    # Guard against an empty run so the averages below never divide by zero.
    if not run_results:
        return {f"recall@{k}": 0.0, "mrr": 0.0}
    recalls = []
    mrrs = []
    for row in run_results:
        result_ids = row["result_chunk_ids"]
        expected = set(row["expected_chunk_ids"])
        recalls.append(recall_at_k(result_ids, expected, k))
        mrrs.append(reciprocal_rank(result_ids, expected))
    return {
        f"recall@{k}": sum(recalls) / len(recalls),
        "mrr": sum(mrrs) / len(mrrs),
    }

Extend if you have time:

Report by tag.
Add p50/p95 latency.
Add estimated cost/query.
Add context precision: the share of chunks in the final context that belong to the expected set.
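The extensions above can be sketched as three small helpers. The row format with `tags`, `result_chunk_ids`, and `expected_chunk_ids` is an assumption that mirrors the golden set, and the percentile uses the nearest-rank method, which is adequate for a small eval set.

```python
import math
from collections import defaultdict


def context_precision(context_ids: list[str], expected: set[str]) -> float:
    # Share of chunks in the final context that belong to the expected set.
    if not context_ids:
        return 0.0
    return sum(1 for cid in context_ids if cid in expected) / len(context_ids)


def percentile(values: list[float], pct: float) -> float:
    # Nearest-rank percentile over the sorted values.
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def recall_by_tag(rows: list[dict], k: int = 5) -> dict[str, float]:
    # A query with several tags contributes to each of its tags.
    buckets: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        expected = set(row["expected_chunk_ids"])
        hit = len(set(row["result_chunk_ids"][:k]) & expected) / len(expected) if expected else 0.0
        for tag in row["tags"]:
            buckets[tag].append(hit)
    return {tag: sum(vals) / len(vals) for tag, vals in buckets.items()}


rows = [
    {"tags": ["short"], "result_chunk_ids": ["a", "x"], "expected_chunk_ids": ["a"]},
    {"tags": ["short", "acronym"], "result_chunk_ids": ["y"], "expected_chunk_ids": ["b"]},
]
print(recall_by_tag(rows))
print(percentile([100, 200, 300, 400], 95))
```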

5. Baseline run

Run the pipeline:

baseline = hybrid search + RRF + rerank

Record the results in a table:

| Query | Tags | Expected | Retrieved top 5 | Recall@5 | RR | Note |
|---|---|---|---|---:|---:|---|

Analyze the 5 worst failures. For each one, record the root cause:

Query too short.
Synonym/acronym mismatch.
Chunk lacks context.
Reranker failure.
Expected document removed by the ACL/index filter.
Corpus is missing the data.

6. Add query rewriting

Implement a simple rewriter first. It can be rule-based or LLM-based.

Rule-based starter:

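import re

GLOSSARY = {
    "pto": "paid time off nghỉ phép có lương",
    "429": "HTTP 429 Too Many Requests rate limit",
    "refund": "hoàn tiền refund",
}


def rewrite_query(query: str) -> str | None:
    # Match glossary keys against whole tokens so "pto" does not fire on "laptop".
    tokens = set(re.findall(r"\w+", query.lower()))
    expansions = [value for key, value in GLOSSARY.items() if key in tokens]
    if not expansions:
        return None
    # Append expansions to the original query instead of replacing it.
    return f"{query} {' '.join(expansions)}"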

Retrieval rule:

queries = [original_query]
rewritten = rewrite_query(original_query)
if rewritten:
    queries.append(rewritten)

Rerun the eval:

Does query rewriting raise Recall@5 for the short, acronym, and synonym groups?
Does it hurt the exact_lookup group?
How much do latency and cost increase if you use an LLM rewrite?
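To answer the regression question, compare per-query reciprocal rank between the two runs and list every query that got worse. A minimal sketch, assuming both runs use the same row format as the eval code; the function name `rr_regressions` is hypothetical.

```python
def reciprocal_rank(results: list[str], expected: set[str]) -> float:
    for index, chunk_id in enumerate(results, start=1):
        if chunk_id in expected:
            return 1.0 / index
    return 0.0


def rr_regressions(baseline_rows: list[dict], proposed_rows: list[dict]) -> list[dict]:
    """Return queries whose reciprocal rank dropped in the proposed run."""
    before = {
        row["query"]: reciprocal_rank(row["result_chunk_ids"], set(row["expected_chunk_ids"]))
        for row in baseline_rows
    }
    regressions = []
    for row in proposed_rows:
        after = reciprocal_rank(row["result_chunk_ids"], set(row["expected_chunk_ids"]))
        prior = before.get(row["query"])
        if prior is not None and after < prior:
            regressions.append({"query": row["query"], "tags": row["tags"], "before": prior, "after": after})
    return regressions


baseline_run = [
    {"query": "refund_policy_2026 section Pro", "tags": ["exact_lookup"],
     "expected_chunk_ids": ["refund_pro_001"], "result_chunk_ids": ["refund_pro_001", "refund_enterprise_001"]},
]
rewrite_run = [
    {"query": "refund_policy_2026 section Pro", "tags": ["exact_lookup"],
     "expected_chunk_ids": ["refund_pro_001"], "result_chunk_ids": ["refund_enterprise_001", "refund_pro_001"]},
]
print(rr_regressions(baseline_run, rewrite_run))
```

An aggregate metric can hide these cases: Recall@5 is unchanged in the example, but the expected chunk moved from rank 1 to rank 2.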

7. Add multi-query retrieval

Create variants:

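import re


def generate_query_variants(query: str) -> list[str]:
    variants = [query]
    rewritten = rewrite_query(query)
    if rewritten:
        variants.append(rewritten)
    # Case-insensitive swap so "Refund" is translated as well as "refund".
    if re.search(r"refund", query, flags=re.IGNORECASE):
        variants.append(re.sub(r"refund", "hoàn tiền", query, flags=re.IGNORECASE))
    # Dedupe while preserving order; cap at 3 queries including the original.
    return list(dict.fromkeys(variants))[:3]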

Each variant runs dense + BM25, followed by RRF merge and rerank.

Report requirements:

Retrieval calls per query.
Recall@5 by tag.
Context precision.
An example query that improved.
An example query that picked up noise.
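A sketch of the multi-query orchestration that also counts retrieval calls for the report. The retrievers are passed in as plain callables, and the two mock retrievers are hypothetical stand-ins for your dense and BM25 search. Reranking is left out to keep the example short.

```python
from collections import defaultdict
from typing import Callable

Retriever = Callable[[str, int], list[str]]


def multi_query_retrieve(
    variants: list[str], retrievers: list[Retriever], top_k: int = 10, rrf_k: int = 60
) -> tuple[list[str], int]:
    """Run every variant through every retriever, RRF-merge, and count calls."""
    scores: dict[str, float] = defaultdict(float)
    calls = 0
    for variant in variants:
        for retrieve in retrievers:
            calls += 1
            for rank, chunk_id in enumerate(retrieve(variant, top_k), start=1):
                scores[chunk_id] += 1.0 / (rrf_k + rank)
    merged = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return merged, calls


def mock_dense(query: str, top_k: int) -> list[str]:
    # Hypothetical: only the Vietnamese phrasing lands near the refund chunks.
    if "hoàn tiền" in query:
        return ["refund_pro_001", "refund_enterprise_001"][:top_k]
    return ["rate_limit_429_001"][:top_k]


def mock_bm25(query: str, top_k: int) -> list[str]:
    # Hypothetical keyword match on either spelling.
    if "refund" in query.lower() or "hoàn tiền" in query:
        return ["refund_pro_001"][:top_k]
    return []


merged_ids, n_calls = multi_query_retrieve(["refund Pro?", "hoàn tiền Pro?"], [mock_dense, mock_bm25])
print(merged_ids, n_calls)
```

The noisy `rate_limit_429_001` hit from the first variant still appears in the merged list, which is the kind of case the noise example in the report should capture.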

8. Add contextual retrieval

Create a contextual_text field:

def contextual_text(chunk: dict) -> str:
    section = " > ".join(chunk["section_path"])
    return "\n".join(
        [
            f"Document: {chunk['title']}",
            f"Section: {section}",
            f"Text: {chunk['text']}",
        ]
    )

Index/embed using contextual_text, but the context sent to the LLM should still use:

Title, section, source_uri, original text

Compare:

Baseline: embed text.
Contextual: embed contextual_text.

Questions to answer:

Which query groups improve?
How much do index size and embedding tokens grow?
Does any chunk get the wrong context and skew retrieval?
Is a new index version (reindex) required?
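For the index size question, a rough estimate is the ratio of tokens in contextual_text to tokens in the raw text. The sketch below counts whitespace-separated tokens, which is only an approximation of a real embedding tokenizer; the function name `embedding_overhead` is an assumption.

```python
def contextual_text(chunk: dict) -> str:
    section = " > ".join(chunk["section_path"])
    return "\n".join(
        [
            f"Document: {chunk['title']}",
            f"Section: {section}",
            f"Text: {chunk['text']}",
        ]
    )


def embedding_overhead(corpus: list[dict]) -> float:
    """Ratio of contextual tokens to raw tokens, using whitespace tokens as a proxy."""
    raw = sum(len(chunk["text"].split()) for chunk in corpus)
    contextual = sum(len(contextual_text(chunk).split()) for chunk in corpus)
    return contextual / raw if raw else 0.0


short_chunk = {
    "title": "Refund Policy 2026",
    "section_path": ["Plans", "Pro"],
    "text": "Thời hạn là 7 ngày.",
}
print(embedding_overhead([short_chunk]))  # 14 contextual tokens / 5 raw tokens = 2.8
```

The overhead is largest for short chunks, which are also the ones that gain the most from the added context.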

9. Optional: HyDE

Create a short hypothetical document for the query:

def mock_hyde(query: str) -> str:
    return f"Tài liệu nội bộ giải thích về {query}, bao gồm điều kiện áp dụng, giới hạn, ngoại lệ và ví dụ."

Embed the HyDE text and retrieve with it. Report:

Does HyDE improve short queries?
Does overly generic text skew retrieval?
Is it guaranteed that HyDE text is never used as a citation?
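One way to enforce the last point is to validate citations against the set of indexed chunks before returning an answer. This is a sketch under assumptions: retrieved items are dicts with `chunk_id` and `source_uri`, and any generated artifact is marked with a hypothetical `derived` flag.

```python
def citable_ids(retrieved: list[dict]) -> set[str]:
    # Only indexed chunks with a source_uri can be cited; derived text cannot.
    return {
        item["chunk_id"]
        for item in retrieved
        if item.get("source_uri") and not item.get("derived", False)
    }


def invalid_citations(citations: list[str], retrieved: list[dict]) -> list[str]:
    """Return cited IDs that do not point at a real source chunk."""
    allowed = citable_ids(retrieved)
    return [cid for cid in citations if cid not in allowed]


retrieved_items = [
    {"chunk_id": "rate_limit_429_001", "source_uri": "kb://api_error_guide#429"},
    {"chunk_id": "hyde_tmp_001", "derived": True},  # hypothetical HyDE document
]
print(invalid_citations(["rate_limit_429_001", "hyde_tmp_001"], retrieved_items))
```

The simplest design keeps HyDE text out of the context entirely and uses it only as a search vector, so this check acts as a second line of defense.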

10. Optional: step-back

Create a step-back query:

def step_back(query: str) -> str | None:
    if "refund" in query.lower() or "hoàn tiền" in query.lower():
        return "refund policy điều kiện hoàn tiền thời hạn ngoại lệ theo gói"
    if "429" in query:
        return "API rate limit HTTP error troubleshooting"
    return None

Retrieve with both the original and the step-back query. Report the cases where step-back helps find background but the original query is still needed for the details.

11. Optional: decomposition

For comparison queries, split manually:

def decompose(query: str) -> list[str]:
    lowered = query.lower()
    if "pro" in lowered and "enterprise" in lowered and "refund" in lowered:
        return ["Pro plan refund policy", "Enterprise plan refund policy"]
    return [query]

Requirements:

Trace results per subquery.
The final answer has a citation for each side of the comparison.
If evidence is missing for one side, the answer must say the information is missing instead of guessing.
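These requirements can be sketched by keeping results keyed by subquery and checking for empty sides before generation. The helper names and the mock retriever below are hypothetical.

```python
from typing import Callable


def retrieve_per_subquery(
    subqueries: list[str], retrieve: Callable[[str, int], list[str]], top_k: int = 3
) -> dict[str, list[str]]:
    # Keep results keyed by subquery so the trace shows which side found what.
    return {subquery: retrieve(subquery, top_k) for subquery in subqueries}


def missing_evidence(trace: dict[str, list[str]]) -> list[str]:
    """Subqueries with no retrieved chunk; the answer must flag these instead of guessing."""
    return [subquery for subquery, ids in trace.items() if not ids]


def mock_retrieve(query: str, top_k: int) -> list[str]:
    # Hypothetical index that only knows the Pro policy.
    index = {"Pro plan refund policy": ["refund_pro_001"]}
    return index.get(query, [])[:top_k]


subquery_trace = retrieve_per_subquery(
    ["Pro plan refund policy", "Enterprise plan refund policy"], mock_retrieve
)
gaps = missing_evidence(subquery_trace)
print(subquery_trace)
print(gaps)
```

When `gaps` is not empty, the generator prompt can list those subqueries explicitly so the answer states what is missing for that side.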

12. Optional: corrective RAG

Add a rule:

def should_retry(top_results: list[dict], expected_min_score: float = 0.2) -> bool:
    if not top_results:
        return True
    return top_results[0].get("rerank_score", 0.0) < expected_min_score

On retry:

Use the rewritten query if it has not been used yet.
Increase top_k.
If the context is still weak, return "not enough information in the documents" or ask for clarification.

13. Final decision report

Submit a report:

# Day 38 Advanced RAG Report

## Corpus and golden set
- Number of chunks:
- Number of queries:
- Tags:

## Results
| Pipeline | Recall@5 | MRR | Context precision | p95 latency | Estimated cost/query |
|---|---:|---:|---:|---:|---:|
| baseline hybrid + rerank | | | | | |
| + query rewriting | | | | | |
| + multi-query | | | | | |
| + contextual retrieval | | | | | |

## Decision
- Keep:
- Do not keep:
- Rollout condition:
- Risks:
- Monitoring:

Decision rules:

Keep query rewriting if the short/synonym/acronym groups clearly improve and exact lookup does not regress.
Keep contextual retrieval if chunks lacking context improve and the index cost is acceptable.
Keep multi-query only if the Recall gain clearly outweighs the added latency, cost, and noise.
Do not keep HyDE/agentic/decomposition if no query category needs them yet.

14. Self-check quiz

Why should a query rewrite not fully replace the original query?
What problem does RRF solve when multi-query produces several result lists?
Should contextual retrieval cite contextual_text or the original source text?
When would you choose decomposition over multi-query?
What limits does corrective RAG need to avoid a cost spike?
Which kinds of questions in your corpus is GraphRAG suited for?

15. Completion criteria

Baseline metrics are recorded.
The golden set has at least 20 queries.
Eval is reported by tag.
Query rewriting is implemented with a before/after comparison.
Contextual retrieval is implemented with a before/after comparison.
The latency/cost trade-off is analyzed.
The final decision report is complete.
There is a production readiness answer for the pipeline you chose.