Day 32: Embedding Models & Benchmarking for Vietnamese

TL;DR

An embedding turns text into a numeric vector so that texts with similar meaning sit close together in vector space. In RAG, the embedding determines whether the retriever fetches the right documents before the LLM generates an answer. For Vietnamese, do not pick a model just because of a leaderboard or because the model "sounds strong". The right approach is to benchmark on your real corpus, with real queries or synthetic queries close to production and explicit qrels, and to measure Hit@K, Recall@K, MRR, latency, cost, storage, privacy and operability.

A production baseline for Vietnamese usually should not be dense-only. Start with hybrid retrieval: BM25 for exact keywords, acronyms, error codes and contract numbers; dense embeddings for semantic matching; and a reranker as a later stage if the ordering of the top results needs to improve.

1. What is an embedding?

An embedding represents an object as a numeric vector. In this lesson the object is text: a query, a sentence, a paragraph or a document chunk.

An intuitive example:

"làm sao xuất hóa đơn VAT" -> [0.12, -0.04, 0.87, ...]
"tôi cần hóa đơn công ty"  -> [0.10, -0.01, 0.82, ...]
"đổi mật khẩu tài khoản"  -> [-0.34, 0.63, 0.02, ...]

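The first two sentences ("how do I issue a VAT invoice" and "I need a company invoice") are both about invoices, so their vectors should be close. The third ("change the account password") is about passwords, so it should sit farther away.

In RAG, embeddings appear in two pipelines:

Indexing path:
Document -> parse -> chunk -> embedding -> store vector + metadata in the index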

Query path:
User query -> embedding -> vector search -> top chunks -> rerank/context -> LLM

An embedding does not "understand language" in any absolute sense. It is a mapping learned from training data. It can therefore handle synonyms well yet be weak on exact codes, SKUs, contract numbers, proper names or rare domain terms.

2. Dense vectors, sentence embeddings and vector space

A dense vector is a vector with many dimensions in which most dimensions hold non-zero values. Modern embeddings are typically dense vectors of 384, 768, 1024, 1536 or 3072 dimensions, depending on the model.

A sentence embedding is an embedding for a whole sentence or passage, as opposed to the token embeddings inside a Transformer. For retrieval we usually need sentence- or chunk-level embeddings to compare a query against document chunks.

Key points for a Senior SE:

Concept	Backend analogy
Embedding model	A versioned feature-extraction function
Vector dimension	Physical schema that affects storage/latency
Vector index	A database index optimized for nearest-neighbor search
Similarity metric	The ranking function, like ORDER BY on a score
Reindex	A large migration, costly in money and time
Qrels	Test fixture/golden set for retrieval

3. Cosine similarity, dot product and normalization

Three common similarity measures:

Metric	Meaning	When to use
Cosine similarity	Measures the angle between two vectors	Common for text embeddings
Dot product	Sum of element-wise products	Good when the model requires it or the vectors are already normalized
Euclidean distance	Geometric distance	Use when the vector DB/model card recommends it

Cosine similarity:

cosine(a, b) = dot(a, b) / (||a|| * ||b||)

If every vector has been normalized to unit length, cosine and dot product give the same score, and therefore the same ranking, up to floating-point error:

normalize(a) dot normalize(b) == cosine(a, b)
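
The identity above can be checked numerically. The following is a minimal sketch using only the standard library; the three vectors are the truncated illustrations from section 1 rather than real model output, and the helper names (`dot`, `norm`, `cosine`, `normalize`) are made up for this example.

```python
import math


def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]


# Toy 3-dimensional vectors (illustrative values, not real embeddings).
invoice_vat = [0.12, -0.04, 0.87]
company_invoice = [0.10, -0.01, 0.82]
change_password = [-0.34, 0.63, 0.02]

for name, other in [("company_invoice", company_invoice), ("change_password", change_password)]:
    via_dot = dot(normalize(invoice_vat), normalize(other))
    via_cosine = cosine(invoice_vat, other)
    print(name, round(via_dot, 6), round(via_cosine, 6))
```

Because the two scores agree for unit vectors, an index that stores normalized vectors can rank by dot product without changing the order that cosine would give.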

Production rules:

Read the model card to find out whether the model recommends cosine, dot product or L2.
Do not mix normalized and unnormalized vectors in the same index.
Do not compare raw scores across two different models as if they were confidence scores.
Evaluate with ranking metrics, not just an average similarity score.

4. Embedding model families to know

Family	Examples	Strengths	Weaknesses
Managed API	OpenAI embedding, Cohere embedding	Little operational work, scales quickly, good SLA/provider tooling	Usage-based cost, network/provider dependency, privacy/data residency
Open-source multilingual	BGE-M3, multilingual-E5	Self-hostable, data stays under your control, good for Vietnamese-English mixes	Needs serving, batching, monitoring, GPU/CPU capacity
Vietnamese-specific	Vietnamese bi-encoder models	Can be better on domain-specific Vietnamese	Uneven quality, needs your own benchmark
Domain fine-tuned	Fine-tuned from BGE/E5 with internal qrels	High quality if the data is good	Needs a dataset, training pipeline, regression eval, model governance

No model wins in every setting. Legal docs, support FAQs, SaaS products, banking, e-commerce and developer docs all have different failure modes.

5. How do BGE, E5, OpenAI and Cohere differ?

OpenAI embedding

A good fit when you want to ship fast, reduce ops and avoid hosting a model yourself. It is often a strong production baseline if your data policy allows sending queries/chunks to the provider. Check pricing, rate limits, timeouts, the batch API, data retention and the model version.

Cohere embedding

Commonly used in document retrieval and enterprise search workflows. As with any managed API, operations are lighter than self-hosting, but you trade off cost, privacy and dependency.

BGE

BGE is a popular family of open-source embedding/reranking models. BGE-M3 is notable because it is multilingual and offers dense, sparse and multi-vector retrieval within the same ecosystem. It fits when you need to self-host or want to reduce provider dependency.

E5

E5 is a strong multilingual model family for retrieval. One detail that is easy to get wrong: many E5 models require a prefix:

query: the user's question
passage: the document passage

If you forget the prefix, benchmark scores can come out artificially low. This is an example of why a benchmark must record the full preprocessing config, not just the model name.
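
A minimal sketch of applying the prefix and recording that choice next to the model name. The function name `format_for_e5` and the config keys are assumptions for illustration; check the model card of the specific E5 checkpoint for the exact prefix it expects.

```python
def format_for_e5(texts: list[str], kind: str) -> list[str]:
    """Prepend the E5-style prefix for queries or passages."""
    if kind not in ("query", "passage"):
        raise ValueError(f"kind must be 'query' or 'passage', got {kind!r}")
    prefix = f"{kind}: "
    return [prefix + text for text in texts]


# Hypothetical preprocessing record stored alongside the benchmark results,
# so a later run can tell whether prefixes were applied.
preprocessing_config = {
    "embedding_model": "intfloat/multilingual-e5-base",
    "prefix_strategy": "e5-query-passage",
}

queries = format_for_e5(["làm sao xuất hóa đơn VAT"], kind="query")
passages = format_for_e5(["Để xuất hóa đơn VAT, khách hàng cần cung cấp mã số thuế."], kind="passage")
print(queries[0])
print(passages[0])
```

Keeping the formatting in one function means the indexing path and the query path cannot silently drift apart.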

Vietnamese-specific embedding

Worth testing if the corpus is mostly Vietnamese and contains many compound words, text with and without diacritics, internal policies, legal/financial terminology or colloquial support questions. However, do not assume that a Vietnamese-specific model always beats a multilingual one. Measure on your own qrels.

6. Vietnamese retrieval concerns

Vietnamese has many cases that make dense retrieval wrong or unstable:

Problem group	Example	Risk
With/without diacritics	`hóa đơn` vs `hoa don`	Users type without diacritics but the documents have them
Compound words	`bảo mật tài khoản`, `xác thực hai lớp`	Tokenization and semantic matching are unstable
English mix	`reset password`, `invoice VAT`, `rate limit`	Mixed-language queries
Acronym	`SLA`, `SSO`, `2FA`, `P1`, `VAT`	The dense model may overlook exact matches
Error codes/numbers	`HTTP 429`, `99.9%`, `MST`, `SKU`	Needs an exact match
Synonym	`hoàn tiền`, `hủy gói`, `trả lại tiền`	Dense models help but give no guarantee
Broken OCR/PDF	`hoa don`, `h0a d0n`, missing whitespace	Needs normalization and a good parser
Domain wording	Legal, finance, insurance	A single wrong word can change the meaning

A practical baseline:

hybrid retrieval = dense vector search + BM25 + metadata filter

Only then consider a reranker:

top 50 hybrid results -> reranker -> top 5 context chunks

7. Dimension vs cost vs latency

The larger the vector dimension, the more storage, memory bandwidth, index size and network payload generally grow.

Estimate for raw float32 vectors:

storage_bytes = num_chunks * dimension * 4

1,000,000 chunks * 768 dim  * 4 bytes = ~3.1 GB raw vectors
1,000,000 chunks * 1024 dim * 4 bytes = ~4.1 GB raw vectors
1,000,000 chunks * 1536 dim * 4 bytes = ~6.1 GB raw vectors

This excludes the overhead of the HNSW/IVF index, metadata, replicas, WAL and backups, as well as any savings from compression or quantization.
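
The estimate can be wrapped in a small helper so that different dimensions and value widths are easy to compare. This is a sketch: the function name `raw_vector_bytes` is hypothetical, and the 1-byte case assumes an int8 scalar quantization scheme that your vector DB may or may not offer.

```python
def raw_vector_bytes(num_chunks: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only: no index, metadata, replica or backup overhead."""
    return num_chunks * dimension * bytes_per_value


for dimension in (768, 1024, 1536):
    float32_gb = raw_vector_bytes(1_000_000, dimension) / 1e9
    # Assumption: 1 byte per value, as with int8 scalar quantization.
    int8_gb = raw_vector_bytes(1_000_000, dimension, bytes_per_value=1) / 1e9
    print(f"{dimension} dim: float32 {float32_gb:.2f} GB, int8 {int8_gb:.2f} GB")
```

The figures use decimal gigabytes (1 GB = 10^9 bytes), which matches the rounded numbers in the estimate above.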

Trade-offs:

Choice	Benefit	Cost/risk
Small dimension	Cheaper, faster, smaller index	May reduce recall
Large dimension	May improve quality	More storage, RAM, latency, reindex cost
Managed API	Little ops, fast time-to-market	Cost per request, privacy, rate limits
Self-host	Control over data and over unit cost at large scale	Needs model serving, autoscaling, monitoring
Dense-only	Simple	Weak on acronyms, exact keywords, error codes
Hybrid	More robust for enterprise docs	Needs score merging, weight tuning and running BM25 as well

8. Designing the benchmark properly

Minimum benchmark for this lesson:

20 Vietnamese queries
50-100 document chunks
qrels: query_id -> relevant_chunk_ids
3 embedding models
metrics: Hit@1, Hit@3, Recall@5, MRR@5, latency p50/p95

Each query should carry metadata:

category: billing, security, API, policy, incident...
difficulty: easy, synonym, no-diacritic, English-mix, acronym, exact-number...
expected_behavior: dense should match semantic, BM25 should catch exact code...

Qrels are the list of correct documents/chunks for each query:

{
  "q001": ["refund_policy"],
  "q002": ["invoice_vat"],
  "q003": ["sla_enterprise", "support_priority"]
}

If a query has several correct chunks, Recall@K differs from Hit@K. This is why you should not only measure "did we hit at least one chunk".

9. Metrics to use

Metric	Intuitive formula	Meaning
Hit@K	At least one relevant chunk in the top K	Good for RAG when a single correct source is enough
Recall@K	Relevant chunks retrieved / total relevant chunks	Important when the answer needs several sources
MRR@K	1 / rank of the first relevant chunk (0 if none is in the top K), averaged over queries	Measures how early the correct chunk appears
nDCG@K	Ranking with weighted relevance	Use when relevance is graded 0/1/2/3
p50/p95 latency	Median and tail latency	Checks the SLA
Cost/query	Online embedding cost	Controls unit economics
Storage/1M chunks	Raw vectors + index overhead	Forecasts infra cost

With RAG, a better retrieval metric does not guarantee a better answer, but poor retrieval almost certainly makes the answer worse. The LLM cannot correctly cite a document that was never retrieved.
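
Of the metrics in the table, nDCG@K is the one whose formula is least obvious from a one-line description. A minimal sketch follows, assuming graded labels stored as a dict from chunk id to grade and the linear-gain form of DCG (grade divided by log2(rank + 1)). Some tools use an exponential gain of 2^grade - 1 instead, so confirm which variant your evaluation library reports. The function names and the sample grades are illustrative.

```python
import math


def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids: list[str], grades: dict[str, int], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    gains = [grades.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_gains = sorted(grades.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return 0.0 if ideal_dcg == 0 else dcg_at_k(gains, k) / ideal_dcg


# Hypothetical graded labels: "c" is highly relevant (3), "a" is marginal (1).
grades = {"a": 1, "c": 3}
print(ndcg_at_k(["a", "b", "c"], grades, k=5))  # marginal chunk first, best chunk third
print(ndcg_at_k(["c", "a", "b"], grades, k=5))  # ideal order
```

Unlike Hit@K or MRR@K, this score drops when a highly relevant chunk is ranked below a marginal one, which is the case graded labels are meant to expose.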

10. Production checklist

An embedding setup is production-ready when all of the following hold:

There is an internal eval set of at least 100-500 queries split by category; the 20 queries in this lesson are for learning only.
Qrels are reviewed by someone who understands the domain.
There is a BM25 or hybrid baseline to compare against.
There are dedicated tests for no-diacritic, English-mix, acronym, numeric and synonym queries.
There is version metadata: embedding_model, model_version, dimension, normalization, prefix_strategy, text_normalizer_version, chunking_version, index_version.
Vectors from different models or different dimensions are never mixed in the same collection.
There is a migration plan for model changes: create a new index, backfill, shadow traffic, compare, cutover, rollback.
There is timeout, retry and rate-limit handling, plus a configurable batch size.
There is a privacy review if chunks/queries are sent to a managed provider.
There is monitoring: latency, error rate, empty result rate, score distribution, query category, retrieval feedback.
There is a cost dashboard: indexing cost, query cost, vector DB storage, replicas, backup.

11. Can it be used in production?

Yes. Embedding models are usable in production and are a core component of RAG, but not in the "pick a model and hope" style. You need:

A benchmark on real or near-real data.
Qrels and regression tests for retrieval.
A hybrid baseline, especially for Vietnamese and enterprise docs.
Versioning and a reindex strategy.
Privacy, cost, latency, monitoring and a rollback plan.
A quality threshold per use case, for example Recall@5 >= 0.90 for support FAQ, or higher for high-risk domains such as legal/finance.

Best solution by context:

Context	Recommendation
Small prototype, non-sensitive data	Managed API embedding + managed vector DB, measure quickly
Vietnamese enterprise docs with acronyms/error codes	Hybrid BM25 + dense, reranker if needed
Strict data residency	Self-host BGE/E5/Vietnamese model, private vector DB
Very large corpus, cost-sensitive	Benchmark smaller models, quantization/index tuning, batch indexing
Legal/finance	Hybrid + reranker + strict citation + human-reviewed qrels
High traffic and good enough qrels	Consider fine-tuning the embedding model or distilling it

12. Links to the following days

Day 33 uses the benchmark results to choose the vector DB config and metric.
Day 34: changing the chunking changes the retrieval metrics, so the benchmark must be re-run.
Day 35: metadata and permission filters must run before or together with retrieval to avoid leaking data.
Day 36 extends dense-only into hybrid search.
Day 37 adds reranking when the top K contains many near-misses.
Day 39 turns today's benchmark into a more rigorous retrieval evaluation suite.

Self-check

Why can cosine and dot product produce the same ranking once vectors are normalized?
Why do embeddings not replace BM25 in Vietnamese RAG?
How do qrels differ from a list of demo queries?
When changing the embedding model, why must a new index be created?
With 1M chunks and 1024-dimensional float32 vectors, roughly how much raw vector storage is needed?
What risks need review before using a managed embedding API in production?
Why do no-diacritic queries need their own slice of the benchmark?

Reference material

1. Reference architecture

                 Indexing path
Documents -> Parser -> Chunker -> Text normalizer -> Embedding worker
          -> Vector index + BM25 index + metadata store

                 Query path
User query -> Query normalizer -> Dense embed -> Vector search
           -> BM25 search -> Score fusion -> Permission filter
           -> Optional reranker -> Context builder -> LLM

A common mistake: the permission filter must be designed explicitly. With enterprise documents, do not retrieve chunks the user is not allowed to see and then hope the LLM will not use them. Filtering by tenant, workspace, ACL, classification or document visibility must be part of the retrieval plan.
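
A minimal sketch of the idea, assuming each chunk carries `tenant_id` and `allowed_roles` metadata (hypothetical field names). In a real system this condition would normally be pushed down as a metadata filter in the vector DB and BM25 queries rather than applied in Python; the point is only that unauthorized chunks are removed before anything reaches the context builder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    id: str
    tenant_id: str
    allowed_roles: frozenset  # roles that may read this chunk (hypothetical ACL model)


def visible_chunks(chunks: list[Chunk], tenant_id: str, user_roles: set[str]) -> list[Chunk]:
    """Keep only chunks in the user's tenant that at least one of their roles may read."""
    return [
        chunk
        for chunk in chunks
        if chunk.tenant_id == tenant_id and user_roles & chunk.allowed_roles
    ]


candidates = [
    Chunk("audit_log", "tenant-1", frozenset({"owner", "admin"})),
    Chunk("refund_policy", "tenant-1", frozenset({"owner", "admin", "member"})),
    Chunk("other_tenant_doc", "tenant-2", frozenset({"member"})),
]

allowed = visible_chunks(candidates, tenant_id="tenant-1", user_roles={"member"})
print([chunk.id for chunk in allowed])
```

Filtering after retrieval, as done here, can leave fewer than K results; pre-filtering inside the index avoids that, which is one reason to make the filter part of the retrieval plan.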

2. Text normalization for Vietnamese

Normalization should be just enough and must not destroy important information.

Do:

Normalize Unicode to NFC or NFKC, using the same form throughout the pipeline.
Trim whitespace and collapse runs of spaces.
Lowercase a secondary BM25 field where appropriate.
Add a diacritic-free field for sparse search or query expansion.
Keep the original field for display and citation.
Keep error codes, contract numbers, product codes, SKUs and symbols such as %, +, #, / when they carry meaning.

Do not do blindly:

Strip all punctuation from legal/finance docs.
Remove numbers because they are "not semantic".
Strip Vietnamese diacritics from the source document and index only the diacritic-free version.
Apply different normalizers at indexing time and query time without versioning them.
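
The rules above can be sketched with the standard library alone. The function and field names (`normalize_text`, `strip_diacritics`, `build_index_fields`, `text_folded`) are assumptions for illustration. Diacritics are removed by decomposing to NFD and dropping combining marks; the letter đ has no decomposition, so it is mapped by hand.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """NFC-normalize, collapse whitespace runs and trim. Digits and symbols are kept."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese tone and vowel marks, e.g. for a secondary sparse-search field."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.replace("đ", "d").replace("Đ", "D")


def build_index_fields(raw_text: str) -> dict[str, str]:
    """Keep the original text for display/citation and add a folded field next to it."""
    original = normalize_text(raw_text)
    return {
        "text": original,
        "text_folded": strip_diacritics(original).lower(),
    }


fields = build_index_fields("  Xuất   hóa đơn VAT, lỗi HTTP 429  ")
print(fields["text"])
print(fields["text_folded"])
```

The same `build_index_fields` logic has to run on the query side too, otherwise a no-diacritic query is compared against a differently normalized field.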

Metadata to store:

{
  "text_normalizer_version": "vn-normalizer-2026-05-10",
  "source_text_checksum": "sha256:...",
  "indexed_text_checksum": "sha256:..."
}

3. Qrels schema

Qrels are the ground truth for retrieval. For each query you need to know which chunks are correct.

Suggested schema:

{
  "query_id": "q001",
  "query": "toi muon hoan tien goi Pro",
  "category": "billing",
  "difficulty": ["no-diacritic", "synonym"],
  "relevant_chunk_ids": ["refund_policy"],
  "notes": "User types without diacritics; the documents have diacritics."
}

For production, qrels should include:

The reviewer or source that created the label.
Last-updated date.
Domain/category.
Difficulty.
Expected citation, if used for RAG answer eval.
Negative notes: which chunks look similar but are not correct enough.

4. Metrics definition

For a single query:

ranked = ["a", "b", "c", "d", "e"]
relevant = {"c", "e"}

Hit@3:

top3 = {"a", "b", "c"}
hit@3 = 1 because "c" is present

Recall@5:

top5 retrieves {"c", "e"} out of 2 relevant chunks
recall@5 = 2 / 2 = 1.0

MRR@5:

the first relevant chunk is "c" at rank 3
mrr@5 = 1 / 3 = 0.333
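
The three worked values can be reproduced with a few lines of Python. This is a sketch for a single query with binary relevance; the function names are illustrative, and a full benchmark would average each metric over all queries.

```python
def hit_at_k(ranked: list[str], relevant: set[str], k: int) -> int:
    """1 if any relevant id appears in the top k, else 0."""
    return int(any(doc_id in relevant for doc_id in ranked[:k]))


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant ids that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Reciprocal rank of the first relevant id in the top k, or 0.0 if there is none."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


ranked = ["a", "b", "c", "d", "e"]
relevant = {"c", "e"}

print("hit@3   ", hit_at_k(ranked, relevant, 3))
print("recall@5", recall_at_k(ranked, relevant, 5))
print("mrr@5   ", round(mrr_at_k(ranked, relevant, 5), 3))
```

Note that Recall@3 for the same query would be 0.5 while Hit@3 is still 1, which is the difference between the two metrics when a query has more than one relevant chunk.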

A report should not contain aggregates only. It must include failure cases:

query_id	query	difficulty	expected	top_5	issue
q007	API tra ve 429 nghia la gi	acronym/exact-code	api_rate_limit	password_reset,...	Dense does not prioritize the error code

5. Hybrid baseline

Dense search is strong at semantic similarity. BM25 is strong at exact lexical matching. Vietnamese production systems usually need both.

A simple score fusion:

dense_rank_score = 1 / (k + dense_rank)
bm25_rank_score = 1 / (k + bm25_rank)
final_score = alpha * dense_rank_score + (1 - alpha) * bm25_rank_score

This is a weighted variant of Reciprocal Rank Fusion. Start with:

k = 60
alpha = 0.5

Then tune against the qrels. Do not tune on the final test set; split dev/test if the eval set is large enough.
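
A minimal sketch of the weighted fusion, taking two ranked lists of chunk ids as input. The function name `weighted_rrf` is illustrative, and one detail is an assumption not stated in the formula: a chunk that is missing from one list simply gets no contribution from that list.

```python
def weighted_rrf(
    dense_ranked: list[str],
    bm25_ranked: list[str],
    k: int = 60,
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Fuse two rankings; returns (chunk_id, final_score) sorted best first."""
    scores: dict[str, float] = {}
    for rank, doc_id in enumerate(dense_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank)
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank)
    # Sort by score descending; break ties by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings for a query about HTTP 429: dense search puts the wrong
# chunk first, while BM25 matches the literal error code.
dense = ["password_reset", "api_rate_limit", "webhook_retry"]
bm25 = ["api_rate_limit", "webhook_retry"]

for doc_id, score in weighted_rrf(dense, bm25):
    print(f"{doc_id}: {score:.5f}")
```

Because only ranks are used, the raw dense and BM25 scores never need to be put on a common scale, which is the main reason rank-based fusion is a convenient first baseline.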

6. Versioning and migration

Do not treat the embedding model as a minor config. Changing the model changes the retrieval schema.

Required metadata:

{
  "embedding_model": "intfloat/multilingual-e5-large",
  "embedding_model_revision": "pinned-revision-or-provider-version",
  "dimension": 1024,
  "similarity_metric": "cosine",
  "normalized": true,
  "prefix_strategy": "e5-query-passage",
  "chunking_version": "chunk-v3",
  "index_version": "kb-embedding-2026-05-10"
}

Migration plan:

Create a new collection/index.
Backfill embeddings with the new model.
Run the offline benchmark on the same qrels.
Run shadow traffic if query logs are available.
Compare retrieval quality, latency, cost and error rate.
Cut over behind a feature flag.
Keep the old index long enough to roll back.

Do not:

Upsert new vectors into the old collection when the dimension/model differs.
Delete the old index before a regression report exists.
Test only a few nice-looking queries in a notebook and then deploy.
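
One way to enforce the first "do not" rule is a guard that compares the writer's metadata with the metadata recorded for the target collection before any upsert. The sketch below is an assumption about how such a guard could look; `MUST_MATCH`, `incompatible_fields` and `assert_can_upsert` are hypothetical names, and the field list reuses the keys from the required metadata above.

```python
# Fields that must be identical for vectors to be comparable inside one collection.
MUST_MATCH = (
    "embedding_model",
    "embedding_model_revision",
    "dimension",
    "similarity_metric",
    "normalized",
    "prefix_strategy",
)


def incompatible_fields(index_meta: dict, writer_meta: dict) -> list[str]:
    """Return the names of fields whose values differ between index and writer."""
    return [name for name in MUST_MATCH if index_meta.get(name) != writer_meta.get(name)]


def assert_can_upsert(index_meta: dict, writer_meta: dict) -> None:
    """Raise instead of silently mixing vectors from different embedding setups."""
    mismatched = incompatible_fields(index_meta, writer_meta)
    if mismatched:
        raise ValueError(f"refusing to upsert, create a new index instead: {mismatched}")


index_meta = {
    "embedding_model": "intfloat/multilingual-e5-large",
    "embedding_model_revision": "pinned-revision-or-provider-version",
    "dimension": 1024,
    "similarity_metric": "cosine",
    "normalized": True,
    "prefix_strategy": "e5-query-passage",
}
# A writer that was switched to a different model without creating a new index.
writer_meta = {**index_meta, "embedding_model": "intfloat/multilingual-e5-base", "dimension": 768}

print(incompatible_fields(index_meta, writer_meta))
```

Running the guard in the indexing worker turns an accidental model change into a loud failure instead of a slow, hard-to-diagnose drop in retrieval quality.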

7. Privacy and compliance

If you use a managed embedding API, answer these questions:

Do queries/chunks contain PII, secrets, contracts or medical/legal/finance data?
What is the provider's data retention policy?
Is the data used for training?
In which region is the data processed?
Is a DPA, BAA or enterprise agreement required?
Do your own internal logs store raw queries/chunks?

Risk mitigations:

Redact PII before sending, if the business allows it.
Do not log raw sensitive text at info level.
Separate tenants and ACLs in metadata.
Encrypt backups.
Self-host if data residency or policy forbids external APIs.

8. Latency and cost model

Online path:

query embedding latency + vector search latency + BM25 latency + rerank latency + LLM latency

Query embedding is usually only one part of the total latency, but p95 rises sharply if:

The provider throttles or the network is slow.
The self-hosted model does not batch well.
The model is too large for the CPU/GPU.
The query path calls the embedding model several times because of query rewriting/multi-query.

Indexing path:

num_chunks * embedding_cost_per_chunk + vector_db_upsert + index build time

Costs to account for separately:

Initial backfill.
Incremental updates.
Reindexing when the model/chunking changes.
Query embedding.
Vector DB storage/replicas/backups.
GPU/CPU serving if self-hosting.

9. Report template

# Embedding Benchmark Report

## Dataset

- Corpus: <number of documents>, <number of chunks>, domain <...>
- Queries: <number of queries>, categories <...>
- Qrels reviewer: <person/team>
- Run date: <yyyy-mm-dd>

## Config

| Model | Dimension | Metric | Normalize | Prefix | Serving | Notes |
|---|---:|---|---|---|---|---|
| model-a | 1024 | cosine | yes | query/passage | local GPU | ... |

## Metrics

| Model | Hit@1 | Hit@3 | Recall@5 | MRR@5 | p50 embed ms | p95 embed ms | Storage/1M chunks |
|---|---:|---:|---:|---:|---:|---:|---:|
| model-a | 0.70 | 0.85 | 0.90 | 0.76 | 45 | 120 | 4.1 GB |

## Failure Analysis

| Model | Query | Difficulty | Expected | Top 5 | Finding |
|---|---|---|---|---|---|
| model-a | ... | no-diacritic | ... | ... | ... |

## Decision

- Selected model: <model>
- Reason: <quality/cost/latency/privacy>
- Production conditions: <hybrid/reranker/versioning/monitoring>
- Rollback plan: <old index/version>

10. Review checklist

Exercise

Goal

You will write a small benchmark whose structure is close to production:

A Vietnamese corpus.
20 queries.
Qrels.
An adapter for multiple embedding models.
Data validation before the run.
Hit@1, Hit@3, Recall@5, MRR@5.
Latency p50/p95.
A storage estimate.
A report and failure analysis.

1. Setup

python -m venv .venv
source .venv/bin/activate
pip install "pydantic>=2" sentence-transformers numpy pandas tabulate

On a weak machine, start with a smaller model. If you have a GPU, first install the PyTorch build that matches the CUDA version of your environment.

2. Choose 3 models

Suggestions for this lesson:

MODELS = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-base",
    "BAAI/bge-m3",
]

To try a Vietnamese-specific model, replace one entry with a Vietnamese bi-encoder model you trust. To try a managed API such as OpenAI or Cohere, keep the same interface and do not hardcode the API key in the script.

3. Benchmark script

Create a scratch file, for example benchmark_embeddings_day32.py, with the following content.

from __future__ import annotations

import argparse
import json
import math
import time
import unicodedata
from dataclasses import dataclass
from pathlib import Path

import numpy as np
import pandas as pd
from pydantic import BaseModel, Field, ValidationError, field_validator
from sentence_transformers import SentenceTransformer


TOP_K = 5


class DocumentChunk(BaseModel):
    id: str
    title: str
    text: str
    category: str

    @field_validator("id", "title", "text", "category")
    @classmethod
    def not_blank(cls, value: str) -> str:
        value = value.strip()
        if not value:
            raise ValueError("must not be blank")
        return value


class QueryCase(BaseModel):
    id: str
    query: str
    category: str
    difficulty: list[str] = Field(default_factory=list)
    relevant_chunk_ids: list[str]

    @field_validator("id", "query", "category")
    @classmethod
    def not_blank(cls, value: str) -> str:
        value = value.strip()
        if not value:
            raise ValueError("must not be blank")
        return value

    @field_validator("relevant_chunk_ids")
    @classmethod
    def has_relevance(cls, value: list[str]) -> list[str]:
        if not value:
            raise ValueError("query must have at least one relevant chunk")
        return value


DOCS = [
    {"id": "refund_policy", "title": "Chính sách hoàn tiền", "category": "billing", "text": "Khách hàng có thể yêu cầu hoàn tiền trong 7 ngày sau khi mua gói Pro nếu chưa sử dụng quá 20% quota. Yêu cầu hoàn tiền được xử lý qua cổng thanh toán trong 5 đến 10 ngày làm việc."},
    {"id": "invoice_vat", "title": "Xuất hóa đơn VAT", "category": "billing", "text": "Để xuất hóa đơn VAT, khách hàng cần cung cấp tên công ty, mã số thuế, địa chỉ đăng ký kinh doanh và email nhận hóa đơn. Thông tin phải được gửi trong vòng 30 ngày kể từ ngày thanh toán."},
    {"id": "sla_enterprise", "title": "SLA Enterprise", "category": "support", "text": "Gói Enterprise có SLA uptime 99.9% theo tháng. Sự cố P1 được phản hồi trong 2 giờ làm việc và được ưu tiên xử lý bởi nhóm hỗ trợ kỹ thuật."},
    {"id": "password_reset", "title": "Reset mật khẩu", "category": "security", "text": "Người dùng có thể reset mật khẩu bằng email đã đăng ký. Link reset hết hạn sau 30 phút và chỉ dùng được một lần."},
    {"id": "security_2fa", "title": "Xác thực hai lớp", "category": "security", "text": "Tài khoản admin bắt buộc bật xác thực hai lớp 2FA bằng ứng dụng authenticator. Recovery code phải được lưu ở nơi an toàn."},
    {"id": "api_rate_limit", "title": "Rate limit API", "category": "api", "text": "API public giới hạn 600 request mỗi phút cho mỗi API key. Khi vượt giới hạn, hệ thống trả về HTTP 429 kèm header Retry-After."},
    {"id": "sso_saml", "title": "SSO SAML", "category": "security", "text": "Khách hàng Enterprise có thể cấu hình SSO qua SAML 2.0. Metadata XML từ Identity Provider cần được upload trong trang quản trị."},
    {"id": "data_retention", "title": "Lưu trữ dữ liệu", "category": "privacy", "text": "Dữ liệu log ứng dụng được lưu trong 90 ngày. Bản sao lưu cơ sở dữ liệu được mã hóa và giữ trong 30 ngày trước khi xóa tự động."},
    {"id": "delete_account", "title": "Xóa tài khoản", "category": "privacy", "text": "Người dùng có thể yêu cầu xóa tài khoản và dữ liệu cá nhân. Quy trình xóa hoàn tất trong tối đa 15 ngày làm việc sau khi xác minh danh tính."},
    {"id": "webhook_retry", "title": "Retry webhook", "category": "api", "text": "Webhook thất bại sẽ được retry tối đa 5 lần với exponential backoff. Endpoint nhận webhook phải trả về HTTP 2xx trong 10 giây."},
    {"id": "pricing_seat", "title": "Tính phí theo seat", "category": "billing", "text": "Gói Team tính phí theo số lượng active seat trong chu kỳ thanh toán. Seat bị xóa giữa kỳ sẽ được prorate vào hóa đơn tiếp theo."},
    {"id": "trial_limit", "title": "Giới hạn dùng thử", "category": "billing", "text": "Tài khoản dùng thử có thời hạn 14 ngày và bị giới hạn 1.000 request API. Sau khi hết hạn trial, người dùng cần nâng cấp để tiếp tục sử dụng."},
    {"id": "audit_log", "title": "Audit log", "category": "security", "text": "Audit log ghi lại hành động đăng nhập, thay đổi quyền, tạo API key và cập nhật cấu hình bảo mật. Chỉ owner và admin được xem audit log."},
    {"id": "permission_roles", "title": "Vai trò và quyền", "category": "security", "text": "Hệ thống có ba vai trò mặc định: owner, admin và member. Owner có thể quản lý billing, admin quản lý cấu hình, member chỉ dùng tính năng được cấp quyền."},
    {"id": "model_region", "title": "Vùng xử lý model", "category": "privacy", "text": "Dữ liệu inference mặc định được xử lý tại vùng Singapore. Khách hàng Enterprise có thể yêu cầu cấu hình region riêng theo hợp đồng."},
    {"id": "file_upload_limit", "title": "Giới hạn upload", "category": "product", "text": "Mỗi file upload không được vượt quá 50 MB. Định dạng hỗ trợ gồm PDF, DOCX, TXT và CSV."},
    {"id": "ocr_quality", "title": "Chất lượng OCR", "category": "product", "text": "Tài liệu scan chất lượng thấp có thể làm OCR sai dấu tiếng Việt hoặc mất khoảng trắng. Nên kiểm tra preview trước khi đưa vào knowledge base."},
    {"id": "incident_status", "title": "Trang trạng thái sự cố", "category": "support", "text": "Khi có incident diện rộng, trạng thái hệ thống được cập nhật tại status page. Khách hàng có thể đăng ký email để nhận thông báo sự cố."},
    {"id": "api_key_rotation", "title": "Rotate API key", "category": "security", "text": "API key nên được rotate định kỳ 90 ngày một lần. Khi tạo key mới, hãy cập nhật ứng dụng trước khi thu hồi key cũ để tránh gián đoạn."},
    {"id": "export_data", "title": "Export dữ liệu", "category": "product", "text": "Người dùng có thể export dữ liệu dự án sang CSV hoặc JSON. File export được tạo bất đồng bộ và link tải xuống hết hạn sau 24 giờ."},
    {"id": "support_channels", "title": "Kênh hỗ trợ", "category": "support", "text": "Gói Free chỉ hỗ trợ qua community forum. Gói Pro hỗ trợ qua email, còn Enterprise có thêm Slack Connect và technical account manager."},
    {"id": "payment_failed", "title": "Thanh toán thất bại", "category": "billing", "text": "Nếu thanh toán thất bại, hệ thống sẽ thử lại trong 3 ngày liên tiếp. Sau 7 ngày chưa thanh toán, workspace bị chuyển sang trạng thái read-only."},
    {"id": "quota_overage", "title": "Vượt quota", "category": "billing", "text": "Khi vượt quota tháng, request mới có thể bị từ chối hoặc tính phí overage tùy cấu hình gói. Owner sẽ nhận email cảnh báo khi dùng quá 80% quota."},
    {"id": "ip_allowlist", "title": "IP allowlist", "category": "security", "text": "Enterprise admin có thể cấu hình IP allowlist để chỉ cho phép truy cập từ dải IP công ty. Thay đổi allowlist có hiệu lực sau vài phút."},
]


QUERIES = [
    {"id": "q001", "query": "tôi muốn hoàn tiền gói Pro", "category": "billing", "difficulty": ["synonym"], "relevant_chunk_ids": ["refund_policy"]},
    {"id": "q002", "query": "lam sao xuat hoa don VAT cho cong ty", "category": "billing", "difficulty": ["no-diacritic", "acronym"], "relevant_chunk_ids": ["invoice_vat"]},
    {"id": "q003", "query": "SLA của gói enterprise là bao nhiêu", "category": "support", "difficulty": ["acronym", "english-mix"], "relevant_chunk_ids": ["sla_enterprise"]},
    {"id": "q004", "query": "bật xác thực 2 lớp cho admin", "category": "security", "difficulty": ["synonym", "number"], "relevant_chunk_ids": ["security_2fa"]},
    {"id": "q005", "query": "API trả về 429 nghĩa là gì", "category": "api", "difficulty": ["exact-code"], "relevant_chunk_ids": ["api_rate_limit"]},
    {"id": "q006", "query": "quên mật khẩu thì reset như thế nào", "category": "security", "difficulty": ["english-mix"], "relevant_chunk_ids": ["password_reset"]},
    {"id": "q007", "query": "cau hinh SSO bang SAML 2.0", "category": "security", "difficulty": ["no-diacritic", "acronym"], "relevant_chunk_ids": ["sso_saml"]},
    {"id": "q008", "query": "log ứng dụng được giữ trong bao lâu", "category": "privacy", "difficulty": ["retention"], "relevant_chunk_ids": ["data_retention"]},
    {"id": "q009", "query": "xóa dữ liệu cá nhân mất mấy ngày", "category": "privacy", "difficulty": ["synonym"], "relevant_chunk_ids": ["delete_account"]},
    {"id": "q010", "query": "webhook fail co retry khong", "category": "api", "difficulty": ["no-diacritic", "english-mix"], "relevant_chunk_ids": ["webhook_retry"]},
    {"id": "q011", "query": "seat bị xóa giữa kỳ có được tính lại tiền không", "category": "billing", "difficulty": ["billing-term"], "relevant_chunk_ids": ["pricing_seat"]},
    {"id": "q012", "query": "trial được gọi bao nhiêu request API", "category": "billing", "difficulty": ["english-mix"], "relevant_chunk_ids": ["trial_limit"]},
    {"id": "q013", "query": "ai được xem audit log", "category": "security", "difficulty": ["english-mix"], "relevant_chunk_ids": ["audit_log"]},
    {"id": "q014", "query": "owner admin member khác nhau thế nào", "category": "security", "difficulty": ["role"], "relevant_chunk_ids": ["permission_roles"]},
    {"id": "q015", "query": "du lieu inference xu ly o region nao", "category": "privacy", "difficulty": ["no-diacritic", "english-mix"], "relevant_chunk_ids": ["model_region"]},
    {"id": "q016", "query": "upload file PDF tối đa bao nhiêu MB", "category": "product", "difficulty": ["exact-number"], "relevant_chunk_ids": ["file_upload_limit"]},
    {"id": "q017", "query": "OCR sai dấu tiếng Việt thì cần chú ý gì", "category": "product", "difficulty": ["ocr", "vietnamese"], "relevant_chunk_ids": ["ocr_quality"]},
    {"id": "q018", "query": "xem tình trạng incident ở đâu", "category": "support", "difficulty": ["english-mix", "synonym"], "relevant_chunk_ids": ["incident_status"]},
    {"id": "q019", "query": "bao lâu nên rotate API key", "category": "security", "difficulty": ["english-mix"], "relevant_chunk_ids": ["api_key_rotation"]},
    {"id": "q020", "query": "export dữ liệu sang csv json", "category": "product", "difficulty": ["acronym", "english-mix"], "relevant_chunk_ids": ["export_data"]},
]


@dataclass(frozen=True)
class ModelConfig:
    name: str
    batch_size: int = 16
    normalize: bool = True

    @property
    def uses_e5_prefix(self) -> bool:
        return "e5" in self.name.lower()

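    @property
    def uses_bge_instruction(self) -> bool:
        # The query instruction in _format targets older English BGE releases.
        # BGE-M3 is documented as not needing one, so it is excluded here;
        # check the model card of any other BGE checkpoint you add.
        name = self.name.lower()
        return "bge" in name and "m3" not in name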


class SentenceTransformerEmbedder:
    def __init__(self, config: ModelConfig) -> None:
        self.config = config
        self.model = SentenceTransformer(config.name)
        self.dimension = int(self.model.get_sentence_embedding_dimension())

    def _format(self, texts: list[str], kind: str) -> list[str]:
        if self.config.uses_e5_prefix:
            prefix = "query: " if kind == "query" else "passage: "
            return [prefix + text for text in texts]
        if self.config.uses_bge_instruction and kind == "query":
            instruction = "Represent this sentence for searching relevant passages: "
            return [instruction + text for text in texts]
        return texts

    def encode(self, texts: list[str], kind: str) -> np.ndarray:
        formatted = self._format(texts, kind)
        vectors = self.model.encode(
            formatted,
            batch_size=self.config.batch_size,
            convert_to_numpy=True,
            show_progress_bar=False,
        )
        vectors = np.asarray(vectors, dtype=np.float32)
        if self.config.normalize:
            vectors = normalize_rows(vectors)
        return vectors


def normalize_unicode(text: str) -> str:
    return unicodedata.normalize("NFC", text).strip()


def normalize_rows(vectors: np.ndarray) -> np.ndarray:
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def percentile(values: list[float], p: float) -> float:
    if not values:
        return 0.0
    sorted_values = sorted(values)
    index = (len(sorted_values) - 1) * p
    lower = math.floor(index)
    upper = math.ceil(index)
    if lower == upper:
        return sorted_values[int(index)]
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * (index - lower)


def load_dataset() -> tuple[list[DocumentChunk], list[QueryCase]]:
    try:
        docs = [DocumentChunk(**item) for item in DOCS]
        queries = [QueryCase(**item) for item in QUERIES]
    except ValidationError as exc:
        raise SystemExit(f"Dataset validation failed:\n{exc}") from exc

    doc_ids = [doc.id for doc in docs]
    duplicate_doc_ids = {doc_id for doc_id in doc_ids if doc_ids.count(doc_id) > 1}
    if duplicate_doc_ids:
        raise SystemExit(f"Duplicate doc ids: {sorted(duplicate_doc_ids)}")

    known_doc_ids = set(doc_ids)
    for query in queries:
        missing = set(query.relevant_chunk_ids) - known_doc_ids
        if missing:
            raise SystemExit(f"{query.id} references unknown chunks: {sorted(missing)}")

    if len(queries) < 20:
        raise SystemExit("Benchmark must contain at least 20 queries")

    return docs, queries


def rank_documents(
    embedder: SentenceTransformerEmbedder,
    docs: list[DocumentChunk],
    queries: list[QueryCase],
    top_k: int,
) -> tuple[pd.DataFrame, dict[str, float]]:
    doc_texts = [normalize_unicode(f"{doc.title}\n{doc.text}") for doc in docs]
    doc_vectors = embedder.encode(doc_texts, kind="passage")

    rows: list[dict[str, object]] = []
    latencies_ms: list[float] = []

    for query in queries:
        query_text = normalize_unicode(query.query)
        start = time.perf_counter()
        query_vector = embedder.encode([query_text], kind="query")
        latencies_ms.append((time.perf_counter() - start) * 1000)

        # The dot product equals cosine similarity only when rows are L2-normalized (config.normalize).
        scores = (query_vector @ doc_vectors.T)[0]
        order = np.argsort(-scores)[:top_k]
        ranked_ids = [docs[index].id for index in order]
        ranked_scores = [float(scores[index]) for index in order]
        relevant = set(query.relevant_chunk_ids)

        # MRR@5 only looks at the first five ranks, even when top_k is larger than 5.
        first_relevant_rank = 0
        for rank, doc_id in enumerate(ranked_ids[:5], start=1):
            if doc_id in relevant:
                first_relevant_rank = rank
                break

        rows.append(
            {
                "query_id": query.id,
                "query": query.query,
                "category": query.category,
                "difficulty": ",".join(query.difficulty),
                "expected": ",".join(query.relevant_chunk_ids),
                "top_ids": ranked_ids,
                "top_scores": ranked_scores,
                "hit@1": int(ranked_ids[0] in relevant),
                "hit@3": int(bool(set(ranked_ids[:3]) & relevant)),
                "recall@5": len(set(ranked_ids[:5]) & relevant) / len(relevant),
                "mrr@5": 0.0 if first_relevant_rank == 0 else 1.0 / first_relevant_rank,
            }
        )

    detail = pd.DataFrame(rows)
    summary = {
        "hit@1": float(detail["hit@1"].mean()),
        "hit@3": float(detail["hit@3"].mean()),
        "recall@5": float(detail["recall@5"].mean()),
        "mrr@5": float(detail["mrr@5"].mean()),
        "p50_query_embed_ms": percentile(latencies_ms, 0.50),
        "p95_query_embed_ms": percentile(latencies_ms, 0.95),
        "dimension": float(embedder.dimension),
        "storage_1m_float32_gb": estimate_storage_gb(1_000_000, embedder.dimension),
    }
    return detail, summary


def estimate_storage_gb(num_chunks: int, dimension: int) -> float:
    """Raw float32 vector size in decimal GB (4 bytes per value), excluding index overhead."""
    return num_chunks * dimension * 4 / 1_000_000_000


def write_report(output_dir: Path, model_name: str, detail: pd.DataFrame, summary: dict[str, float]) -> None:
    safe_name = model_name.replace("/", "__")
    output_dir.mkdir(parents=True, exist_ok=True)

    detail_path = output_dir / f"{safe_name}.details.csv"
    summary_path = output_dir / f"{safe_name}.summary.json"
    failures_path = output_dir / f"{safe_name}.failures.csv"

    detail.to_csv(detail_path, index=False)
    summary_path.write_text(json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8")

    failures = detail[(detail["hit@3"] == 0) | (detail["recall@5"] < 1.0)].copy()
    failures.to_csv(failures_path, index=False)


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--models",
        nargs="+",
        default=[
            "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
            "intfloat/multilingual-e5-base",
            "BAAI/bge-m3",
        ],
    )
    parser.add_argument("--output-dir", default="day32_embedding_benchmark_report")
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    docs, queries = load_dataset()
    output_dir = Path(args.output_dir)
    summaries: list[dict[str, object]] = []

    for model_name in args.models:
        print(f"Running benchmark for {model_name}")
        config = ModelConfig(name=model_name)
        embedder = SentenceTransformerEmbedder(config)
        detail, summary = rank_documents(embedder, docs, queries, top_k=TOP_K)
        write_report(output_dir, model_name, detail, summary)
        summaries.append({"model": model_name, **summary})

    summary_df = pd.DataFrame(summaries).sort_values(["mrr@5", "recall@5", "hit@1"], ascending=False)
    summary_df.to_csv(output_dir / "summary.csv", index=False)
    # DataFrame.to_markdown needs the optional "tabulate" package to be installed.
    print(summary_df.to_markdown(index=False, floatfmt=".4f"))


if __name__ == "__main__":
    main()
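
As a sanity check on the metric logic in rank_documents, the sketch below recomputes Hit@K, Recall@K and MRR@K for one toy ranking. The helper name `ranking_metrics` and the chunk ids are made up for illustration and are not part of the benchmark script.

```python
def ranking_metrics(ranked_ids: list[str], relevant: set[str], k: int) -> dict[str, float]:
    """Compute Hit@k, Recall@k and MRR@k for a single query (illustrative helper)."""
    top = ranked_ids[:k]
    found = {doc_id for doc_id in top if doc_id in relevant}

    # Reciprocal rank of the first relevant id inside the top k, else 0.
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(top, start=1):
        if doc_id in relevant:
            reciprocal_rank = 1.0 / rank
            break

    return {
        "hit": float(bool(found)),
        "recall": len(found) / len(relevant) if relevant else 0.0,
        "mrr": reciprocal_rank,
    }


# Hypothetical query with two relevant chunks: one is ranked second, the other is missed.
ranked = ["billing_refund", "invoice_vat", "sso_setup", "api_rate_limit", "sla_uptime"]
relevant = {"invoice_vat", "invoice_company"}
metrics = ranking_metrics(ranked, relevant, k=5)
print(metrics)
```

Here the first relevant chunk sits at rank 2, so the reciprocal rank is 1/2, and only one of the two relevant chunks is retrieved, so recall is also 1/2.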

4. Run the benchmark

python benchmark_embeddings_day32.py

Run with an explicit list of models:

python benchmark_embeddings_day32.py \
  --models sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 intfloat/multilingual-e5-base BAAI/bge-m3

Output:

day32_embedding_benchmark_report/
  summary.csv
  sentence-transformers__paraphrase-multilingual-MiniLM-L12-v2.details.csv
  sentence-transformers__paraphrase-multilingual-MiniLM-L12-v2.summary.json
  sentence-transformers__paraphrase-multilingual-MiniLM-L12-v2.failures.csv
  ...

5. Analyze the results

Fill in the report using this template:

# Day 32 Embedding Benchmark Report

## Dataset

- Number of chunks: 24
- Number of queries: 20
- Domain: SaaS support/product/security/billing/privacy
- Languages: Vietnamese with diacritics, Vietnamese without diacritics, Vietnamese-English mix

## Metrics

| Model | Hit@1 | Hit@3 | Recall@5 | MRR@5 | p50 ms | p95 ms | Dim | Storage/1M |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| ... | ... | ... | ... | ... | ... | ... | ... | ... |

## Failure cases

| Model | Query | Difficulty | Expected | Top 5 | Notes |
|---|---|---|---|---|---|
| ... | API trả về 429 nghĩa là gì | exact-code | api_rate_limit | ... | Dense model does not prioritize the error code |

## Decision

- Model chosen for the RAG project:
- BM25 used (yes/no):
- Reranker needed (yes/no):
- Conditions for production:
- Remaining risks:
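
To fill in the failure-case table, it can help to average the per-query metrics by difficulty tag. The sketch below assumes rows shaped like what `csv.DictReader` would yield from a `*.details.csv` file, where `difficulty` is a comma-joined string. The helper name `breakdown_by_difficulty` and the tags other than `exact-code` are hypothetical.

```python
from collections import defaultdict


def breakdown_by_difficulty(rows: list[dict[str, str]]) -> dict[str, dict[str, float]]:
    """Average hit@3 and mrr@5 per difficulty tag.

    A query tagged with several difficulties counts toward each of its tags.
    """
    buckets: dict[str, list[tuple[float, float]]] = defaultdict(list)
    for row in rows:
        for tag in row["difficulty"].split(","):
            tag = tag.strip()
            if tag:
                buckets[tag].append((float(row["hit@3"]), float(row["mrr@5"])))

    return {
        tag: {
            "queries": float(len(values)),
            "hit@3": sum(hit for hit, _ in values) / len(values),
            "mrr@5": sum(mrr for _, mrr in values) / len(values),
        }
        for tag, values in sorted(buckets.items())
    }


# Made-up rows; real ones would come from csv.DictReader over a details file.
rows = [
    {"difficulty": "exact-code", "hit@3": "0", "mrr@5": "0.2"},
    {"difficulty": "no-diacritics,acronym", "hit@3": "1", "mrr@5": "1.0"},
    {"difficulty": "no-diacritics", "hit@3": "1", "mrr@5": "0.5"},
]
breakdown = breakdown_by_difficulty(rows)
for tag, stats in breakdown.items():
    print(tag, stats)
```

With only 20 queries, each tag covers a handful of cases, so these averages are a pointer to which failure rows to read, not a statistically reliable comparison.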

6. Required questions

Answer briefly after running the benchmark:

Which model has the highest MRR@5?
Which model has the best (lowest) p95 latency?
Do queries without diacritics reduce retrieval quality?
Do acronym and error-code queries such as 429, SLA, SSO, VAT need BM25?
If the corpus has 1M chunks, which model requires the most vector storage?
Would you trust the model that wins this benchmark in production? If so, under what conditions?
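
For the storage question, the arithmetic is chunks x dimension x 4 bytes for float32. The sketch below reuses the formula from estimate_storage_gb. The dimensions are assumed values for the three default models, so confirm them against the "dimension" field in each summary.json before quoting numbers.

```python
def estimate_storage_gb(num_chunks: int, dimension: int) -> float:
    """Raw float32 vector size in decimal GB, excluding index and metadata overhead."""
    return num_chunks * dimension * 4 / 1_000_000_000


# Assumed output dimensions; verify against the benchmark's own summary files.
dimensions = {
    "paraphrase-multilingual-MiniLM-L12-v2": 384,
    "multilingual-e5-base": 768,
    "bge-m3": 1024,
}

storage = {name: estimate_storage_gb(1_000_000, dim) for name, dim in dimensions.items()}
for name, gigabytes in storage.items():
    print(f"{name}: {gigabytes:.3f} GB per 1M chunks")

largest = max(storage, key=storage.get)
print("Largest footprint:", largest)
```

Storage grows linearly with dimension, so a 1024-dimensional model needs roughly 2.7 times the raw vector space of a 384-dimensional one for the same corpus.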

7. Extending toward production

After finishing the dense-only version, add the following steps:

Add BM25 using rank-bm25 or an existing search engine.
Implement Reciprocal Rank Fusion to merge the dense ranking and the BM25 ranking.
Split the qrels into dev and test sets.
Add anonymized query logs from real users.
Add a regression threshold, for example fail CI if Recall@5 drops by more than 3%.
Add a metadata filter by category to simulate permission/domain filters.
Try different chunking strategies and compare the metrics again.
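
The Reciprocal Rank Fusion step above can be sketched as follows. This is a minimal version that assumes each retriever returns an ordered list of chunk ids. The function name, the example ids, and the tie-break by id are illustrative choices, and k=60 is a commonly used default, not a value tuned for this benchmark.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Merge several ranked id lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)

    # Sort by fused score descending, breaking ties by id for deterministic output.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


# Hypothetical rankings for one query about HTTP 429.
dense_ranking = ["api_quota", "api_rate_limit", "billing_plan"]
bm25_ranking = ["api_rate_limit", "error_codes", "api_quota"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

RRF uses only ranks, so the dense cosine scores and the BM25 scores never need to be put on a common scale. In this toy case api_rate_limit moves to the top because both retrievers place it in their first two results.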

8. Completion criteria

At least 3 models run, or there is a clear explanation of which model could not run because of resource limits.
summary.csv exists.
There is a failure-case file for each model.
There is a model decision, not just a score table.
Production readiness is addressed.
A hybrid baseline for Vietnamese is proposed.