Day 25: When to Fine-tune, When to Use RAG

Objectives

After this lesson, you should be able to:

Correctly distinguish the roles of prompt engineering, RAG, tool calling, fine-tuning and distillation.
Know when to use prompt-only, when to add RAG, when to call tools, when to fine-tune, and when to combine techniques.
Understand full fine-tuning, PEFT, LoRA, QLoRA, adapters, prompt tuning and distillation well enough to make engineering decisions.
Design a hybrid RAG + fine-tuned model for production.
Write a decision record for an AI feature that covers trade-offs in quality, cost, latency, privacy, rollback and operability.
Answer clearly: can this be used in production, and if so, under what conditions?

TL;DR

RAG brings knowledge from external sources into the runtime context. Tool calling fetches realtime data or performs actions through APIs. Fine-tuning changes model behavior through training data. Prompt engineering is the fastest and cheapest control layer to experiment with, but it becomes brittle as workflows grow complex.

If the problem is "the model does not know new facts, private docs or realtime state", prefer RAG or tools. If the problem is "the model is inconsistent in format, tone, policy behavior or a repeated domain workflow", consider fine-tuning once you have a baseline and an eval. Production systems rarely pick a single technique: RAG/tools hold the source of truth, the fine-tuned model holds behavior, and the validator holds the contract.

1. Where This Lesson Sits In The Roadmap

Days 21-24 covered frameworks, agents, security, tool calling and memory. Day 25 opens the Fine-tuning & Local LLM phase: before training any model, you need to know whether training is actually necessary.

Day 21-24: app orchestration, agent, security, tool, memory
Day 25: deciding between prompt/RAG/tool/fine-tune/distill
Day 26: preparing an instruction tuning dataset
Day 27: hands-on LoRA/QLoRA
Day 28: evaluating before/after fine-tuning
Day 29-30: local LLM and deployment

A common mistake for teams new to AI is to fine-tune as the answer to every problem. From a production point of view, fine-tuning means building a new artifact, which brings a dataset, training jobs, a model registry, eval, deployment, rollback, privacy review and monitoring. If the same result can be reached with prompts, schema validation, RAG or tools, first prove with metrics that fine-tuning is worth the cost.

2. Mental Model For Senior Software Engineers

Technique	SE analogy	Main effect	When it changes
Prompt engineering	Runtime config	Instructs how to answer within a request	Every deploy or every request
Structured output	API contract	Forces output to follow a schema	Every request
RAG	Read path to search/database	Puts relevant documents into context	When the index/documents change
Tool calling	Internal API/RPC	Fetches realtime data or performs actions	When backend state changes
Fine-tuning	Building a new artifact	Teaches behavior from examples	When a new model is trained/deployed
Distillation	Rebuilding a smaller service	Compresses capability into a small model	When a new model is trained/deployed

Short rules:

Need new facts or private docs -> RAG
Need realtime state or actions -> tool calling
Need stable format/tone/workflow -> prompt + schema first, fine-tune if failures keep repeating
Need lower cost/latency for a narrow task -> distill/fine-tune a small model
Need both facts and behavior -> hybrid RAG/tool + fine-tune + validation

3. Do Not Start With Fine-tuning

A practical sequence is:

Define the task: Q&A, extraction, classification, generation, coding, support, agent workflow.
Write a baseline prompt with a clear input/output contract.
Add structured output or a JSON schema if downstream code needs to parse the result.
If knowledge is missing, add RAG.
If realtime data or actions are needed, add tool calling.
Build a golden set with inputs, expected outputs, correct sources, failure cases and metrics.
Fine-tune only when the baseline shows repeated failure modes that prompt/RAG/tool cannot fix well or can only fix at too high a cost.

Example: the model often returns invalid JSON. Fine-tuning may help, but the first production fix is schema validation, controlled retries, constrained decoding or the provider's structured output feature. Fine-tuning should come later, only if the format error rate stays high on the golden set or the prompt has grown long enough to hurt cost and latency.
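The validation-plus-bounded-retry fix mentioned above can be sketched as follows. This is a minimal illustration, not a provider API: `call_model` is a hypothetical callable that takes a prompt and returns raw text, and the required fields are an assumption borrowed from the support examples later in this lesson.

```python
import json
from typing import Callable

# Assumed output contract for illustration: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "risk_level": str, "needs_human": bool}


def validate(raw: str) -> dict:
    """Parse raw model output and check required fields and their types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            raise ValueError(f"field {name!r} missing or not {expected_type.__name__}")
    return data


def generate_with_retry(call_model: Callable[[str], str], prompt: str,
                        max_attempts: int = 3) -> dict:
    """Call the model up to max_attempts times, feeding the error back each time."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return validate(raw)
        except ValueError as exc:
            last_error = exc
            prompt = f"{prompt}\n\nPrevious output was rejected: {exc}. Return valid JSON only."
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Logging how often the retry path is taken gives you the format failure rate on the golden set, which is the number that should justify (or rule out) fine-tuning for format.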

4. When To Use Prompt-only

Prompt-only fits when:

The task is simple, low risk, and needs no private facts.
The output is natural-language text that downstream code does not parse strictly.
Traffic is low or the product is still in discovery.
Failures can be absorbed by human review.
Requirements change constantly and there is no stable metric yet.

Do not rely on prompts alone when:

You need citations, ACL, tenant isolation or source auditing.
You need realtime data such as order status, account balance or inventory.
The output feeds billing, compliance, legal or automated workflows.
The prompt is long, prone to drift and hard to version.

Production note: a prompt is an artifact that needs a version, an owner, a changelog, eval results and rollback. "Just a prompt" does not mean "no engineering required".

5. When To Use RAG

RAG fits when the source of truth lives outside the model:

Internal documents, policies, handbooks, runbooks, ticket history.
Knowledge that changes daily, weekly or per tenant.
Citations/sources are needed for audit or so users can verify the answer.
Permission-aware access control is required.
The full dataset cannot go into training because of privacy, license or compliance.

A production-style RAG pipeline:

User query
  -> auth + tenant resolution
  -> query rewrite / classification
  -> embedding or hybrid search
  -> metadata filter + ACL
  -> rerank
  -> context builder with token budget
  -> LLM answer
  -> citation checker + schema validator
  -> response + trace
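Two stages of the pipeline above, the ACL filter and the context builder with a token budget, can be sketched like this. The `Chunk` fields, the role names and the whitespace-based token estimate are all assumptions made for illustration; a real system would use its own metadata model and the tokenizer of the target model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    tenant_id: str
    allowed_roles: frozenset
    score: float  # retrieval or rerank score, higher is better


def filter_by_acl(chunks, tenant_id, role):
    """Drop chunks the caller may not see BEFORE any context is built."""
    return [c for c in chunks if c.tenant_id == tenant_id and role in c.allowed_roles]


def build_context(chunks, token_budget):
    """Greedily keep the highest-scoring chunks that still fit the budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = len(chunk.text.split())  # crude token estimate, assumption only
        if used + cost > token_budget:
            continue
        selected.append(chunk)
        used += cost
    return selected
```

The ordering matters: because filtering happens before `build_context`, a restricted chunk never reaches the prompt, so there is nothing for the model to leak.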

RAG does not automatically fix:

Output that violates the schema.
Tone that does not match the brand.
A model that does not follow a multi-step workflow.
Weak domain reasoning even when the context is correct.
Hallucination when the retrieved chunks are wrong, incomplete or contain prompt injection.

Performance trade-offs:

RAG adds latency for embedding/search/rerank. Typical budgets: retrieval 50-300 ms with a well-tuned internal index, rerank 100-800 ms depending on the model and top_k; generation usually remains the largest share of latency.
Start top_k at 3-8 chunks. Too few and context is missing; too many and token cost and noise go up.
Hybrid search (dense + BM25) is usually better than dense-only for documents with codes, product names, policy IDs or rare terms.
Reranking improves quality but needs its own cache and timeout.
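The lesson does not say how dense and BM25 results should be merged. One common option is reciprocal rank fusion (RRF), shown here as an assumption rather than as the method this course prescribes. It needs only the ranked document IDs from each retriever, not their raw scores; `k=60` is a conventional default and should be tuned on your own retrieval eval.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one list.

    Each document receives 1 / (k + rank) from every list it appears in;
    ties are broken by doc ID so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # hypothetical dense-retriever ranking
bm25_hits = ["c", "a", "d"]    # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Documents found by both retrievers ("a" and "c" here) rise above documents found by only one, which is the behavior you want for queries that mix natural language with exact identifiers.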

6. When To Use Tool Calling

Tool calling fits when the model needs to read or change real state:

Checking order status, refund status, account tier, quota.
Creating tickets, updating the CRM, sending email, scheduling.
Calling a pricing service, inventory service or fraud service.
Running deterministic calculations in an already tested service.

Production principles:

The model only proposes a tool call; the application executes it.
Tools must have auth, authorization, idempotency, timeouts, a retry policy and an audit log.
Never let the model decide permissions.
Actions with side effects need confirmation or a policy gate.
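The principles above can be sketched as a small executor that sits between the model's proposed tool call and the backend. Tool names, roles and the in-memory stores are hypothetical; the backend call is replaced by a stand-in dictionary so the sketch stays self-contained.

```python
from dataclasses import dataclass, field

# Hypothetical policy tables, owned by the application and never by the model.
SIDE_EFFECT_TOOLS = {"create_ticket", "send_email"}
ROLE_PERMISSIONS = {"support_agent": {"get_order_status", "create_ticket"}}


@dataclass
class ToolExecutor:
    audit_log: list = field(default_factory=list)
    _results: dict = field(default_factory=dict)  # idempotency_key -> stored result

    def execute(self, role, tool_name, arguments, idempotency_key, confirmed=False):
        # Authorization is decided here, outside the model.
        if tool_name not in ROLE_PERMISSIONS.get(role, set()):
            self.audit_log.append((idempotency_key, tool_name, "denied"))
            raise PermissionError(f"{role} may not call {tool_name}")
        # Side effects require explicit confirmation (or a policy decision).
        if tool_name in SIDE_EFFECT_TOOLS and not confirmed:
            self.audit_log.append((idempotency_key, tool_name, "needs_confirmation"))
            return {"status": "needs_confirmation"}
        # A repeated key returns the stored result instead of acting twice.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = {"status": "ok", "tool": tool_name, "arguments": arguments}  # stand-in for the real call
        self._results[idempotency_key] = result
        self.audit_log.append((idempotency_key, tool_name, "executed"))
        return result
```

A production version would also need timeouts, a retry policy and durable storage for the idempotency keys and the audit log.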

Tool calling does not replace RAG. Tools return state or perform actions; RAG reads documents. A support assistant usually needs both: RAG to read the refund policy, a tool to check the order, and a validator to make sure the response contains case_id, next_action and risk_level.

7. When To Fine-tune

Fine-tuning fits when you want the model to learn behavior from examples:

The output format must be very stable and the schema is complex.
A brand- or team-specific style/tone must stay consistent at high volume.
A domain workflow repeats: triage, code review, complaint handling, compliance refusal.
A narrow classification/extraction/generation task has many labeled examples.
The prompt is too long because the same instructions have to be repeated every time.
You want a smaller model to cut cost/latency while still meeting the quality target.

Fine-tuning is not suitable for:

Stuffing frequently changing facts into the weights.
Replacing a database, search engine or permission system.
Fixing poor ingestion/chunking/retrieval.
Guaranteeing no hallucination.
Skipping schema validation.
Hiding the fact that there is no eval set yet.

Minimum conditions before fine-tuning:

A prompt/RAG/tool baseline exists and the specific failure modes are known.
The dataset is clean, properly licensed, and free of PII/secrets that may not be trained on.
There is a train/validation/test split, and the golden set does not overlap with train.
There are metrics for quality, latency, cost and safety.
There is a registry for dataset version, base model, adapter/model artifact, prompt version and eval results.
There is a rollback path to the base model or the previous adapter.

8. Types Of Fine-tuning

Technique	Idea	Use when	Trade-off
Full fine-tuning	Update all weights	Large data, strong GPU/MLOps capacity, deep behavior change needed	High cost, large artifact, prone to overfitting or catastrophic forgetting
PEFT	Train only a small subset of parameters	You want to save VRAM and manage many tasks/domains	Depends on runtime support; may underperform full fine-tuning when the task is far from the base model
LoRA	Add low-rank matrices to selected layers	Mainstream instruction tuning, small artifact needed	Rank/target modules must be chosen; too low a rank can underfit
QLoRA	LoRA on a 4-bit quantized base model	Limited GPU, want to fine-tune a larger model	Training saves VRAM, but inference/deployment must be checked for quantization quality
Adapter	Insert small modules into the network	Many domains/tasks, want to switch adapters on and off	More complex runtime; not every stack supports it well
Prompt tuning	Train a soft prompt vector	Narrow task, large model, want to keep the weights frozen	Less common than LoRA in app engineering; hard to debug because the prompt is not human-readable
Distillation	Train a small model to imitate a large one	High throughput, low latency/cost, clearly narrow task	Needs high-quality teacher output and strict eval to avoid losing capability

Full Fine-tuning

Full fine-tuning updates all of the model's weights. It is powerful but expensive and risky. You need enough data, enough compute, training monitoring, checkpointing, multi-dimensional eval and deployment discipline.

Consider it when:

The domain is very different from the base model.
The task has a lot of high-quality data.
PEFT does not reach the metric after a reasonable number of experiments.
The team has the MLOps capacity to operate large model artifacts.

It should not be the first choice for an app engineering team. For most enterprise assistant use cases, LoRA/QLoRA or distillation is more practical.

PEFT, LoRA And QLoRA

PEFT is a family of fine-tuning techniques that update only a small subset of parameters. LoRA is the most popular variant: it adds low-rank matrices to attention/MLP layers, trains those matrices, and keeps the base model frozen. QLoRA saves more VRAM by quantizing the base model first and then training a LoRA adapter on top of it.
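To see why LoRA artifacts are small: for one weight matrix of shape d_out x d_in, LoRA trains two matrices B (d_out x r) and A (r x d_in) instead of the full update. The arithmetic below uses an illustrative 4096 x 4096 projection and rank 8; both numbers are assumptions for the example, not values from any specific model.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_update_params(d_out: int, d_in: int) -> int:
    """Parameters touched when the whole matrix is fine-tuned."""
    return d_out * d_in


d_model, rank = 4096, 8  # illustrative sizes only
lora = lora_trainable_params(d_model, d_model, rank)  # 65,536
full = full_update_params(d_model, d_model)           # 16,777,216
ratio = lora / full                                   # 1/256, about 0.39%
print(f"LoRA trains {lora:,} of {full:,} parameters ({ratio:.2%}) for this matrix")
```

Doubling the rank doubles the adapter size, which is the knob referred to in the trade-off column: a higher rank gives more capacity, a lower rank gives a smaller artifact but can underfit.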

Practical decisions:

Limited GPU and a need to try quickly: QLoRA.
A runtime that needs the adapter merged into the base model for simpler inference: LoRA, then merge if quality does not drop.
Many separate tenants/domains: keep multiple adapters and route by tenant/domain, but keep memory and cold start under control.

Adapter And Prompt Tuning

Adapters are also PEFT, but they insert small modules into the network. They are useful when many separate tasks/domains are needed, at the cost of a more complex runtime. Prompt tuning trains a "soft prompt" vector, which is not the same as human-readable prompt text. It can work well for narrow tasks but is harder to debug and less intuitive than LoRA.

Distillation

Distillation is not necessarily fine-tuning in the instruction tuning sense, but it usually sits in the same decision space. You use a large model as a teacher to produce labels/outputs, then train a small model for a narrow task.

Good use cases:

1 million requests/day for classification or extraction.
The large model reaches good quality but the cost per request is high.
A low latency target, for example p95 under 500 ms.
A clear output contract that can be evaluated automatically.

9. Hybrid RAG + Fine-tuned Model

A common production pattern:

Client
  -> API Gateway / Auth
  -> Orchestrator
      -> Query classifier
      -> Retriever with ACL
      -> Tool layer with policy gate
      -> Context builder
      -> Fine-tuned LLM or adapter-routed LLM
      -> Schema validator
      -> Citation checker
      -> Safety/compliance checker
  -> Response + trace

Division of responsibility:

RAG provides facts and citations.
Tools provide realtime state/actions.
The fine-tuned model provides tone, format, workflow and domain behavior.
The validator enforces the API contract.
Eval detects regressions.
Observability shows which model, adapter, prompt, retrieval index and tool version produced a response.

Example: fintech support assistant:

RAG retrieves the current refund policy with citations.
Tools check account status and transaction history.
The fine-tuned model learns to answer concisely, not to promise beyond policy, and to escalate when needed.
The schema validator requires the output to contain answer, sources, risk_level, needs_human and next_action.
Rollback can switch from adapter support-v3 back to support-v2, or to base model + prompt, if live eval turns bad.

10. Decision Matrix

Need	Start with	Why
Answering from internal policy with citations	RAG	Facts live in documents and a source is required
Realtime price, inventory, order status	Tool calling	Live state from the backend is required
Repeated JSON format errors	Structured output, validator, then fine-tune if needed	The schema is a mandatory contract; fine-tuning only improves stability
Customer support needs its own tone	Prompt baseline, then fine-tune if volume/failures are large enough	Tone is a behavior pattern
Knowledge changes weekly	RAG	Updating the index is cheaper and faster than training
Lower cost/request for a narrow task	Distillation or fine-tuning a small model	A small model may be good enough
Domain jargon and a standard answering style	RAG + fine-tune	RAG supplies facts, fine-tuning supplies style/workflow
Hallucination about documents	Fix retrieval/eval first	Fine-tuning does not fix wrong retrieved context
Multi-tenant private docs	RAG with ACL	Fine-tuning risks memorizing and leaking data
Actions with side effects	Tool calling + policy gate	The model should not execute actions on its own

11. Dataset, Privacy And Governance

Data checklist before training:

Is the data source cleared for use in training?
Does it contain PII, PHI, secrets, tokens, passwords, customer contracts or regulated data?
Is anonymization, redaction or synthetic data needed?
Is there appropriate consent or a DPA?
Do the base model and dataset licenses allow commercial/internal use?
Is data contamination between train and eval prevented?
Is the dataset versioned with a hash, manifest or data registry?

A minimal instruction tuning format:

{
  "id": "support_refund_001",
  "messages": [
    {
      "role": "system",
      "content": "You are a support assistant for a fintech. Answer briefly, follow policy, and do not promise a refund unless the conditions are met."
    },
    {
      "role": "user",
      "content": "I was charged twice when upgrading my plan."
    },
    {
      "role": "assistant",
      "content": "{\"answer\":\"I will check the duplicate charge and open a reconciliation request if it qualifies.\",\"risk_level\":\"medium\",\"needs_human\":true}"
    }
  ],
  "metadata": {
    "source": "resolved_ticket",
    "policy_version": "refund_policy_2026_04",
    "contains_pii": false,
    "split": "train"
  }
}

Do not put raw tickets containing emails, phone numbers, card numbers, access tokens or other sensitive content into training before they have passed governance. With RAG, private data must still go through ACL before retrieval, not merely be filtered after the LLM has answered.
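A first-pass automated check over records in the format above might look like the sketch below. The two regular expressions are deliberately crude assumptions (email-like strings and 9 to 19 digit runs that could be phone or card numbers) and do not replace a real redaction pipeline or a privacy review; the overlap check is a simple exact-match guard against train/eval contamination.

```python
import re

# Crude illustrative patterns; real PII detection needs a dedicated tool.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS_RE = re.compile(r"\b\d{9,19}\b")


def find_issues(record: dict) -> list:
    """Return human-readable reasons why a record should not be trained on."""
    issues = []
    if record["metadata"].get("contains_pii"):
        issues.append("metadata flags contains_pii")
    for message in record["messages"]:
        content = message["content"]
        if EMAIL_RE.search(content):
            issues.append(f"email-like string in {message['role']} message")
        if LONG_DIGITS_RE.search(content):
            issues.append(f"long digit sequence in {message['role']} message")
    return issues


def overlapping_user_inputs(train: list, test: list) -> set:
    """Exact-match overlap of normalized user messages between two splits."""
    def user_texts(records):
        return {
            m["content"].strip().lower()
            for r in records
            for m in r["messages"]
            if m["role"] == "user"
        }
    return user_texts(train) & user_texts(test)
```

Records with a non-empty issue list would be held back for redaction or review, and any overlap between train and the golden set should be resolved before an eval result is trusted.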

12. Eval Before Deciding

Do not fine-tune on gut feeling. An eval set comes first.

Metrics by layer:

Retrieval: recall@k, MRR, nDCG, citation hit rate, permission violation rate.
Generation: exact match for extraction/classification, rubric score for text, JSON validity, schema pass rate.
Faithfulness: whether the answer is supported by the context, citation correctness.
Safety: refusal accuracy, policy violation rate, prompt injection success rate.
Business: task success rate, human escalation rate, handle time, CSAT proxy.
Performance: p50/p95 latency, tokens/request, cost/request, throughput, error rate.
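Two of the retrieval metrics above, recall@k and MRR, are simple enough to compute directly on a golden set. The sketch assumes each golden example provides the ranked list of retrieved document IDs and the set of IDs a human marked as relevant; the IDs shown are made up.

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


golden_runs = [
    (["d3", "d1", "d7"], {"d1", "d2"}),  # first relevant hit at rank 2
    (["d5"], {"d5"}),                    # first relevant hit at rank 1
]
mrr = mean_reciprocal_rank(golden_runs)  # (0.5 + 1.0) / 2
```

Tracking these per slice (tenant, document type, language) usually reveals more than a single global number.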

Minimum comparison (fill in one row per variant):

Variant	Quality	p95 latency	Cost/request	Risk
Prompt baseline
Prompt + RAG/tool
Fine-tuned model
Hybrid

Fine-tuning is worth doing when it improves an important metric by enough to offset the operating cost. Examples: schema pass rate rises from 92% to 99.2%, p95 latency drops 40% thanks to a smaller model, or human escalation drops 15% without safety getting worse.

13. Cost And Latency

Cost must be calculated over the full lifecycle:

Build cost: data cleaning, labeling, privacy review, training compute, experiments.
Inference cost: tokens, model price, GPU hours, adapter memory, batching.
Ops cost: registry, eval, monitoring, rollback, on-call.
Opportunity cost: time the team spends training instead of fixing retrieval or the product flow.

Latency trade-offs:

Prompt-only is usually the simplest, but long prompts increase latency.
RAG adds retrieval/rerank latency but reduces factual hallucination.
Tool calling adds network latency and the failure modes of backend services.
A small fine-tuned model can reduce latency, but adapter routing or cold loads can increase tail latency.
A large fully fine-tuned model is not faster than the base model by default.

An example budget for a support assistant:

p95 target: 2.5s
auth + request validation: 50ms
retrieval + ACL + rerank: 450ms
tool calls: 300ms with a 700ms timeout
LLM generation: 1.4s
validation + citation check: 150ms
buffer: 150ms

If RAG + tools exceed the latency target, do not jump to fine-tuning. First look at caching, running retrieval and tools in parallel, reducing top_k, choosing a lighter reranker, streaming the response, or routing simple tasks to a small model.
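A budget like the one above is easy to keep honest with a few lines of arithmetic. The sketch sums the listed stage allocations against the 2.5s target and estimates what running retrieval and tool calls in parallel would save. Treat the sum as a planning approximation: per-stage p95 values do not add up exactly to the end-to-end p95.

```python
P95_TARGET_MS = 2500

# Stage allocations taken from the example budget (milliseconds).
STAGES_MS = {
    "auth + request validation": 50,
    "retrieval + ACL + rerank": 450,
    "tool calls": 300,
    "LLM generation": 1400,
    "validation + citation check": 150,
}


def remaining_buffer_ms(target_ms: int, stages_ms: dict) -> int:
    """Budget left after all sequential stages; negative means over budget."""
    return target_ms - sum(stages_ms.values())


def parallel_saving_ms(stage_a_ms: int, stage_b_ms: int) -> int:
    """Time saved when two independent stages run concurrently (max instead of sum)."""
    return stage_a_ms + stage_b_ms - max(stage_a_ms, stage_b_ms)


buffer_ms = remaining_buffer_ms(P95_TARGET_MS, STAGES_MS)
saving_ms = parallel_saving_ms(
    STAGES_MS["retrieval + ACL + rerank"], STAGES_MS["tool calls"]
)
print(f"sequential buffer: {buffer_ms} ms, saved by parallelizing: {saving_ms} ms")
```

Parallelizing is only valid when the tool call does not depend on retrieved context; if it does, the two stages stay sequential.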

14. Rollback And Deployment

Model deployment should look like software deployment:

Version everything: dataset, prompt, base model, adapter, tokenizer, retrieval index, reranker, schema.
An offline eval gate must pass before deploy.
Canary by tenant, traffic percentage or task type.
Use shadow mode to compare responses without returning them to users.
Keep a fast rollback to the previous adapter/model/prompt.
Log enough metadata to debug each response.

Metadata to log:

{
  "trace_id": "tr_123",
  "tenant_id": "tenant_a",
  "prompt_version": "support_prompt_v4",
  "base_model": "base-model-x",
  "adapter_version": "support_lora_v3",
  "retrieval_index": "policy_index_2026_05_01",
  "reranker_version": "reranker_v2",
  "schema_version": "support_response_v1",
  "input_tokens": 1840,
  "output_tokens": 220,
  "latency_ms": 2130,
  "estimated_cost_usd": 0.0041,
  "eval_tags": ["refund", "billing"]
}

15. Can It Be Used In Production? If So, Under What Conditions?

Yes, but each technique has different conditions.

Prompt-only is production-ready when the task is low risk and there is prompt versioning, basic eval, observability, a token budget and a fallback.

RAG is production-ready when ingestion is stable, chunking is tested, retrieval eval meets its threshold, ACL is enforced before the context is built, the citation checker works, the index is versioned and drift is monitored.

Tool calling is production-ready when tools have auth, authorization, idempotency, timeouts, retries, an audit log, confirmation for side effects and a policy gate that is separate from the model.

Fine-tuning is production-ready when there is a lawful and clean dataset, offline/online eval, a model registry, canary, rollback, privacy review, a cost model, latency tests and regression monitoring.

Hybrid RAG + fine-tune is production-ready when the team can operate both the retrieval pipeline and the model artifact. It is often the best solution for complex enterprise assistants, but it should not be used without clear metrics because complexity rises sharply.

16. Best Solution By Context

Context	Best starting solution	When to upgrade
Internal POC within 1 week	Prompt + structured output	Add RAG if facts are missing; add eval before scaling
Q&A over company documents	RAG + citation	Add fine-tuning if tone/workflow still fails on the golden set
Support that creates tickets and updates the CRM	Tool calling + RAG + validator	Fine-tune when response style and triage decisions are unstable
Invoice extraction	Structured output + validator	Fine-tune/distill if volume is high or schema pass rate is low
Product FAQ with price/inventory	RAG for docs + tools for price/inventory	Never fine-tune realtime facts; fine-tune tone only if needed
High-volume classification	Prompt baseline, then distill/fine-tune a small model	When the large model's cost/latency exceeds the budget

17. Self-check

Why is fine-tuning a poor way to update realtime facts?
When is RAG not enough, so that fine-tuning is needed?
How does LoRA differ from full fine-tuning?
What problem does QLoRA solve, and what risk does it trade for it?
How does hybrid RAG + fine-tune divide responsibilities?
Which metrics prove that fine-tuning is worth more than prompt/RAG/tool?
Why should multi-tenant private docs use RAG with ACL instead of a shared fine-tune?

18. Checklist

Can distinguish prompt, RAG, tool calling, fine-tuning and distillation.
Can explain why new facts belong in RAG/tools.
Can explain why stable behavior/format may need fine-tuning.
Understand full fine-tuning, PEFT, LoRA, QLoRA, adapters and prompt tuning.
Can design a hybrid RAG + fine-tune system.
Have a decision matrix for at least 5 use cases.
Have golden metrics before proposing fine-tuning.
Have production notes on dataset, privacy, eval, rollback, cost and latency.

Reference Material

1. Decision Flow

Use this flow before proposing fine-tuning:

1. Does the task need facts/private docs?
   Yes -> RAG with ACL, citation, retrieval eval.
   No -> go to step 2.

2. Does the task need realtime state or side effects?
   Yes -> tool calling with policy gate, audit, idempotency.
   No -> go to step 3.

3. Does the output need a machine-readable contract?
   Yes -> structured output, schema validation, bounded retries.
   No -> go to step 4.

4. Is the main failure a repeated behavior/style/workflow problem?
   Yes -> build a golden set, compare the prompt baseline with fine-tune/PEFT.
   No -> fix the product flow, prompt, retrieval, data or UX first.

5. Does the large model's cost/latency exceed the budget for a narrow task?
   Yes -> consider distillation or fine-tuning a small model.
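The flow above can be encoded as a small function so that decision records are produced consistently. This sketch assumes the steps are cumulative, meaning a "yes" at one step still continues to the next so that hybrid answers are possible; the field names on `TaskProfile` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    needs_private_or_fresh_facts: bool
    needs_realtime_state_or_action: bool
    needs_machine_readable_output: bool
    repeated_behavior_failures: bool
    over_budget_on_narrow_task: bool


def recommend(profile: TaskProfile) -> list:
    """Walk the five questions in order and collect recommendations."""
    steps = []
    if profile.needs_private_or_fresh_facts:
        steps.append("RAG with ACL, citation, retrieval eval")
    if profile.needs_realtime_state_or_action:
        steps.append("tool calling with policy gate, audit, idempotency")
    if profile.needs_machine_readable_output:
        steps.append("structured output, schema validation, bounded retries")
    if profile.repeated_behavior_failures:
        steps.append("build golden set; compare prompt baseline with fine-tune/PEFT")
    else:
        steps.append("fix product flow, prompt, retrieval, data or UX first")
    if profile.over_budget_on_narrow_task:
        steps.append("consider distillation or fine-tuning a small model")
    return steps


# Example: product FAQ with realtime price and inventory.
faq = TaskProfile(True, True, False, False, False)
faq_plan = recommend(faq)
```

The function only records which questions were answered "yes"; the final decision still has to be backed by metrics and risk review.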

2. Technique Scorecard

Score 1-5; a higher score means a better fit.

Criterion	Prompt	RAG	Tool	Fine-tune	Distill
Fast setup	5	3	3	1	1
Frequently changing facts	1	5	5	1	1
Citation/source	1	5	3	1	1
Realtime action	1	1	5	1	1
Stable format/tone	3	2	1	5	4
Shorter prompts	2	2	1	5	4
Lower latency/cost at scale	2	2	2	4	5
Low ops complexity	5	3	3	1	1
Privacy with private tenant docs	3	5	4	2	2

Do not add up the scores mechanically. The scorecard helps you ask the right questions; the final decision must rest on metrics and risk.

3. AI Technique Decision Record Template

# AI Technique Decision Record

## Context

- Feature:
- Owner:
- Users:
- Tenant/data scope:
- Data sensitivity:
- Expected traffic:
- p95 latency target:
- Cost/request target:
- Release deadline:

## Problem

Describe the user workflow, input, desired output and downstream dependencies.

## Current Baseline

- Prompt version:
- Model:
- Existing RAG index/tools:
- Eval set:
- Quality:
- p95 latency:
- Cost/request:

## Failure Modes

- Wrong facts:
- No source/citation:
- Retrieval miss:
- Permission leak:
- Wrong format:
- Wrong tone/style:
- Wrong workflow:
- Wrong tool call:
- High latency/cost:
- Safety/compliance issue:

## Options

| Option | Quality | Cost | Latency | Ops complexity | Privacy risk | Rollback | Notes |
|---|---:|---:|---:|---:|---:|---:|---|
| Prompt-only | | | | | | | |
| RAG | | | | | | | |
| Tool calling | | | | | | | |
| Fine-tune/PEFT | | | | | | | |
| Distillation | | | | | | | |
| Hybrid | | | | | | | |

## Decision

- Chosen:
- Not chosen:
- Rationale:
- Conditions to revisit:

## Implementation Plan

1. Baseline:
2. Data/index/tool:
3. Eval:
4. Rollout:
5. Monitoring:
6. Rollback:

## Metrics

- Task success rate:
- Schema pass rate:
- Faithfulness/citation correctness:
- Retrieval recall@k:
- Tool success rate:
- Human escalation rate:
- p95 latency:
- Cost/request:
- Safety violation rate:

4. Use Case Decision Records

Use Case 1: Internal Policy Q&A Chatbot

Decision: RAG first; do not fine-tune facts.

Rationale:

Policies change over time and need citations.
Users need to know which document an answer is based on.
There is permission risk by department or tenant.

Implementation:

Auth -> tenant/role -> hybrid search -> ACL filter -> rerank -> context builder -> LLM -> citation checker

Metrics:

retrieval recall@5 >= 0.85 on the golden set.
citation correctness >= 0.95.
permission violation rate = 0.
p95 latency <= 3s.

When to add fine-tuning: if the retrieved context is correct but the model keeps answering at excessive length, misses the HR/compliance tone, or does not follow the refusal pattern even with a good prompt.

Use Case 2: Support Assistant That Creates Tickets

Decision: hybrid tool calling + RAG + structured output; fine-tune later if triage/tone failures keep repeating.

Rationale:

RAG is needed to read the policy.
Tools are needed to create tickets and check the account/order.
A schema is needed so the downstream ticket system can parse the output.
Fine-tuning helps if there are many high-quality resolved tickets.

Output contract:

{
  "summary": "Customer was charged twice after upgrading.",
  "category": "billing",
  "priority": "medium",
  "needs_human": true,
  "tool_calls": [
    {"name": "create_ticket", "arguments": {"category": "billing"}}
  ]
}

Production condition: tool execution must be idempotent and audit-logged, and the model must not be able to set a high priority on its own when policy does not allow it.

Use Case 3: Extract Invoices To JSON

Decision: structured output + validator first; fine-tune or distill if volume is high or the schema pass rate falls short.

Rationale:

This is a narrow task with a clear output contract.
RAG is usually unnecessary when the data is in the invoice input itself.
Fine-tuning a small model can cut cost when processing many invoices.

Metrics:

JSON validity >= 99.5%.
field-level F1 for invoice_number, date, total, tax, vendor >= 0.97.
p95 latency <= 1.5s if synchronous.
human correction rate clearly lower than the baseline.

Rollback: if the fine-tuned model misses rare vendors or a new tax rule, route that group of inputs to the fallback of base model + validator.

Use Case 4: Code Review Assistant In The Team's Style

Decision: RAG for coding standards + fine-tune/LoRA for comment style if a high-quality review dataset exists.

Rationale:

Coding standards change, so they belong in RAG.
Review style, severity labeling and comment format are behavior patterns.
Tools can call static analysis, test results or code search.

Architecture:

PR diff -> static analysis tools -> RAG team standards -> LLM reviewer -> schema validator -> comments

Fine-tune condition:

At least a few thousand good review comments exist, with sensitive information removed.
Eval measures false positives, missed critical issues, usefulness score and comment tone.
Guardrails prevent generating secrets or leaking code outside the scope.

Use Case 5: Product FAQ With Realtime Price And Inventory

Decision: RAG for product docs, tool calling for price/inventory, no fine-tuning of realtime facts.

Rationale:

Price and inventory change constantly.
Product descriptions can live in the catalog/search index.
Fine-tuned facts go stale and risk giving wrong information.

Implementation:

Query -> product search/RAG -> price tool -> inventory tool -> response with freshness timestamp

Metrics:

price accuracy = 100% against the pricing service.
inventory accuracy = 100% against the inventory service.
stale response rate = 0 for price/inventory.
p95 latency <= 2s, or use streaming if tools are slow.

When to fine-tune: only for sales tone, objection handling or a consistent product-advice format, never to store price/inventory.

5. Dataset Checklist For Fine-tuning

6. Privacy Checklist

Data is classified: public, internal, confidential, regulated.
A redaction/anonymization pipeline exists.
A policy exists for data that must not go to hosted training.
Tenant isolation exists if adapters are trained per tenant.
A retention policy covers raw data, processed data, checkpoints and logs.
Data can be deleted when a user/customer requests it.
Base model and dataset licenses have been reviewed.

7. Eval Checklist

Eval runs automatically in CI or the release pipeline.
Offline eval runs before deploy.
Shadow/canary eval runs during release.
Clear go/no-go thresholds exist.
Eval is sliced by tenant, language, document type, product line, risk level.
Cost and latency metrics exist, not only quality.
A regression report compares the base model, previous adapter and new adapter.

Example go/no-go:

Deploy if:
- schema pass rate >= 99%
- faithfulness >= 95%
- safety violation <= 0.2%
- p95 latency does not increase by more than 15%
- cost/request does not increase by more than 10%
- no permission leak in the test suite
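The go/no-go list translates directly into an automated gate that a release pipeline can run. The metric key names below are hypothetical; the thresholds are the ones from the example, with latency and cost compared against the currently deployed baseline.

```python
def release_gate(candidate: dict, baseline: dict) -> list:
    """Return the list of failed checks; an empty list means 'go'."""
    failures = []
    if candidate["schema_pass_rate"] < 0.99:
        failures.append("schema pass rate below 99%")
    if candidate["faithfulness"] < 0.95:
        failures.append("faithfulness below 95%")
    if candidate["safety_violation_rate"] > 0.002:
        failures.append("safety violation rate above 0.2%")
    if candidate["p95_latency_ms"] > baseline["p95_latency_ms"] * 1.15:
        failures.append("p95 latency increased by more than 15%")
    if candidate["cost_per_request_usd"] > baseline["cost_per_request_usd"] * 1.10:
        failures.append("cost/request increased by more than 10%")
    if candidate["permission_leaks"] > 0:
        failures.append("permission leak found in test suite")
    return failures


baseline = {"p95_latency_ms": 2000, "cost_per_request_usd": 0.0040}
candidate = {
    "schema_pass_rate": 0.993,
    "faithfulness": 0.96,
    "safety_violation_rate": 0.001,
    "p95_latency_ms": 2400,          # +20% against baseline, should fail
    "cost_per_request_usd": 0.0041,
    "permission_leaks": 0,
}
failed_checks = release_gate(candidate, baseline)
```

Returning the reasons instead of a bare boolean makes the regression report readable and lets the pipeline attach them to the release record.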

8. Rollback Checklist

The base model version is pinned.
Adapter/model artifacts have immutable versions.
The prompt version can be rolled back.
The retrieval index version can be rolled back.
The schema version is backward-compatible or has a migration.
The canary has a kill switch.
A fallback route exists per task/tenant.
The dashboard shows quality proxy, latency, cost, errors and safety.
Logs carry enough metadata to tell which model/prompt/index produced a response.

9. Cost Model Template

Traffic:
- requests/day:
- avg input tokens:
- avg output tokens:
- p95 input tokens:
- p95 output tokens:

Prompt/RAG:
- retrieval calls/request:
- rerank calls/request:
- avg chunks:
- avg chunk tokens:
- cache hit rate:

Fine-tuning:
- data cleaning hours:
- labeling cost:
- training GPU hours:
- experiments count:
- artifact storage:
- eval runs:

Inference:
- hosted model cost/request:
- local GPU cost/hour:
- throughput request/sec/GPU:
- adapter memory overhead:
- batching strategy:

Decision:
- current cost/month:
- expected cost/month:
- break-even traffic:
- quality gain required:

10. Latency Budget Template

p95 target:
- gateway/auth:
- query rewrite:
- embedding/search:
- metadata filter/ACL:
- rerank:
- tool calls:
- LLM generation:
- validation:
- post-processing:
- observability overhead:

Optimization candidates:
- cache:
- parallelize:
- reduce top_k:
- smaller model:
- streaming:
- async workflow:
- distillation:

11. Production Readiness By Technique

Technique	Production-ready when	Not production-ready when
Prompt-only	Versioning, eval, fallback and logs exist	Prompts are scattered through the code, no metrics
RAG	ACL, retrieval eval, citation and index versioning exist	Permissions are filtered after generation, recall is not measured
Tool calling	Auth, idempotency, audit and timeouts exist	The model triggers side effects without control
LoRA/QLoRA	Clean dataset, eval, registry and rollback exist	Trained on raw sensitive data, no canary
Full fine-tune	Strong MLOps, large data, regression suite	Used only to fix a few prompt errors
Distillation	Good teacher, narrow task, strict eval	Open-ended task, broad reasoning needed, no golden set

Exercises

Practice Objectives

After this lesson, you should be able to produce 5 near-production decision records. Each record must answer:

Is the real problem facts, realtime state, action, format, tone, workflow, cost or latency?
Should you use prompt, RAG, tool calling, fine-tuning, distillation or a hybrid?
Why not the other options?
Which metrics prove the decision is right?
Can it be used in production? If so, under what conditions?

Preparation

Create a separate file for the exercises, for example:

mkdir -p notes/day-25
touch notes/day-25/decision-records.md

No API key is needed. These exercises focus on architecture decisions and production checklists.

Required Template

Copy this template for each use case:

# Decision Record: <Use case name>

## 1. Context

- Users:
- Workflow:
- Input:
- Output:
- Data sensitivity:
- Expected traffic:
- p95 latency target:
- Cost/request target:

## 2. Current Failure Modes

- Wrong facts:
- No citation:
- Needs realtime state/action:
- Wrong format:
- Wrong tone/style:
- Wrong workflow:
- High latency/cost:
- Privacy/security risk:

## 3. Options

| Option | Quality | Cost | Latency | Ops complexity | Privacy risk | Rollback | Notes |
|---|---:|---:|---:|---:|---:|---:|---|
| Prompt-only | | | | | | | |
| RAG | | | | | | | |
| Tool calling | | | | | | | |
| Fine-tune/PEFT | | | | | | | |
| Distillation | | | | | | | |
| Hybrid | | | | | | | |

## 4. Decision

- Chosen:
- Not chosen:
- Rationale:
- Accepted trade-offs:
- Conditions to revisit:

## 5. Production Plan

- Data/index/tools to prepare:
- Eval set:
- Metrics:
- Rollout:
- Rollback:
- Observability:

## 6. Can it be used in production? If so, under what conditions?

Answer specifically for the use case.

Exercise 1: Internal Policy Q&A Chatbot

Scenario:

The company has 2,000 pages of HR, security and finance policy. Policies change every week.
Users want to ask in Vietnamese, and answers must include citations. Some documents are for managers only.

Requirements:

Choose between prompt-only, RAG, fine-tune or hybrid.
Describe how to enforce ACL.
Propose retrieval and citation metrics.
Explain why the policies should not be fine-tuned into the model.

Expected solution hints:

RAG is the core.
Fine-tuning is considered only later, if tone/refusal/workflow fails even with correct context.
Permissions must be filtered before the context builder.

Exercise 2: Support Assistant That Creates Tickets

Scenario:

The assistant receives a customer complaint, checks the order/account, reads the refund policy,
replies to the customer and creates a ticket if needed. The ticket system requires JSON that matches its schema.

Requirements:

Design a flow with RAG, tool calling and structured output.
State which side effects need confirmation or idempotency.
State when fine-tuning on resolved tickets makes sense.
Propose a rollback if a new adapter increases wrong escalations.

Suggested output contract:

{
  "customer_reply": "string",
  "ticket": {
    "category": "billing|delivery|account|other",
    "priority": "low|medium|high",
    "summary": "string"
  },
  "needs_human": true,
  "sources": ["policy://refund#section-2"]
}

Exercise 3: Extract Invoices To JSON

Scenario:

The system receives invoice PDFs that have been OCR'd to text. It must extract vendor, invoice_number,
date, line_items, tax, total. Traffic is 200,000 invoices/day.

Requirements:

Choose a baseline.
Name the field-level metrics.
When is distillation/fine-tuning a small model worth doing?
Write the data privacy checklist.

Hints:

RAG is not needed if all information is in the invoice.
Structured output + validator is the baseline.
Distillation/fine-tuning is worth considering because traffic is high and the task is narrow.

Exercise 4: Code Review Assistant In The Team's Style

Scenario:

The team wants an assistant that reviews PRs against the internal coding standard,
with short comments, clear severity and no noise. There is a history of 30,000 review comments.

Requirements:

Separate what should be RAG from what could be fine-tuned.
Are tools needed? If so, which ones?
Name the risks of using code review history as training data.
Propose an eval that guards against false positives.

Hints:

The coding standard should be RAG because it changes.
Review style/severity mapping can use LoRA if the dataset is clean.
Tools can include static analysis, test results, code ownership.

Exercise 5: Product FAQ With Realtime Price And Inventory

Scenario:

An e-commerce assistant answers product questions. Product descriptions change with the catalog.
Price and inventory change by the minute. Wrong prices are not acceptable.

Requirements:

Choose RAG, tool or fine-tune for each type of data.
Design a freshness guarantee.
State the latency budget.
Explain when fine-tuning may still be useful.

Hints:

Product docs/catalog: RAG or search.
Price/inventory: tools.
Fine-tune only for tone/sales behavior, never for realtime facts.

Exercise 6: Cost And Latency Calculation

Given an assistant with these numbers:

100,000 requests/day
Prompt-only:
- avg input 2,000 tokens
- avg output 300 tokens
- p95 latency 2.2s

RAG:
- retrieval + rerank p95 500ms
- avg additional context 1,200 tokens
- p95 latency 3.0s

Fine-tuned small model:
- avg input reduced to 900 tokens
- avg output 220 tokens
- p95 latency 1.4s
- needs 2 weeks of data cleaning/training/eval

Requirements:

State which option should ship first if the deadline is 1 week.
State which option is worth trying if traffic grows to 1 million requests/day.
State which metrics must be measured before concluding that fine-tuning is cheaper.
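A starting point for the calculation is the daily token volume per option, which follows directly from the numbers given. Two things are assumptions here and are marked as such in the code: the RAG output length (the exercise does not state it, so the prompt-only value is reused) and the per-million-token prices, which are placeholders to replace with your provider's or your own GPU cost figures.

```python
REQUESTS_PER_DAY = 100_000

# Average tokens per request, from the exercise statement.
OPTIONS = {
    "prompt_only": {"input": 2_000, "output": 300},
    "rag": {"input": 2_000 + 1_200, "output": 300},  # output length is an assumption
    "fine_tuned_small": {"input": 900, "output": 220},
}


def daily_tokens(option: dict, requests_per_day: int) -> tuple:
    """Total (input, output) tokens per day for one option."""
    return (option["input"] * requests_per_day, option["output"] * requests_per_day)


def daily_cost_usd(option: dict, requests_per_day: int,
                   price_in_per_1m: float, price_out_per_1m: float) -> float:
    """Daily inference cost for given prices per one million tokens."""
    tokens_in, tokens_out = daily_tokens(option, requests_per_day)
    return tokens_in / 1_000_000 * price_in_per_1m + tokens_out / 1_000_000 * price_out_per_1m


for name, option in OPTIONS.items():
    tokens_in, tokens_out = daily_tokens(option, REQUESTS_PER_DAY)
    print(f"{name}: {tokens_in:,} input + {tokens_out:,} output tokens/day")

# Placeholder prices (USD per 1M tokens), purely hypothetical.
example_cost = daily_cost_usd(OPTIONS["prompt_only"], REQUESTS_PER_DAY, 1.0, 2.0)
```

Inference cost alone does not settle the question: the two weeks of build effort, eval runs and ongoing ops cost have to be added before calling the fine-tuned option cheaper.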

Self-grading Rubric

You meet the bar if every decision record has:

Short Reference Answers

Use case	Main decision	Should fine-tuning be used?
Internal policy Q&A	RAG + ACL + citation	Possibly, only for tone/refusal when the context is correct but behavior is wrong
Support ticket	RAG + tool + structured output	Yes, if resolved tickets are clean and triage/tone failures keep repeating
Invoice extraction	Structured output + validator	Yes, if volume is high or schema/field accuracy needs to improve
Code review	RAG standards + tools + optional LoRA	Yes, for style/severity if the data is clean
Product FAQ realtime	RAG/search + price/inventory tools	Not for facts; possibly for sales tone