Day 17: LLM Fundamentals

Objectives

After this lesson, you should be able to:

Understand that an LLM produces output through next-token prediction, not by querying a database of absolute truths.
Explain tokenization, token IDs, logits, probability distributions, the context window, and decoding.
Distinguish pre-training, supervised fine-tuning (SFT), and RLHF or other preference tuning.
Calculate the token budget for a request that includes the system prompt, user input, chat history, retrieved documents, tool results, and output.
Choose temperature, top_p, top_k, max_tokens, and stop sequences for each use case.
Compare hosted closed models and local/open-weight models on quality, cost, latency, privacy, security, compliance, and operations.
Answer clearly: can LLMs be used in production, and under what conditions?

TL;DR

An LLM is a probabilistic runtime: it receives context, turns text into tokens, predicts the next token, and repeats until it stops. Good chat behavior comes from SFT and preference tuning; broad knowledge comes from pre-training; but factual correctness in production still requires retrieval, tools, validation, evals, and monitoring.

A production decision should not stop at "which model is smartest". Also ask: how many tokens does a request consume, what is the p95 latency, may the data be sent to a provider, can the output be validated against a schema, is there a golden set for model upgrades, and what is the fallback when the provider fails.

1. Where Day 17 Sits in the Course

Days 9-16 built the foundation: neural networks, NLP, tokenizers, attention, Transformers, and fine-tuning a classifier. Day 17 opens Phase 3: LLM Application Engineering.

Day 14: Transformer architecture
Day 15: Hugging Face ecosystem
Day 16: fine-tune classifier
Day 17: LLM runtime fundamentals
Day 18: prompt engineering
Day 19: structured output and function calling
Day 20: LLM app architecture for production

From a Senior Software Engineer's point of view:

LLM concept	SE analogy	Production implication
Prompt	Request contract	Version, test, and roll back like an API contract
Tokenizer	Parser/encoder	Changing the tokenizer can change behavior
Model weights	Runtime artifact	Pin the model ID/version; keep release notes
Context window	Request payload limit	Budget for input, retrieved docs, and output
Decoding params	Runtime config	Changes need eval, logging, and canary
Output	External service response	Do not trust raw text; parse, validate, and apply guardrails
Evaluation set	Probabilistic test suite	Catches regressions when the prompt/model changes

2. How an LLM Generates Text

Basic inference flow:

text prompt
  -> tokenizer
  -> token IDs
  -> model forward pass
  -> logits for the next token
  -> decoding selects a token
  -> append the token to the context
  -> repeat until max_tokens/stop/end token

A very simple example:

Prompt: "Thủ đô của Việt Nam là" ("The capital of Vietnam is")
High-probability next tokens: " Hà", " thủ", " thành", ...
The model picks " Hà"
New context: "Thủ đô của Việt Nam là Hà"
High-probability next tokens: " Nội", ...

The key point: the model does not "look up" the answer in a database. It learns a distribution from training data and generates plausible tokens given the context. So the output can be correct, wrong, unsourced, or very confident-sounding with no basis.
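
The loop above can be sketched with a toy lookup table standing in for the model's forward pass. This is only an illustration: the probabilities, the `<eos>` marker, and the names `NEXT_TOKEN_PROBS` and `generate_greedy` are made up for this sketch, and a real model computes the distribution from its weights instead of reading a table.

```python
PROMPT = "Thủ đô của Việt Nam là"

# Hypothetical next-token distributions keyed by the current context.
# A real model derives these from a forward pass; the numbers are invented.
NEXT_TOKEN_PROBS = {
    PROMPT: {" Hà": 0.86, " thủ": 0.08, " thành": 0.06},
    PROMPT + " Hà": {" Nội": 0.97, " Giang": 0.03},
    PROMPT + " Hà Nội": {"<eos>": 0.90, ",": 0.10},
}


def generate_greedy(prompt: str, max_tokens: int = 8) -> str:
    """Repeatedly pick the most likely next token and append it to the context."""
    context = prompt
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS.get(context)
        if probs is None:  # the toy table has no entry for this context
            break
        token = max(probs, key=probs.get)  # greedy decoding: take the argmax
        if token == "<eos>":  # end token stops generation
            break
        context += token
    return context


print(generate_greedy(PROMPT))
```

Nothing in the loop checks whether the chosen token is true; it only follows the distribution, which is why a plausible but wrong continuation is always possible.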

Logits, probability, and hallucination

The model returns logits: unnormalized scores for every token in the vocabulary. Decoding converts the logits into a probability distribution (typically via softmax) and then selects a token from it.
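
As a minimal sketch of that conversion, the softmax below turns a handful of logits into probabilities. The three-token vocabulary and the logit values are assumptions for illustration; a real vocabulary has tens of thousands of entries.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert unnormalized scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Hypothetical logits for the token after "Thủ đô của Việt Nam là".
logits = {" Hà": 5.0, " thủ": 2.0, " thành": 1.0}
probs = softmax(logits)
for token, p in probs.items():
    print(f"{token!r}: {p:.3f}")
```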

Hallucination occurs when a token sequence sounds linguistically plausible but is factually wrong or unsupported by the data in the context. In production, hallucination cannot be fully solved with a "better prompt"; it usually takes a combination of:

Retrieval from controlled sources.
Tools/APIs to fetch real-time data.
Citations or evidence.
Output validation.
Human review for high-risk workflows.
A golden evaluation set to measure errors per domain.

3. Tokenization: Why a Token Is Not a Word

A tokenizer turns text into units the model can process. A token can be a word, part of a word, whitespace, punctuation, or a byte-level fragment.

"AI Engineer" -> ["AI", " Engineer"] or the corresponding token IDs
"không" -> may be one token or several subwords, depending on the tokenizer
"customer_id=12345" -> usually split into several tokens

Production impact:

The same sentence can cost a different number of tokens on model A and model B.
Vietnamese, code, JSON, log lines, and text with many special characters can tokenize less efficiently than English prose.
Cost is usually billed per input token and per output token.
Latency grows with longer input and longer output.
Bad truncation can cut off instructions, constraints, or important data.

Practical rule: always measure token counts with the exact tokenizer of the model you are using; do not estimate from word counts.
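
To see why word count is a poor proxy, the sketch below compares three cheap length measures on the samples above. It deliberately does not use a real tokenizer (that depends on the model you run), so none of these numbers is a token count; the helper name `length_views` is made up for this illustration.

```python
import unicodedata

SAMPLES = ["AI Engineer", "không", "customer_id=12345"]


def length_views(text: str) -> dict[str, int]:
    """Return word, character, and UTF-8 byte counts for a string."""
    text = unicodedata.normalize("NFC", text)  # make accented letters comparable
    return {
        "words": len(text.split()),
        "chars": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }


for sample in SAMPLES:
    print(repr(sample), length_views(sample))
```

The three measures already disagree with each other on accented text and identifiers, and a subword tokenizer adds its own model-specific splitting on top, so the only reliable number comes from the model's own tokenizer.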

4. Training Pipeline: Pre-training, SFT, RLHF, and Preference Tuning

Stage	What the model learns	Typical input	Impact on the application
`Pre-training`	Predicting the next token over a very large corpus	Web text, code, books, mixed data	Language, patterns, broad knowledge, baseline code/reasoning ability
`Supervised fine-tuning` (`SFT`)	Following instructions from prompt/answer pairs	Instruction datasets, curated chat transcripts	Answers in assistant style, follows formats better
`RLHF`	Optimizing toward human preference	Human ranking/preference labels	More helpful/harmless output, more consistent style
`RLAIF`	Preference from AI feedback	AI-generated preferences	Cheaper to scale than human labels, still needs verification
`DPO` and other preference tuning	Learning directly from chosen/rejected pairs	Chosen answer vs rejected answer	Simpler pipeline than traditional RL, useful for aligning a model to a domain

Common misconceptions:

Pre-training does not guarantee that the model knows every up-to-date fact.
SFT does not by itself turn a model into a domain expert if the instruction data is insufficient.
RLHF does not turn a model into a source of truth; it makes the output match preferences better.
Fine-tuning is not the default solution for factual knowledge. For frequently changing data, RAG or a tool call is usually the better fit.

5. Context Window and Token Budget

The context window is the maximum number of tokens the model can see in one request, including both input and output. Even if the model has a large context window, you should not stuff everything into the prompt.

Compute the token budget as follows:

system prompt
+ developer/app instruction
+ prompt template
+ chat history or summary
+ retrieved documents
+ user input
+ tool results
+ reserved output tokens
<= model context window

Example:

Context window: 16,000 tokens
System + policy: 1,000
Chat history summary: 1,500
Retrieved docs: 8,000
User question: 500
Reserved output: 2,000
Safety margin: 1,000
Total: 14,000
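
The same arithmetic can be checked in code before every model call. This is a minimal sketch using the example numbers above; the names `BudgetReport` and `check_budget` are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetReport:
    total: int      # sum of all components, including reserved output
    headroom: int   # tokens left in the context window (negative if over)
    fits: bool


def check_budget(components: dict[str, int], context_window: int) -> BudgetReport:
    """Sum the per-component token counts and compare with the window."""
    total = sum(components.values())
    return BudgetReport(total, context_window - total, total <= context_window)


example = {
    "system_and_policy": 1_000,
    "chat_history_summary": 1_500,
    "retrieved_docs": 8_000,
    "user_question": 500,
    "reserved_output": 2_000,
    "safety_margin": 1_000,
}
report = check_budget(example, context_window=16_000)
print(report)
```

With these numbers the request uses 14,000 of 16,000 tokens, leaving 2,000 tokens of headroom.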

If you exceed the budget, do not blindly cut off the tail. Use a strategy:

Compress chat history into a summary with metadata.
Rerank retrieved documents to keep the most important chunks.
Remove boilerplate from the prompt.
Cap tool results.
Reserve output tokens first.
Fail fast with a clear error if the request exceeds policy.
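
Two of these strategies, keeping the best chunks after reranking and failing fast, can be sketched as below. The sketch assumes each chunk already carries a relevance score from a reranker and a token count from the model's tokenizer; `select_chunks` and `BudgetExceeded` are hypothetical names.

```python
class BudgetExceeded(ValueError):
    """Raised when even the mandatory components do not fit the window."""


def select_chunks(
    chunks: list[tuple[float, int]],  # (rerank score, token count) per chunk
    fixed_tokens: int,                # system + user + reserved output + margin
    context_window: int,
) -> list[int]:
    """Return indices of the chunks to keep, best score first."""
    remaining = context_window - fixed_tokens
    if remaining < 0:
        # Fail fast with a clear error instead of silently truncating.
        raise BudgetExceeded(
            f"fixed components need {fixed_tokens} tokens, window is {context_window}"
        )
    kept: list[int] = []
    ranked = sorted(range(len(chunks)), key=lambda i: chunks[i][0], reverse=True)
    for index in ranked:
        tokens = chunks[index][1]
        if tokens <= remaining:  # skip chunks that no longer fit
            kept.append(index)
            remaining -= tokens
    return kept


chunks = [(0.9, 900), (0.2, 900), (0.7, 900)]
print(select_chunks(chunks, fixed_tokens=14_000, context_window=16_000))
```

Because the output reservation is part of `fixed_tokens`, retrieved documents can never eat into the space the answer needs.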

Long context has trade-offs:

Benefit	Cost/risk
Fits more documents	Input token cost rises
Fewer retrieval calls in small cases	Prefill latency rises
Useful for analyzing long documents	The model may miss information in the middle of the context
Simpler v1 orchestration	PII/logging risk rises

6. Decoding Params: Controlling Output

Decoding is the step that selects the next token from the probability distribution.

Param	Meaning	Use when	Risk if misused
`temperature=0` or low	More stable output, favors high-probability tokens	Classification, extraction, JSON, compliance	Can be dry, less varied
`temperature=0.2-0.5`	Balances stability and flexibility	Support answers, summarization, formal rewrites	Still nondeterministic
`temperature>=0.7`	More creative	Brainstorming, copywriting, ideation	Format drift, higher factual risk
`top_p`	Samples within a cumulative probability mass	Cutting off very rare tokens	Tuning it carelessly together with temperature makes debugging hard
`top_k`	Chooses only among the top k tokens	Common in local runtimes	k too low makes the output impoverished
`max_tokens`	Caps the output	Controlling cost/latency	Too low truncates the output
`stop sequences`	Stops on a marker	Specific protocols/templates	A wrong marker cuts the output in the wrong place
`seed` if the provider supports it	Improves reproducibility	Tests and regression	Not every provider guarantees it absolutely

v1 rules:

Extraction/schema: temperature=0 or very low, tight max_tokens, validate the schema.
Customer support: temperature=0.2-0.4, with citations when answering from policy.
Creative writing: higher temperature, with human review if public-facing.
Do not tune temperature, top_p, and top_k at the same time before you have an eval.
Every change to decoding params must go through the golden set.
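
To make the interaction between temperature, top_k, and top_p concrete, here is a minimal sampling sketch over a toy set of logits. It is one common way to combine the three filters, not the exact order or implementation any particular provider or runtime uses; the function name `sample_token` and the logit values are assumptions.

```python
import math
import random
from typing import Optional


def sample_token(
    logits: dict[str, float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: float = 1.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one token after temperature scaling, top-k, and top-p filtering."""
    rng = rng or random.Random()
    if temperature <= 0:
        return max(logits, key=logits.get)  # treat 0 as greedy decoding

    # Temperature: divide logits before softmax (lower = sharper distribution).
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(score - peak) for tok, score in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    if top_k is not None:
        ranked = ranked[:top_k]  # top-k: keep only the k most likely tokens

    kept, cumulative = [], 0.0
    for tok, p in ranked:  # top-p: keep the smallest prefix reaching mass top_p
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    tokens = [tok for tok, _ in kept]
    probs = [p for _, p in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]  # weights are renormalized


logits = {" Hà": 5.0, " thủ": 2.0, " thành": 1.0}
print(sample_token(logits, temperature=0.0))
print(sample_token(logits, temperature=1.5, rng=random.Random(0)))
```

Since all three knobs reshape the same distribution, changing them together makes it hard to attribute a behavior change to any single one, which is the reason for the rule about tuning them separately.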

7. Hosted Model vs Local/Open-weight Model

Hosted closed model: the provider operates the model and you call an API. Example categories: GPT, Claude, Gemini.

Open-weight/local model: you use downloadable weights and serve them with a runtime such as Ollama, llama.cpp, vLLM, TGI, or a custom service. Example categories: Llama, Qwen, Mistral, DeepSeek-style open-weight models.

Criterion	Hosted model	Local/open-weight model
Go-live	Fast	Slower, because it needs a serving stack
Quality frontier	Usually strong, updated quickly	Depends on model, hardware, and tuning
Ops	Lighter	Needs GPU/CPU capacity, batching, monitoring
Privacy	Depends on the provider's data policy	Better control if self-hosted correctly
Cost	Easy to start, can get expensive at scale	High capex/infra, can be cheaper at high volume
Latency	Network + provider queue	Can be low if placed close to the app, but depends on hardware
Compliance	Needs vendor review	Needs review of license, model source, deployment controls
Upgrade	Provider changes fast, with deprecations	You control the version, but carry the burden yourself

Pragmatic decision rules:

POC or a product that needs high quality fast: start hosted.
Extremely sensitive data, offline, air-gapped, or strict compliance: consider local/open-weight.
Simple routing/classification/extraction tasks: benchmark a small model first.
Hard reasoning/code/multi-step tasks: use a stronger model or route to a fallback.
At large scale: calculate total cost of ownership, not just the price per token.

8. Cost, Latency, Security, and Performance

Cost

Cost per request usually consists of:

input_tokens * input_price
+ output_tokens * output_price
+ retrieval/tool cost
+ retry cost
+ observability/storage cost
+ human review cost, if any

Common causes of cost spikes:

Chat history is not summarized.
Too many retrieved chunks.
max_tokens is too generous.
Unbounded retries.
Users paste long logs or large files.
The prompt template contains redundant boilerplate.

Latency

Latency usually consists of:

client -> app validation -> retrieval/tool call -> LLM prefill -> token generation -> postprocess -> response

Points to remember:

Long input slows down prefill.
Long output feels slow to the user because tokens are generated sequentially.
Streaming improves perceived latency but does not make the total compute go away.
Batching raises throughput but can add queueing latency.
Local models require managing the KV cache, quantization, GPU memory, and cold starts.

Security

An LLM app has a different attack surface from a typical backend:

Prompt injection in user input or retrieved documents.
Data exfiltration through tool calls.
Secrets leaking into prompts or logs.
Output triggering wrong actions when there is no approval gate.
Model/provider policy that does not fit sensitive data.

Minimum controls:

Do not put API keys, passwords, or private tokens into the prompt.
Redact PII in logs unless there is a clear reason to keep it.
Scope tool permissions per user/session.
Validate output before calling any tool with side effects.
Rate-limit and set quotas per tenant.
Audit prompt/model/tool versions.
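
As an illustration of the redaction control, the sketch below masks email addresses and key-like strings before text reaches a log. The two patterns are deliberately simple assumptions for this example and are not an exhaustive PII or secret detector; production systems usually rely on a dedicated scanner plus an allowlist of loggable fields.

```python
import re

# Illustrative patterns only: real PII/secret detection needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET = re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9_-]{8,}\b")


def redact(text: str) -> str:
    """Replace emails and key-like strings with placeholders before logging."""
    text = EMAIL.sub("[EMAIL]", text)
    return SECRET.sub("[SECRET]", text)


print(redact("mail an@example.com key sk-abcdef123456"))
```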

9. Production Readiness

Can LLMs be used in production? Yes, if the scope is right and the operating conditions are explicit.

Usable when

The use case tolerates probabilistic behavior or has validation/human review.
There is a golden set representative of the domain to test prompt/model/decoding.
The output has a clear contract: JSON schema, citation requirement, or action policy.
There is monitoring: latency, token usage, cost/request, error rate, parse failures, user feedback.
There is a data policy: retention, PII, secret handling, vendor review.
There is a fallback: bounded retries, model fallback, cached answers, graceful degradation.
There is a rollback path when the model/prompt/provider changes.

Should not be used directly when

High-stakes decisions have no human approval.
Absolute factual correctness is required but there is no authoritative source/tool.
Data cannot be sent to a provider and there is no secure local deployment yet.
There is no way to measure quality beyond gut feeling.
Free-form output text is fed straight into a workflow with side effects.

Production v1 checklist

model_id, version, and decoding params are pinned.
The prompt has a version and an owner.
The token budget is computed before calling the model.
Output is validated against a schema or explicit rules.
A golden evaluation set exists.
Logs contain no raw PII by default.
There is a cost/latency/token dashboard.
There are timeouts, a retry budget, and a fallback.
There is a release process for prompt/model changes.
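
The "retry budget and fallback" item can be sketched as a small wrapper. This is an illustration under assumptions: `primary` and `fallback` are hypothetical callables standing in for model clients, only `TimeoutError` and `ConnectionError` are treated as retryable, and backoff between attempts is omitted for brevity.

```python
from typing import Any, Callable

RETRYABLE = (TimeoutError, ConnectionError)  # assumed transient failures


def call_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    prompt: str,
    max_attempts: int = 2,
) -> dict[str, Any]:
    """Try the primary model a bounded number of times, then the fallback."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"text": primary(prompt), "fallback_used": False, "attempts": attempt}
        except RETRYABLE:
            continue  # bounded retry: the loop ends after max_attempts
    try:
        return {"text": fallback(prompt), "fallback_used": True, "attempts": max_attempts}
    except RETRYABLE as exc:
        raise RuntimeError("primary and fallback both failed") from exc


def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")  # simulate a provider outage


def backup_model(prompt: str) -> str:
    return "fallback:" + prompt


result = call_with_fallback(flaky_primary, backup_model, "hi")
print(result)
```

Recording `fallback_used` and `attempts` in the result makes the retry and fallback rates measurable instead of invisible.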

10. Mini Architecture for Day 17

A minimal LLM wrapper that is still close to production:

API endpoint
  -> validate request size and tenant quota
  -> build prompt from versioned template
  -> estimate/count tokens
  -> call model with pinned config
  -> validate output contract
  -> log metrics without raw sensitive text
  -> return stable response

An example response should carry enough metadata to debug:

{
  "answer": "LLMs can be used in production if there are evals, monitoring, guardrails, and rollback.",
  "model": "example-model",
  "prompt_version": "llm-fundamentals-v1",
  "finish_reason": "stop",
  "usage": {
    "input_tokens": 812,
    "output_tokens": 72
  },
  "latency_ms": 1840
}

Do not expose the full raw provider response to the client. Normalize the response schema so the provider/model can change behind the abstraction.
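
A normalizer for one backend might look like the sketch below. The raw field names (`response`, `done_reason`, `prompt_eval_count`, `eval_count`) mirror the ones read by the Exercise 1 script; the `model` field and the function name `normalize_ollama` are assumptions for this illustration.

```python
from typing import Any


def normalize_ollama(
    raw: dict[str, Any], prompt_version: str, latency_ms: float
) -> dict[str, Any]:
    """Map a raw provider payload onto the stable response schema shown above."""
    return {
        "answer": str(raw.get("response", "")),
        "model": str(raw.get("model", "unknown")),  # assumed field name
        "prompt_version": prompt_version,
        "finish_reason": str(raw.get("done_reason", "unknown")),
        "usage": {
            "input_tokens": int(raw.get("prompt_eval_count") or 0),
            "output_tokens": int(raw.get("eval_count") or 0),
        },
        "latency_ms": latency_ms,
    }


raw = {
    "response": "ok",
    "model": "example-model",
    "done_reason": "stop",
    "prompt_eval_count": 812,
    "eval_count": 72,
    "provider_internal_field": "never reaches the client",
}
normalized = normalize_ollama(raw, "llm-fundamentals-v1", 1840.0)
print(normalized)
```

Each provider gets its own small normalizer, while clients only ever see the one stable shape.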

11. Trade-off Summary

Option	Use when	Avoid when	Production note
Hosted LLM	You need fast go-live, high quality, light ops	Data must not leave your systems, or cost is hard to control	Review vendor policy, log usage, have a fallback
Local/open-weight LLM	You need privacy/control/offline or have high volume	The team has no GPU ops or model-evaluation capability yet	Needs serving, security, monitoring, capacity planning
Large model	Ambiguous, reasoning, code, or multi-step tasks	Simple tasks where rules or a small model suffice	Use routing to avoid waste
Small model	Classification, extraction, routing, low latency	Deep reasoning or complex instructions are needed	Benchmark on your domain
Long context	You need to read a long document in one request	The corpus is large and searchable	Combine with RAG/rerank instead of stuffing the context
Low temperature	Output must be stable and parseable	Creative brainstorming	Still needs validation
High temperature	Ideation, creative drafts	Compliance, JSON, factual QA	Needs human review or guardrails

12. Conclusion

LLM fundamentals for an AI Engineer do not stop at "the model generates text". You need to see the LLM as a runtime dependency with its own cost, latency, security boundary, config, version, test suite, and failure modes. Once you understand tokenization, next-token prediction, context budgets, decoding, and model choice, Days 18-20 will be easier to learn, because every prompt, structured-output, and architecture decision rests on these constraints.

Reference Material

1. Quick Glossary

Term	Short explanation	Production note
Token	Unit of text after tokenization	Cost/latency scale with tokens, not words
Tokenizer	Encodes text into token IDs	Must match the model version
Context window	Maximum total tokens the model can see	Includes both input and output
Logits	Scores the model returns for the next token	Decoding turns logits into a token choice
Temperature	Adjusts randomness	Low for stability, high for creativity
Top-p	Nucleus sampling by probability mass	Use carefully together with temperature
Top-k	Restricts sampling to the top k tokens	Common in local runtimes
Max tokens	Caps output tokens	Controls cost and latency
SFT	Supervised fine-tuning on instructions	Improves instruction following
RLHF	Reinforcement learning from human feedback	Aligns to preferences, does not guarantee factuality
DPO	Direct Preference Optimization	Simpler preference tuning than traditional RLHF
Open-weight	Weights downloadable under a license	Does not imply free or production-safe

2. Token Budget Template

Use this template before designing a prompt or an endpoint.

# Token Budget

Use case:
Model/context window:
Max output tokens reserved:
Safety margin:

| Component | Estimated tokens | Required? | Notes |
|---|---:|---|---|
| System prompt |  | yes |  |
| Developer/app instruction |  | yes |  |
| User input |  | yes |  |
| Chat history |  | no | Summary or last N turns |
| Retrieved documents |  | no | Top chunks after rerank |
| Tool results |  | no | Truncate or summarize |
| Output reservation |  | yes | max_tokens |
| Safety margin |  | yes | avoid overflow |
| Total |  |  | must be <= context window |

Overflow strategy:
- Drop:
- Summarize:
- Rerank:
- Reject with clear error:

Quick rules:

Reserve output tokens before stuffing in docs.
With RAG, prefer fewer but highly relevant chunks.
With chat, do not keep the full history forever; use a summary with timestamp/source.
With large tool results, summarize or paginate.

3. Decoding Decision Table

Use case	Temperature	Top-p	Max tokens	Extra controls
JSON extraction	0-0.1	default/1.0	Tight	Schema validation, bounded retry/repair
Classification	0	default/1.0	Very low	Prefer enum output
Customer support answer	0.2-0.4	0.9-1.0	Moderate	Citation, policy source, refusal rule
Summarization	0.2-0.5	0.9-1.0	Per length target	Check coverage
Brainstorm	0.7-1.0	0.9-0.95	Wider	Human selection
Code generation	0.1-0.4	0.9-1.0	Per task	Tests, static analysis

There is no universally best config. The right config is the one that wins on the evaluation set for the specific use case.

4. Hosted Vs Local Decision Record

# LLM Model Decision Record

## Context

- Use case:
- Users:
- Data sensitivity:
- SLA:
- Expected traffic:
- Required languages:
- Expected output format:

## Options

| Option | Model/provider | Pros | Cons | Estimated cost | Estimated latency |
|---|---|---|---|---:|---:|
| Hosted strong model |  |  |  |  |  |
| Hosted small model |  |  |  |  |  |
| Local/open-weight model |  |  |  |  |  |

## Security and compliance

- Can data leave our infra:
- Retention policy:
- PII handling:
- Vendor review required:
- Open-weight license review required:

## Evaluation

- Golden set size:
- Quality metric:
- Format validity metric:
- Factuality/citation metric:
- Latency p50/p95:
- Cost per 1,000 requests:

## Decision

- Selected option:
- Why:
- Required guardrails:
- Fallback:
- Rollback:
- Review date:

5. Cost Worksheet

requests_per_day = 50,000
avg_input_tokens = 1,200
avg_output_tokens = 250
retry_rate = 3%

daily_input_tokens = requests_per_day * avg_input_tokens * (1 + retry_rate)
daily_output_tokens = requests_per_day * avg_output_tokens * (1 + retry_rate)

daily_llm_cost =
  daily_input_tokens / 1_000_000 * input_price_per_1m
+ daily_output_tokens / 1_000_000 * output_price_per_1m
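
The worksheet translates directly into code. The traffic numbers below come from the worksheet; the two prices are placeholder assumptions (USD per 1M tokens) chosen only to make the sketch runnable, so substitute your provider's actual rates.

```python
def daily_llm_cost(
    requests_per_day: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    retry_rate: float,
    input_price_per_1m: float,
    output_price_per_1m: float,
) -> float:
    """Daily token cost in USD, with retries inflating both input and output."""
    multiplier = 1 + retry_rate
    daily_input_tokens = requests_per_day * avg_input_tokens * multiplier
    daily_output_tokens = requests_per_day * avg_output_tokens * multiplier
    return (
        daily_input_tokens / 1_000_000 * input_price_per_1m
        + daily_output_tokens / 1_000_000 * output_price_per_1m
    )


cost = daily_llm_cost(
    requests_per_day=50_000,
    avg_input_tokens=1_200,
    avg_output_tokens=250,
    retry_rate=0.03,
    input_price_per_1m=0.50,   # assumed placeholder price
    output_price_per_1m=2.00,  # assumed placeholder price
)
print(f"daily cost: {cost:.2f} USD")
```

Under these assumed prices the input side contributes about 30.90 USD and the output side about 25.75 USD per day, so output is nearly half the bill despite being roughly a fifth of the tokens.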

Beyond token price, do not forget:

Embedding/retrieval cost if you use RAG.
Vector DB or search infrastructure.
Observability storage.
Human review.
GPU/CPU serving if you self-host.
On-call and capacity planning.

6. Latency Breakdown Template

# Latency Breakdown

| Step | p50 ms | p95 ms | Notes |
|---|---:|---:|---|
| API validation |  |  |  |
| Auth/quota |  |  |  |
| Retrieval/search |  |  |  |
| Reranking |  |  |  |
| Prompt build/token count |  |  |  |
| LLM prefill |  |  | Input length sensitive |
| LLM generation |  |  | Output length sensitive |
| Output validation |  |  |  |
| Total |  |  |  |

A sensible optimization order is usually:

Cut prompt boilerplate and redundant retrieved docs.
Reduce output verbosity.
Use streaming for UX.
Route simple tasks to a small model.
Cache deterministic results or a stable prefix if the provider/runtime supports it.
For local models, benchmark batching, quantization, and the serving runtime.

7. Observability Fields

Log metadata; do not log raw sensitive text by default.

{
  "request_id": "req_123",
  "tenant_id": "tenant_hash",
  "use_case": "support_answer",
  "prompt_version": "support-v3",
  "model": "provider/model-version",
  "decoding": {
    "temperature": 0.2,
    "top_p": 1.0,
    "max_tokens": 512
  },
  "usage": {
    "input_tokens": 1432,
    "output_tokens": 218
  },
  "latency_ms": 2410,
  "finish_reason": "stop",
  "validation_status": "passed",
  "fallback_used": false,
  "cost_estimate": 0.0
}

Minimum metrics:

p50/p95/p99 latency.
Input/output tokens per request.
Cost per request and cost per tenant.
Timeouts/rate limits/provider errors.
Schema validation failures.
Retry/fallback rate.
User thumbs up/down or a domain-specific quality signal.
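
For the latency percentiles, a nearest-rank calculation over logged `latency_ms` values is enough to get started. This is a sketch with made-up sample latencies; metrics backends often use interpolation or histogram approximations, so their numbers can differ slightly.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latency_ms samples taken from request logs.
latencies = [120, 150, 180, 200, 240, 300, 450, 900, 1200, 2400]
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies, pct)} ms")
```

The gap between the median and the tail in this sample shows why an average hides the slow requests that users actually notice.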

8. Output Contract Checklist

Does the output have a clear schema?
Does it use enums instead of free text for classes/actions?
Is there a maximum length?
Is there citation/evidence for factual QA?
Is there a policy for missing information?
Is there validation before writing to the DB or calling a tool?
Is there a repair retry, and what is the maximum number of retries?
Are there test cases for malformed output?
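
Several items on this checklist can be enforced with a small validator. The sketch below uses only the standard library; the contract itself (an `action` enum and a bounded `reason` string) is a hypothetical example, not a schema defined in this lesson.

```python
import json
from typing import Optional

ALLOWED_ACTIONS = {"reply", "escalate", "refund"}  # hypothetical enum
MAX_REASON_CHARS = 200                              # hypothetical length cap


def validate_output(raw_text: str) -> tuple[Optional[dict], Optional[str]]:
    """Return (parsed, None) when the contract holds, else (None, error_code)."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None, "invalid_json"
    if not isinstance(data, dict):
        return None, "not_an_object"
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None, "invalid_action"  # enum check instead of free text
    reason = data.get("reason")
    if not isinstance(reason, str) or not reason or len(reason) > MAX_REASON_CHARS:
        return None, "invalid_reason"  # enforces presence and maximum length
    return {"action": action, "reason": reason}, None


print(validate_output('{"action": "refund", "reason": "duplicate charge"}'))
print(validate_output("Sure! Here is the JSON you asked for..."))
```

Returning an error code rather than raising keeps the caller in control of the bounded repair retry and makes validation failures easy to count.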

9. Security Checklist

Do not put secrets into the prompt.
Do not log raw prompts/responses containing PII by default.
Have a data retention policy for the provider or for self-hosted logs.
Tool permissions are scoped per user/tenant.
Retrieved documents pass a permission filter before entering the context.
Output must not automatically trigger high-risk side effects.
Prompt injection tests exist.
Rate limits and quotas exist.
An audit trail exists for prompt/model/tool versions.

10. Production Readiness Answer

LLMs can be used in production when treated as an external probabilistic dependency:

Clear boundaries: input validation, token budget, output contract.
Quality control: golden set, eval before changing the prompt/model.
Runtime control: timeout, retry, fallback, rollback.
Cost control: token logging, quotas, budget alerts.
Security control: PII/secret handling, permission-aware retrieval, tool guardrails.
Observability: latency, tokens, cost, errors, quality feedback.

Without these conditions, an LLM can still be used for prototypes or low-risk internal workflows, but it should not yet automate important decisions.

Exercises

Hands-on objectives

Complete these exercises to collect real data on decoding params, token budget, cost, latency, output stability, and model choice. The final result is a short model decision note for a production-style LLM feature.

Environment requirements

Choose one of two paths:

Local: install Ollama and pull a small model, for example llama3.1:8b, qwen2.5:7b, or an equivalent model your machine can run.
Hosted: use a provider with an OpenAI-style compatible API. Do not commit API keys to the repo.

Python packages:

pip install -U requests pydantic

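Exercise 1: Run the Same Prompt with Multiple Decoding Params

Create a temporary file on your machine, for example day17_decode_experiment.py. The script below targets Ollama's /api/generate endpoint and uses the Pydantic v2 API; on the hosted path, adapt the request and response fields to your provider: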

from __future__ import annotations

import json
import os
import time
from dataclasses import dataclass
from typing import Any

import requests
from pydantic import BaseModel, Field, ValidationError


BASE_URL = os.getenv("OLLAMA_URL", "http://localhost:11434")
MODEL = os.getenv("LLM_MODEL", "llama3.1:8b")
TIMEOUT_SECONDS = float(os.getenv("LLM_TIMEOUT_SECONDS", "180"))

PROMPT = """Bạn là AI Engineer đang review một feature LLM.
Hãy trả lời bằng tiếng Việt có dấu.

Task:
Liệt kê 5 rủi ro production khi dùng LLM cho customer support.
Mỗi rủi ro gồm:
- risk: tên rủi ro ngắn
- impact: tác động
- mitigation: cách giảm rủi ro

Chỉ trả về JSON object hợp lệ theo schema:
{
  "risks": [
    {"risk": "...", "impact": "...", "mitigation": "..."}
  ]
}
"""


class Risk(BaseModel):
    risk: str = Field(min_length=3)
    impact: str = Field(min_length=10)
    mitigation: str = Field(min_length=10)


class RiskReport(BaseModel):
    risks: list[Risk] = Field(min_length=5, max_length=5)


@dataclass(frozen=True)
class RunConfig:
    name: str
    temperature: float
    top_p: float
    max_tokens: int


def call_ollama(config: RunConfig) -> dict[str, Any]:
    started = time.perf_counter()
    response = requests.post(
        f"{BASE_URL}/api/generate",
        json={
            "model": MODEL,
            "prompt": PROMPT,
            "stream": False,
            "options": {
                "temperature": config.temperature,
                "top_p": config.top_p,
                "num_predict": config.max_tokens,
            },
        },
        timeout=TIMEOUT_SECONDS,
    )
    response.raise_for_status()
    data = response.json()
    text = data.get("response", "")
    latency_ms = round((time.perf_counter() - started) * 1000, 2)

    validation_error = None
    parsed = None
    try:
        parsed = RiskReport.model_validate_json(text)
    except ValidationError as exc:
        validation_error = str(exc).splitlines()[0]

    return {
        "config": config.__dict__,
        "latency_ms": latency_ms,
        "prompt_tokens": data.get("prompt_eval_count"),
        "output_tokens": data.get("eval_count"),
        "finish_reason": data.get("done_reason"),
        "valid_json": parsed is not None,
        "validation_error": validation_error,
        "text_preview": text[:600],
    }


def main() -> None:
    configs = [
        RunConfig("deterministic", temperature=0.0, top_p=1.0, max_tokens=600),
        RunConfig("low_creativity", temperature=0.2, top_p=0.9, max_tokens=600),
        RunConfig("balanced", temperature=0.5, top_p=0.95, max_tokens=600),
        RunConfig("creative", temperature=0.9, top_p=0.95, max_tokens=600),
    ]

    results = []
    for config in configs:
        for attempt in range(1, 4):
            result = call_ollama(config)
            result["attempt"] = attempt
            results.append(result)
            print(json.dumps(result, ensure_ascii=False, indent=2))

    valid_count = sum(1 for item in results if item["valid_json"])
    print(
        json.dumps(
            {
                "total_runs": len(results),
                "valid_json_runs": valid_count,
                "json_validity_rate": round(valid_count / len(results), 4),
            },
            ensure_ascii=False,
            indent=2,
        )
    )


if __name__ == "__main__":
    main()

Record the results in a table:

Config	Attempt	Latency ms	Prompt tokens	Output tokens	Valid JSON	Stability notes
deterministic	1
deterministic	2
deterministic	3
creative	1

Questions:

Which config is the most stable?
Which config breaks JSON most easily?
How does the output token count affect latency?
For extraction/schema, which config would you choose?

Exercise 2: Token Budget for a Customer Support Use Case

Suppose you are building a support assistant:

System prompt: 700 tokens.
Developer instruction: 400 tokens.
Chat history summary: 1,200 tokens.
User message: 300 tokens.
Retrieved docs: 6 chunks, 900 tokens each.
Tool result: 1,000 tokens.
Model context window: 16,000 tokens.
You want to reserve 1,500 output tokens and a 1,000-token safety margin.

Compute:

total = system + developer + history + user + docs + tool + output + margin

Answer:

Does it exceed the context window?
If you need to cut 3,000 tokens, where do you cut first?
Why should you not cut the system prompt or the user message first?
When should you use reranking?

Suggested production answer:

Keep system/developer instructions short, but do not cut them blindly.
Reduce the number of retrieved chunks after reranking.
Summarize chat history.
Summarize or paginate the tool result.
Reserve a fixed number of output tokens according to the response contract.

Exercise 3: Cost Estimate

Suppose the provider charges:

Input: 0.50 USD / 1M tokens.
Output: 2.00 USD / 1M tokens.

Traffic:

30,000 requests/day.
Average input: 1,800 tokens.
Average output: 350 tokens.
Retry rate: 4%.

Compute:

daily_input_tokens = requests * avg_input * (1 + retry_rate)
daily_output_tokens = requests * avg_output * (1 + retry_rate)
daily_cost = input_tokens / 1_000_000 * input_price
           + output_tokens / 1_000_000 * output_price
monthly_cost = daily_cost * 30

Then answer:

If output grows from 350 to 900 tokens, how does the monthly cost change?
Which cost reduction affects quality the least?
At what level would you set a quota or budget alert?

Exercise 4: Hosted Vs Local Decision

Fill in the table for one of your own use cases.

Criterion	Hosted strong model	Hosted small model	Local/open-weight model
Quality expected
p95 latency expected
Cost/request
Data sensitivity
Ops complexity
Security/compliance risk
Upgrade/rollback
Decision

The conclusion should take this form:

For production v1, I choose ...

Reason:
- ...

Required conditions:
- Golden set:
- Logging:
- Data policy:
- Fallback:
- Rollback:

Exercise 5: Production Readiness Review

Review the following pseudo-design:

The frontend sends the entire chat history and file content to the API.
The API concatenates strings into the prompt.
The API calls the strongest model with temperature 0.8 and max_tokens 4000.
The raw response text is shown to the user and stored verbatim in the database.
Token usage is not logged.
Output is not validated.
There is no eval set.

Find at least 10 problems, categorized by:

Cost.
Latency.
Security/privacy.
Reliability.
Quality.
Maintainability.

Propose a revised design following this flow:

validate request
-> permission filter
-> summarize/truncate history
-> retrieve/rerank documents
-> build versioned prompt
-> count token budget
-> call model with task-specific decoding
-> validate output
-> log safe metadata
-> fallback or return stable response

Final Deliverable

Create a short note:

# Day 17 Model Choice Notes

## Use case

## Token budget

## Decoding config

## Hosted vs local decision

## Cost estimate

## Latency expectation

## Security policy

## Production readiness

Can an LLM be used in production for this use case?
If yes, what are the mandatory conditions?
If not yet, what are the blockers?

Completion checklist:

Ran or simulated the decoding experiment.
Have a latency/token/output-validity table.
Computed the token budget.
Computed the cost estimate.
Made a hosted vs local decision.
Gave a clear production readiness answer.