Day 30: Quantization & Deploy Local Model API

Objectives

After this lesson, you should be able to:

Explain FP32, FP16, BF16, INT8, and INT4, and how dtype affects memory, latency, throughput, and cost.
Distinguish GGUF, AWQ, and GPTQ in terms of format, runtime, and deployment.
Estimate RAM/VRAM for model weights, KV cache, runtime overhead, and concurrency.
Choose a quantization level based on context rather than gut feeling.
Expose a local model through a FastAPI gateway with request/response schemas, health, readiness, timeout, a concurrency limit, and structured logging.
Benchmark latency, throughput, tokens/sec, memory usage, and quality regression before and after quantization.
Answer clearly: can this be used in production, and if so, under what conditions?

TL;DR

Quantization reduces the memory footprint by storing weights at lower precision, for example INT8 or INT4 instead of FP16/BF16. It lets you run a larger model on the same hardware and can cut cost, but it does not automatically make the model faster or better. The bottleneck may shift to the KV cache, prefill, decode kernels, CPU memory bandwidth, GPU occupancy, or the network/API queue.

A production local model API should not be just a script that calls a model. You need a gateway with a clear schema, timeout, readiness, concurrency control, logging, benchmarks, quality eval, fallback, and rollback. If you do not measure latency, memory, and quality together, you do not really know whether the quantized model is usable.

1. Where This Lesson Sits In The Roadmap

Days 25-28 help you decide when to fine-tune, prepare a dataset, run LoRA/QLoRA, and evaluate the model. Day 29 introduces Ollama, llama.cpp, vLLM, and local LLM runtimes. Day 30 is the step where you package a local model as an API that can go into a real system.

Day 25: deciding between RAG/fine-tune/tool/prompt
Day 26: instruction tuning dataset
Day 27: LoRA/QLoRA hands-on
Day 28: evaluation before/after fine-tuning
Day 29: local LLM runtimes: Ollama, llama.cpp, vLLM
Day 30: quantization + local model API + benchmark

The main skill for today is not knowing many format names. It is making an engineering decision: which model, which quantization, which runtime, on which hardware, under which SLA, and how much regression is acceptable.

2. Mental Model: What Precision Is

A model has billions of parameters. Each parameter is a number. Precision determines how many bytes store that number and how finely it can be represented.

Dtype	Bytes/param	Typical use	Strengths	Weaknesses
FP32	4	Training baseline, research, CPU ops that need accuracy	Stable, few numerical errors	Too memory-hungry for LLM inference
FP16	2	Common for GPU inference/training	Fast, 50% smaller than FP32	Narrower dynamic range than BF16
BF16	2	Newer GPUs, stable training/inference	Dynamic range close to FP32, wider than FP16	Not all hardware is optimized for it
INT8	1	Quantized inference	Large memory reduction, quality usually still good	Needs good kernel/runtime support
INT4	0.5	Local/edge/limited GPU VRAM	Can run much larger models	Quality regression is more visible; kernels depend on the runtime

Simple rule:

weights_memory_gb ~= params_billion * bytes_per_param

Rough examples for weights only, excluding KV cache and overhead:

Model size	FP32	FP16/BF16	INT8	INT4 in practice
3B	~12GB	~6GB	~3GB	~1.8-2.5GB
7B	~28GB	~14GB	~7GB	~4-5.5GB
13B	~52GB	~26GB	~13GB	~7.5-10GB
70B	~280GB	~140GB	~70GB	~38-50GB

In practice, INT4 is usually larger than params * 0.5 bytes because of scales, metadata, group size, tensor alignment, and runtime overhead.
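The rule above can be sketched as a small helper. This is a minimal sketch, not a measurement tool: `weights_memory_gb` and the `overhead_ratio` parameter are hypothetical names introduced here, and the 25% overhead used for INT4 in the example is an assumed illustrative value, not a figure from any specific format.

```python
# Rough weight-memory estimate: params (in billions) * bytes per parameter.
# 1 GB is treated as 1e9 bytes, matching the rough tables above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_memory_gb(params_billion: float, dtype: str, overhead_ratio: float = 0.0) -> float:
    """Return approximate weight memory in GB.

    overhead_ratio is an assumed knob for scales/metadata/alignment;
    measure the real file or VRAM usage before trusting it.
    """
    if overhead_ratio < 0:
        raise ValueError("overhead_ratio must be >= 0")
    return params_billion * BYTES_PER_PARAM[dtype.lower()] * (1.0 + overhead_ratio)


# Example: a 7B model at different precisions (INT4 with an assumed 25% overhead).
for name, ratio in (("fp32", 0.0), ("fp16", 0.0), ("int8", 0.0), ("int4", 0.25)):
    print(name, round(weights_memory_gb(7, name, ratio), 2), "GB")
```

Treat the output as a first filter for "does this even fit", then confirm with the runtime's own memory report.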

3. FP32, FP16, BF16 Step By Step

FP32

FP32 is the easiest baseline to understand: every number uses 32 bits. When training or debugging a numerical issue, FP32 is the safer choice. For LLM inference, FP32 is almost never economical because memory and bandwidth requirements are too high.

Use FP32 when:

You are verifying correctness on a small model.
A specific operation is unstable at lower precision.
You run CPU inference for a small model and latency does not matter.

Do not use FP32 for production chat LLMs if FP16/BF16 or a quantized option is available.

FP16

FP16 halves memory compared with FP32 and is very well supported on GPUs. Many GPU model-serving setups use FP16 as the inference baseline.

Use FP16 when:

The GPU has enough VRAM.
You need a quality baseline before quantizing.
The runtime/kernel supports FP16 well.

Risk: some workloads can hit overflow/underflow or quality issues if the model is not suited to FP16.

BF16

BF16 also uses 2 bytes like FP16 but has a wider dynamic range: it keeps the same 8 exponent bits as FP32 and gives up mantissa bits instead. On modern GPUs, BF16 is usually a good choice for training/inference when it is supported.

Use BF16 when:

The GPU supports BF16 well.
You want better numerical stability than FP16.
The model checkpoint or runtime recommends BF16.

Trade-off: if the hardware or kernel is not optimized for BF16, latency may be worse than FP16.

4. INT8 And INT4 Step By Step

Quantization converts weights from high-precision floating-point numbers to low-precision integers plus a scale. The goal is to reduce memory and memory bandwidth.
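To make "integers plus a scale" concrete, here is a minimal sketch of symmetric absmax quantization in pure Python. It assumes a single scale for the whole list of weights; real formats such as GGUF K-quants, AWQ, and GPTQ use per-group scales and more elaborate methods, so this illustrates the idea rather than any of those algorithms. The function names are hypothetical.

```python
def quantize_absmax(weights: list[float], bits: int = 8) -> tuple[list[int], float]:
    """Map floats to signed integers that share one scale (symmetric absmax)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats: integer * scale."""
    return [q * scale for q in quantized]


def max_abs_error(original: list[float], restored: list[float]) -> float:
    """Largest absolute difference between original and restored weights."""
    return max(abs(a - b) for a, b in zip(original, restored))


weights = [0.5, -1.0, 0.25, 0.0]
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    error = max_abs_error(weights, dequantize(q, scale))
    print(f"INT{bits}: scale={scale:.5f} max_abs_error={error:.5f}")
```

Fewer bits mean a coarser step size (`scale`), so the rounding error per weight grows. That is the mechanism behind the quality regression discussed below.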

INT8

INT8 is usually a good balance point when you want to reduce memory but are not ready to accept the larger regression that INT4 can bring.

Suitable when:

GPU/CPU memory is not enough for FP16/BF16.
The task needs reasonably stable quality.
You have a golden set to measure regression.

Do not choose it if the runtime lacks a good INT8 kernel. In that case INT8 may save memory without improving latency noticeably.

INT4

INT4 reduces memory further and is often the key to running 7B/13B models on a laptop or a small GPU. But INT4 degrades quality more easily, especially on reasoning, math, code, structured output, and Vietnamese when the base model is weak.

Suitable when:

The main goal is fitting into RAM/VRAM.
Traffic is low or medium.
The task is not very sensitive to small errors.
There is a fallback or human review.

Not suitable when:

The output must match a schema or numbers exactly.
The SLA is strict, traffic is high, and context is long.
There is no eval before/after quantization.

5. GGUF, AWQ, GPTQ

There is no such thing as "all INT4 is the same". You have to state the format, runtime, kernel, model, and hardware.

Format	Common runtimes	Strengths	Limitations	Good context
GGUF	llama.cpp, Ollama	Local CPU, Apple Silicon, GPU offload, portable file	Large-scale production needs more of your own design	Local dev, edge, small internal tools
AWQ	vLLM, TensorRT-LLM, ExLlama depending on model	INT4 GPU inference, usually preserves quality well	Needs matching kernel/runtime support	GPU serving that wants to save VRAM
GPTQ	ExLlama, AutoGPTQ, some GPU runtimes	Widely used post-training quantization	Quality/kernel depends on the checkpoint	Community models, small GPUs

GGUF usually comes in levels such as Q8_0, Q6_K, Q5_K_M, and Q4_K_M. For a local LLM, Q4_K_M is usually a practical starting point; Q5_K_M or Q6_K is better if memory allows; Q8_0 is close to INT8 but heavier than the lower levels.

AWQ and GPTQ usually show up in GPU serving. They are not inherently better than GGUF; they are good when your runtime is optimized for them.

6. What The KV Cache Is And Why It Matters

A transformer generates text autoregressively: each new token depends on the tokens before it. To avoid recomputing the whole context at every step, the runtime stores the attention keys/values in the KV cache.

Rough formula:

kv_cache_bytes ~= layers * 2 * kv_heads * head_dim * seq_len * concurrent_sequences * bytes_per_kv_element

Meaning:

layers: the deeper the model, the larger the KV cache.
2: there is a key and a value.
kv_heads: GQA/MQA reduces the number of KV heads, which saves memory.
seq_len: both long prompts and long outputs grow the cache.
concurrent_sequences: the higher the concurrency, the larger the cache.
bytes_per_kv_element: FP16/BF16 is usually 2 bytes; some runtimes support KV cache quantization.

Intuition example:

A 7B INT4 model may need only ~5GB for weights,
but with long context and many parallel requests,
KV cache + overhead can push an 8GB GPU into OOM.
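The formula can be checked with a few lines of Python. This is a sketch under stated assumptions: the configuration below (32 layers, 32 KV heads, head_dim 128) is an assumed 7B-class model with full multi-head attention, and the 8-KV-head variant is an assumed GQA configuration for comparison. Read the real values from the model's config before using the numbers.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    concurrent_sequences: int,
    bytes_per_kv_element: int = 2,  # FP16/BF16 cache
) -> int:
    """Rough KV cache size: one key and one value vector per layer, head, and token."""
    return layers * 2 * kv_heads * head_dim * seq_len * concurrent_sequences * bytes_per_kv_element


GIB = 1024 ** 3

# Assumed 7B-class config with full multi-head attention, 4k context, 1 sequence.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, concurrent_sequences=1)
# Same shape but with an assumed 8 KV heads (GQA).
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096, concurrent_sequences=1)
# Four concurrent 4k sequences on the MHA config.
busy = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, concurrent_sequences=4)

print(f"MHA, 1 x 4k:  {mha / GIB:.2f} GiB")
print(f"GQA, 1 x 4k:  {gqa / GIB:.2f} GiB")
print(f"MHA, 4 x 4k:  {busy / GIB:.2f} GiB")
```

Under these assumptions, four concurrent 4k sequences alone need more cache than an 8GB GPU has in total, before counting any weights.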

Production implications:

Set max_context, max_prompt_tokens, max_tokens, and max_concurrency.
Do not enable 32k/128k context just because the model supports it if the product does not need it.
Measure p95/p99 latency with long prompts, not only short ones.
Track peak memory as concurrency increases.

7. Practical VRAM Estimation

A deployment estimate has to account for all four parts:

required_memory =
  weights_memory
  + kv_cache_memory
  + runtime_overhead
  + fragmentation_and_safety_margin

Estimation checklist:

Choose the model size and quantization.
Compute rough weights memory.
Estimate context length and concurrency.
Compute the KV cache or use the runtime's memory profiling.
Add 10-30% overhead depending on the runtime.
Run a warmup and a real benchmark.
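The checklist can be turned into a small calculator. This is a hedged sketch: `estimate_required_memory`, `fits`, and the default ratios (20% overhead, 10% headroom) are assumptions chosen for illustration within the 10-30% range above, not values from any runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEstimate:
    weights_gb: float
    kv_cache_gb: float
    overhead_gb: float

    @property
    def total_gb(self) -> float:
        return self.weights_gb + self.kv_cache_gb + self.overhead_gb


def estimate_required_memory(
    weights_gb: float, kv_cache_gb: float, overhead_ratio: float = 0.2
) -> MemoryEstimate:
    """Add runtime overhead as a fraction of weights + KV cache (assumed 20% by default)."""
    overhead_gb = (weights_gb + kv_cache_gb) * overhead_ratio
    return MemoryEstimate(weights_gb, kv_cache_gb, overhead_gb)


def fits(estimate: MemoryEstimate, device_gb: float, headroom_ratio: float = 0.1) -> bool:
    """Require the estimate to stay below the device size minus a safety headroom."""
    return estimate.total_gb <= device_gb * (1.0 - headroom_ratio)


# Example: ~4.5GB INT4 weights plus 2GB KV cache on an 8GB and a 16GB GPU.
estimate = estimate_required_memory(weights_gb=4.5, kv_cache_gb=2.0)
print(round(estimate.total_gb, 2), "GB needed")
print("fits 8GB:", fits(estimate, 8.0))
print("fits 16GB:", fits(estimate, 16.0))
```

An estimate that passes this check is only a candidate; the warmup and benchmark in the last checklist step remain the deciding evidence.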

Example decisions:

Hardware	Feasible starting model	Suggestion
Laptop 16GB RAM, no GPU	3B/7B GGUF Q4	Use Ollama/llama.cpp with a moderate context
GPU 8GB VRAM	7B INT4, low-to-medium context	Limit concurrency, test for OOM
GPU 16GB VRAM	7B FP16 or 13B INT4	If the task is hard, try 7B FP16 first; ~14GB of weights leaves little room for KV cache
GPU 24GB VRAM	13B INT8 or 30B+ INT4 depending on runtime (13B FP16 needs ~26GB for weights alone)	Measure throughput/p95 seriously
Multi-GPU	vLLM/TGI/TensorRT-LLM	Needs ops, batching, monitoring

8. Throughput Vs Quality

Latency is the time one request takes. Throughput is the total amount of work per unit of time, for example requests/sec or output tokens/sec. Quality is how correct the result is.

These three usually pull against each other:

Larger model: quality may improve; latency/memory/cost go up.
Lower-bit quantization: memory goes down, quality may go down, speed depends on the kernel.
Larger batch: throughput goes up, per-request latency may go up.
Longer context: the answer may have more information to work with; prefill is slower and the KV cache grows.
Higher concurrency: better GPU utilization; p95/p99 may get worse.

Minimum metrics:

Metric	Why it is needed
TTFT	How quickly the user perceives the first response
Total latency	How long the request takes to complete
Output tokens/sec	Decode speed
Requests/sec	API throughput
RAM/VRAM peak	Whether it fits the hardware
Error rate	Timeout, OOM, 5xx, schema failures
Format accuracy	Whether JSON/tool args are still correct
Task score	Real quality on the golden set
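The timing metrics in the table can be derived from three timestamps per request. The sketch below assumes you can observe when the first token arrives, which requires a streaming response; the non-streaming gateway in this lesson only exposes total latency. `RequestTiming` and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTiming:
    sent_at: float         # seconds, e.g. from time.perf_counter()
    first_token_at: float  # when the first streamed token arrived
    finished_at: float     # when the last token arrived
    output_tokens: int


def ttft_ms(t: RequestTiming) -> float:
    """Time to first token: queueing + prefill as seen by the client."""
    return (t.first_token_at - t.sent_at) * 1000


def total_latency_ms(t: RequestTiming) -> float:
    """End-to-end request time."""
    return (t.finished_at - t.sent_at) * 1000


def decode_tokens_per_s(t: RequestTiming) -> float:
    """Decode speed: tokens generated after the first one, per second of decode time."""
    decode_s = t.finished_at - t.first_token_at
    if decode_s <= 0 or t.output_tokens <= 1:
        return 0.0
    return (t.output_tokens - 1) / decode_s


timing = RequestTiming(sent_at=0.0, first_token_at=0.25, finished_at=2.25, output_tokens=101)
print(ttft_ms(timing), total_latency_ms(timing), decode_tokens_per_s(timing))
```

Separating TTFT from decode speed matters because a long prompt mostly hurts the former while a long output mostly hurts the latter.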

9. Deployment Architecture

Do not let the product app call the local runtime directly if you need production control.

Client / Product service
  -> FastAPI Local Model Gateway
      -> auth / API key / tenant policy
      -> request validation
      -> prompt template version
      -> timeout and max_tokens
      -> concurrency limiter
      -> structured logging / trace_id
      -> readiness check
      -> LocalLLMClient
          -> Ollama / llama.cpp server / vLLM / TGI
  -> Response

What the FastAPI gateway adds:

A stable API contract even when the underlying runtime changes.
The ability to enforce per-tenant limits.
A place to log metrics and apply redaction.
Standard timeout/fallback behavior.
/health and /ready endpoints for the orchestrator.
Room to add canary, model routing, and A/B tests.

According to the current FastAPI docs, you should use Pydantic models for validation/response filtering, response_model for the OpenAPI contract, HTTPException for controlled errors, and lifespan for startup/shutdown or readiness state.

10. Near-Production FastAPI Template

A more complete template is in document.md. This is the essential skeleton:

from __future__ import annotations

import asyncio
import json
import logging
import os
import time
import uuid
from contextlib import asynccontextmanager
from typing import Any

import httpx
import psutil
from fastapi import FastAPI, HTTPException, Request
from pydantic import BaseModel, Field

logger = logging.getLogger("local_model_api")
logging.basicConfig(level=os.getenv("LOG_LEVEL", "INFO"))


class Settings(BaseModel):
    runtime_base_url: str = os.getenv("LOCAL_LLM_BASE_URL", "http://localhost:11434/v1")
    api_key: str = os.getenv("LOCAL_LLM_API_KEY", "local")
    model: str = os.getenv("LOCAL_LLM_MODEL", "llama3.2")
    runtime: str = os.getenv("LOCAL_LLM_RUNTIME", "ollama")
    request_timeout_s: float = float(os.getenv("REQUEST_TIMEOUT_S", "60"))
    max_concurrency: int = int(os.getenv("MAX_CONCURRENCY", "4"))


settings = Settings()
semaphore = asyncio.Semaphore(settings.max_concurrency)
ready_state = {"ready": False, "last_error": None}


class ChatRequest(BaseModel):
    message: str = Field(min_length=1, max_length=8000)
    system: str = Field(default="You are a concise internal assistant.", max_length=2000)
    temperature: float = Field(default=0.2, ge=0.0, le=1.5)
    max_tokens: int = Field(default=512, ge=1, le=2048)


class ChatResponse(BaseModel):
    answer: str
    model: str
    runtime: str
    latency_ms: float
    memory_rss_mb: float
    trace_id: str


async def call_openai_compatible(req: ChatRequest) -> str:
    headers = {"Authorization": f"Bearer {settings.api_key}"}
    payload: dict[str, Any] = {
        "model": settings.model,
        "messages": [
            {"role": "system", "content": req.system},
            {"role": "user", "content": req.message},
        ],
        "temperature": req.temperature,
        "max_tokens": req.max_tokens,
    }
    timeout = httpx.Timeout(settings.request_timeout_s)
    async with httpx.AsyncClient(timeout=timeout) as client:
        response = await client.post(
            f"{settings.runtime_base_url.rstrip('/')}/chat/completions",
            headers=headers,
            json=payload,
        )
        response.raise_for_status()
        data = response.json()
        return data["choices"][0]["message"]["content"] or ""


@asynccontextmanager
async def lifespan(app: FastAPI):
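    # Probe the runtime once at startup; /ready reports the result.
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            response = await client.get(f"{settings.runtime_base_url.rstrip('/')}/models")
            # Without this check, a 4xx/5xx from the runtime would still mark the gateway ready.
            response.raise_for_status()
        ready_state["ready"] = True
    except Exception as exc:
        ready_state["last_error"] = str(exc)
        logger.warning(json.dumps({"event": "model_runtime_not_ready", "error": str(exc)}))
    yield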


app = FastAPI(title="Local Model API", version="1.0.0", lifespan=lifespan)


@app.get("/health")
async def health():
    return {"status": "ok"}


@app.get("/ready")
async def ready():
    if not ready_state["ready"]:
        raise HTTPException(status_code=503, detail=ready_state)
    return {"status": "ready", "model": settings.model, "runtime": settings.runtime}


@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest, request: Request):
    trace_id = request.headers.get("x-request-id", str(uuid.uuid4()))
    start = time.perf_counter()

    try:
        async with semaphore:
            answer = await asyncio.wait_for(
                call_openai_compatible(req),
                timeout=settings.request_timeout_s + 2,
            )
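    except (asyncio.TimeoutError, httpx.TimeoutException) as exc:
        # asyncio.TimeoutError is only an alias of the builtin TimeoutError on Python 3.11+,
        # and httpx raises its own TimeoutException when the runtime is too slow.
        raise HTTPException(status_code=504, detail={"trace_id": trace_id, "error": "timeout"}) from exc
    except httpx.HTTPStatusError as exc:
        # Report only the upstream status code; str(exc) would expose the internal runtime URL.
        raise HTTPException(
            status_code=502,
            detail={"trace_id": trace_id, "error": f"runtime returned {exc.response.status_code}"},
        ) from exc
    except Exception as exc:
        raise HTTPException(status_code=500, detail={"trace_id": trace_id, "error": type(exc).__name__}) from exc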

    latency_ms = (time.perf_counter() - start) * 1000
    memory_rss_mb = psutil.Process().memory_info().rss / 1024 / 1024

    logger.info(
        json.dumps(
            {
                "event": "chat_completed",
                "trace_id": trace_id,
                "model": settings.model,
                "runtime": settings.runtime,
                "latency_ms": round(latency_ms, 2),
                "input_chars": len(req.message),
                "output_chars": len(answer),
                "memory_rss_mb": round(memory_rss_mb, 2),
            },
            ensure_ascii=False,
        )
    )

    return ChatResponse(
        answer=answer,
        model=settings.model,
        runtime=settings.runtime,
        latency_ms=round(latency_ms, 2),
        memory_rss_mb=round(memory_rss_mb, 2),
        trace_id=trace_id,
    )

This is still a sample gateway, not a complete platform. Real production needs auth, rate limiting, redaction, a metrics exporter, a container health policy, canary, and a deployment manifest on top of it. Note that memory_rss_mb is the RSS of the gateway process itself, not of the model runtime.

11. Benchmarking Latency And Memory

The benchmark must separate at least three groups:

Short prompt, short output: measures baseline overhead.
Long prompt, short output: measures prefill.
Medium prompt, long output: measures decode.

Minimal benchmark:

pip install -U httpx psutil
python benchmark_local_api.py --url http://localhost:9000/chat --concurrency 4 --repeat 5

When recording results, always include:

Model id and revision.
Quantization format: for example GGUF Q4_K_M, AWQ INT4, GPTQ INT4, FP16.
Runtime and version.
Hardware: CPU, RAM, GPU, VRAM.
Context length, max output tokens, concurrency.
p50, p95, p99, error rate, and output tokens/sec if measurable.
RAM/VRAM peak after warmup.
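One way to make sure none of these fields is forgotten is to store each run as a structured record. The sketch below is an assumption about how you might do that: `BenchmarkRecord` and `to_json_line` are hypothetical names, and every value in the example is a placeholder, not a measured result.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkRecord:
    model_id: str
    model_revision: str
    quantization: str
    runtime: str
    runtime_version: str
    hardware: str
    context_length: int
    max_output_tokens: int
    concurrency: int
    p50_ms: float
    p95_ms: float
    p99_ms: float
    error_rate: float
    output_tokens_per_s: Optional[float]  # None when the runtime does not report it
    peak_memory_mb: float


def to_json_line(record: BenchmarkRecord) -> str:
    """Serialize one run as a single JSON line, ready to append to a results file."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


# Placeholder values only; replace every field with what you actually measured.
example = BenchmarkRecord(
    model_id="<model-id>",
    model_revision="<revision>",
    quantization="GGUF Q4_K_M",
    runtime="ollama",
    runtime_version="<version>",
    hardware="<cpu/ram/gpu/vram>",
    context_length=4096,
    max_output_tokens=300,
    concurrency=4,
    p50_ms=0.0,
    p95_ms=0.0,
    p99_ms=0.0,
    error_rate=0.0,
    output_tokens_per_s=None,
    peak_memory_mb=0.0,
)
print(to_json_line(example))
```

Because every field is required, a run with a missing runtime version or hardware description fails at construction time instead of producing a result nobody can compare later.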

12. Production Decision: Is It Usable

Yes, a local quantized model API can be used in production, but only when the following conditions are met:

Quality meets the threshold on the product's real golden set, not just on manually tested prompts.
There is a baseline comparison against FP16/BF16 or a stronger hosted model.
p95/p99 latency meets the SLA under realistic traffic and prompt lengths.
Memory does not OOM after warmup, the concurrency test, and long contexts.
The API gateway has validation, timeout, a concurrency limit, health/readiness, structured logging, and metrics.
There is a fallback or degradation path for when the local runtime times out, runs out of memory, or crashes.
The licenses of the base model and the quantized checkpoint have been reviewed.
There is a policy against logging PII, secrets, or sensitive raw prompts.
There is a rollout plan: canary, rollback, model version pinning.

Do not call it production if it only runs as a local demo, without eval, monitoring, timeout, a memory test, or rollback.
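The fallback/degradation condition can be sketched with plain asyncio. This is an illustration under assumptions: `with_fallback` is a hypothetical helper, the two coroutines in the demo are fakes standing in for the local runtime call and a degraded path, and the set of exceptions that should trigger a fallback in a real gateway (for example httpx errors) is something you would decide for your own stack.

```python
import asyncio
from typing import Awaitable, Callable


async def with_fallback(
    primary: Callable[[], Awaitable[str]],
    fallback: Callable[[], Awaitable[str]],
    timeout_s: float,
) -> tuple[str, str]:
    """Try the primary call within timeout_s; on timeout or failure use the fallback.

    Returns (answer, source) so the caller can log which path served the request.
    """
    try:
        answer = await asyncio.wait_for(primary(), timeout=timeout_s)
        return answer, "primary"
    except (asyncio.TimeoutError, ConnectionError, RuntimeError):
        # Assumed failure types for this sketch; extend for your HTTP client.
        return await fallback(), "fallback"


async def slow_local_model() -> str:
    # Fake local runtime that is too slow for the timeout below.
    await asyncio.sleep(1.0)
    return "full answer"


async def degraded_answer() -> str:
    # Fake degradation path, e.g. a canned message or a smaller model.
    return "degraded answer"


result = asyncio.run(with_fallback(slow_local_model, degraded_answer, timeout_s=0.05))
print(result)
```

Returning the source alongside the answer keeps the fallback rate measurable, which is what tells you whether the degradation path is an exception or the norm.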

13. Trade-offs And Best Solution By Context

Context	Best starting solution	Why	What to avoid
Local dev, learning, internal demo	Ollama or llama.cpp + GGUF Q4_K_M/Q5_K_M	Fast setup, little ops	Do not infer production throughput from a laptop demo
Small internal RAG, private data	FastAPI gateway + Ollama/llama.cpp + quantized 7B/8B model	Enough privacy control, low cost	Overly long context, no citation eval
Production API with moderate traffic and a GPU	vLLM + AWQ/GPTQ or FP16 + FastAPI gateway	Better throughput, better batching	A runtime that does not support the quant format
Quality-sensitive reasoning/code	A better FP16/BF16 model; quantize only after eval	Less regression	Choosing INT4 only to save VRAM
Edge/offline	GGUF Q4/Q5, short prompts, narrow task	Fits the hardware	Promising an SLA like a large cloud model
Cost optimization for a narrow task	Distill/fine-tune a small model + INT8/INT4	Cheap and fast if the task is stable	Ignoring drift and quality monitoring

14. Checklist

15. Quick Quiz

Why can a 7B INT4 model still OOM on an 8GB GPU when the context is long?
FP16 and BF16 both use 2 bytes; what is the important difference between them?
When should you choose INT8 over INT4?
Which runtimes is GGUF best suited to?
Why must a benchmark record the model revision, runtime version, and hardware?
How do /health and /ready differ?
What does a FastAPI gateway add if Ollama/vLLM already has an API?
How should the quality regression from quantization be measured?
Batching helps throughput, but which metric can it make worse?
What are the minimum conditions for a local model API to be used in production?

Reference Document

1. Decision Framework

A quantization decision should follow this order:

1. Define the task and SLA
2. Choose a baseline: an FP16/BF16 model or a strong hosted model
3. Build the golden set and metrics
4. Choose the runtime based on the hardware
5. Try quantization against the memory target
6. Benchmark latency + throughput + memory
7. Evaluate quality regression
8. Decide on production/canary/rollback

If you skip steps 2 and 3, you are only optimizing cost without knowing what you are giving up.

2. Runtime And Quantization Matrix

Runtime	Common formats	Strengths	Production concern
Ollama	GGUF	Local dev, small internal services, simple API	Less control over batching/throughput than vLLM
llama.cpp server	GGUF	CPU, Apple Silicon, edge, GPU offload	You have to build more of the API/ops yourself
vLLM	FP16/BF16, AWQ/GPTQ depending on version/model	GPU throughput, continuous batching	Needs a GPU, setup, and capacity planning
TGI	FP16/BF16, quantized depending on backend	HuggingFace ecosystem	Ops are more complex than a local demo
TensorRT-LLM	FP16/BF16/INT8/INT4 depending on pipeline	Optimized NVIDIA production	Complex build/deploy

The best solution is not fixed:

Laptop/offline: GGUF + llama.cpp/Ollama.
GPU production throughput: vLLM or TGI.
Deep NVIDIA optimization: TensorRT-LLM.
Quality-sensitive: start with FP16/BF16, quantize once you have an eval.

3. VRAM Estimation Cheat Sheet

Weights

weights_gb ~= params_billion * bytes_per_param

Precision	Bytes/param	7B rough	13B rough
FP32	4	28GB	52GB
FP16/BF16	2	14GB	26GB
INT8	1	7GB	13GB
INT4	0.5 + overhead	4-5.5GB	7.5-10GB

KV Cache

kv_cache_bytes ~= layers * 2 * kv_heads * head_dim * seq_len * concurrent_sequences * bytes_per_kv_element

A way to think about it:

If max context goes from 4k to 16k,
the KV cache grows roughly 4x.

If concurrency goes from 2 to 8,
the KV cache grows roughly 4x.

Safety Margin

Add a margin because of:

CUDA graph/cache/runtime allocation.
Tokenizer and HTTP process memory.
Fragmentation.
Batch scheduler.
Log/metrics buffers.
Framework overhead.

Practical rule: if the estimate is 7.5GB on an 8GB GPU, treat it as not enough. Reduce the context, reduce the concurrency, change the quantization, or pick a smaller model.

4. FastAPI Gateway Template

This template uses an OpenAI-compatible endpoint, so it fits Ollama /v1, the llama.cpp OpenAI-compatible server, the vLLM OpenAI server, or a TGI-compatible adapter if available.

Install

pip install -U fastapi uvicorn httpx pydantic psutil

Run

export LOCAL_LLM_BASE_URL=http://localhost:11434/v1
export LOCAL_LLM_API_KEY=local
export LOCAL_LLM_MODEL=llama3.2
export LOCAL_LLM_RUNTIME=ollama
export REQUEST_TIMEOUT_S=60
export MAX_CONCURRENCY=4

uvicorn app:app --host 0.0.0.0 --port 9000

`app.py`

from __future__ import annotations

import asyncio
import json
import logging
import os
import time
import uuid
from contextlib import asynccontextmanager
from typing import Any

import httpx
import psutil
from fastapi import FastAPI, HTTPException, Request
from pydantic import BaseModel, Field


logging.basicConfig(
    level=os.getenv("LOG_LEVEL", "INFO"),
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("local_model_api")


class Settings(BaseModel):
    runtime_base_url: str = os.getenv("LOCAL_LLM_BASE_URL", "http://localhost:11434/v1")
    api_key: str = os.getenv("LOCAL_LLM_API_KEY", "local")
    model: str = os.getenv("LOCAL_LLM_MODEL", "llama3.2")
    runtime: str = os.getenv("LOCAL_LLM_RUNTIME", "ollama")
    request_timeout_s: float = float(os.getenv("REQUEST_TIMEOUT_S", "60"))
    max_concurrency: int = int(os.getenv("MAX_CONCURRENCY", "4"))
    max_input_chars: int = int(os.getenv("MAX_INPUT_CHARS", "8000"))
    max_output_tokens: int = int(os.getenv("MAX_OUTPUT_TOKENS", "2048"))


settings = Settings()
semaphore = asyncio.Semaphore(settings.max_concurrency)
ready_state: dict[str, Any] = {"ready": False, "last_error": None}


class ChatRequest(BaseModel):
    message: str = Field(min_length=1)
    system: str = Field(default="You are a concise internal assistant.", max_length=2000)
    temperature: float = Field(default=0.2, ge=0.0, le=1.5)
    max_tokens: int = Field(default=512, ge=1)


class ChatResponse(BaseModel):
    answer: str
    model: str
    runtime: str
    latency_ms: float
    memory_rss_mb: float
    trace_id: str


class ErrorResponse(BaseModel):
    trace_id: str
    error: str


def validate_limits(req: ChatRequest) -> None:
    if len(req.message) > settings.max_input_chars:
        raise HTTPException(status_code=413, detail="message too large")
    if req.max_tokens > settings.max_output_tokens:
        raise HTTPException(status_code=422, detail="max_tokens exceeds server limit")


async def call_model(req: ChatRequest) -> str:
    headers = {"Authorization": f"Bearer {settings.api_key}"}
    payload: dict[str, Any] = {
        "model": settings.model,
        "messages": [
            {"role": "system", "content": req.system},
            {"role": "user", "content": req.message},
        ],
        "temperature": req.temperature,
        "max_tokens": req.max_tokens,
    }

    async with httpx.AsyncClient(timeout=httpx.Timeout(settings.request_timeout_s)) as client:
        response = await client.post(
            f"{settings.runtime_base_url.rstrip('/')}/chat/completions",
            headers=headers,
            json=payload,
        )
        response.raise_for_status()
        data = response.json()

    return data["choices"][0]["message"]["content"] or ""


@asynccontextmanager
async def lifespan(app: FastAPI):
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            response = await client.get(f"{settings.runtime_base_url.rstrip('/')}/models")
            response.raise_for_status()
        ready_state["ready"] = True
    except Exception as exc:
        ready_state["last_error"] = str(exc)
        logger.warning(json.dumps({"event": "model_runtime_not_ready", "error": str(exc)}))

    yield

    ready_state["ready"] = False


app = FastAPI(
    title="Local Model API",
    version="1.0.0",
    lifespan=lifespan,
)


@app.get("/health")
async def health() -> dict[str, str]:
    return {"status": "ok"}


@app.get("/ready")
async def ready() -> dict[str, Any]:
    if not ready_state["ready"]:
        raise HTTPException(status_code=503, detail=ready_state)
    return {
        "status": "ready",
        "model": settings.model,
        "runtime": settings.runtime,
        "max_concurrency": settings.max_concurrency,
    }


@app.post(
    "/chat",
    response_model=ChatResponse,
    responses={502: {"model": ErrorResponse}, 504: {"model": ErrorResponse}},
)
async def chat(req: ChatRequest, request: Request) -> ChatResponse:
    validate_limits(req)

    trace_id = request.headers.get("x-request-id", str(uuid.uuid4()))
    start = time.perf_counter()

    try:
        async with semaphore:
            answer = await asyncio.wait_for(
                call_model(req),
                timeout=settings.request_timeout_s + 2,
            )
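    except (asyncio.TimeoutError, httpx.TimeoutException) as exc:
        # Catch both: asyncio.TimeoutError is only the builtin TimeoutError on Python 3.11+,
        # and httpx raises its own TimeoutException before wait_for fires.
        raise HTTPException(
            status_code=504,
            detail={"trace_id": trace_id, "error": "model timeout"},
        ) from exc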
    except httpx.HTTPStatusError as exc:
        raise HTTPException(
            status_code=502,
            detail={"trace_id": trace_id, "error": f"runtime returned {exc.response.status_code}"},
        ) from exc
    except Exception as exc:
        logger.exception(json.dumps({"event": "chat_failed", "trace_id": trace_id}))
        raise HTTPException(
            status_code=500,
            detail={"trace_id": trace_id, "error": type(exc).__name__},
        ) from exc

    latency_ms = (time.perf_counter() - start) * 1000
    memory_rss_mb = psutil.Process().memory_info().rss / 1024 / 1024

    logger.info(
        json.dumps(
            {
                "event": "chat_completed",
                "trace_id": trace_id,
                "model": settings.model,
                "runtime": settings.runtime,
                "latency_ms": round(latency_ms, 2),
                "input_chars": len(req.message),
                "output_chars": len(answer),
                "max_tokens": req.max_tokens,
                "memory_rss_mb": round(memory_rss_mb, 2),
            },
            ensure_ascii=False,
        )
    )

    return ChatResponse(
        answer=answer,
        model=settings.model,
        runtime=settings.runtime,
        latency_ms=round(latency_ms, 2),
        memory_rss_mb=round(memory_rss_mb, 2),
        trace_id=trace_id,
    )

5. Benchmark Script

`benchmark_local_api.py`

from __future__ import annotations

import argparse
import asyncio
import statistics
import time
from dataclasses import dataclass

import httpx
import psutil


PROMPTS = [
    "Tóm tắt local LLM trong 3 bullet.",
    "So sánh INT8 và INT4 cho production serving.",
    "Trả lời JSON với keys: risks, metrics, rollback.",
    "Giải thích vì sao KV cache tăng theo context length.",
]


@dataclass
class Result:
    latency_ms: float
    ok: bool
    output_chars: int
    error: str | None = None


async def one_request(client: httpx.AsyncClient, url: str, prompt: str, max_tokens: int) -> Result:
    start = time.perf_counter()
    try:
        response = await client.post(
            url,
            json={"message": prompt, "max_tokens": max_tokens},
        )
        elapsed_ms = (time.perf_counter() - start) * 1000
        response.raise_for_status()
        data = response.json()
        return Result(latency_ms=elapsed_ms, ok=True, output_chars=len(data.get("answer", "")))
    except Exception as exc:
        elapsed_ms = (time.perf_counter() - start) * 1000
        return Result(latency_ms=elapsed_ms, ok=False, output_chars=0, error=type(exc).__name__)


async def run(url: str, concurrency: int, repeat: int, timeout_s: float, max_tokens: int) -> list[Result]:
    """Send every prompt `repeat` times with at most `concurrency` requests in flight."""
    prompts = PROMPTS * repeat
    limits = httpx.Limits(max_connections=concurrency, max_keepalive_connections=concurrency)
    gate = asyncio.Semaphore(concurrency)

    async with httpx.AsyncClient(timeout=timeout_s, limits=limits) as client:

        async def bounded(prompt: str) -> Result:
            # The latency timer starts inside one_request, so waiting at the gate is not counted.
            async with gate:
                return await one_request(client, url, prompt, max_tokens)

        # Return the results instead of appending to a module-level list.
        return list(await asyncio.gather(*(bounded(prompt) for prompt in prompts)))


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is a fraction such as 0.95."""
    if not values:
        return 0.0
    ordered = sorted(values)
    exact_rank = len(ordered) * pct
    # Round the rank up so small samples do not under-report p95/p99.
    rank = int(exact_rank) if exact_rank == int(exact_rank) else int(exact_rank) + 1
    index = max(0, min(len(ordered) - 1, rank - 1))
    return ordered[index]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--url", default="http://localhost:9000/chat")
    parser.add_argument("--concurrency", type=int, default=4)
    parser.add_argument("--repeat", type=int, default=5)
    parser.add_argument("--timeout-s", type=float, default=90)
    parser.add_argument("--max-tokens", type=int, default=300)
    args = parser.parse_args()

    process = psutil.Process()
    rss_before_mb = process.memory_info().rss / 1024 / 1024
    started = time.perf_counter()

    results: list[Result] = asyncio.run(
        run(args.url, args.concurrency, args.repeat, args.timeout_s, args.max_tokens)
    )

    total_s = time.perf_counter() - started
    rss_after_mb = process.memory_info().rss / 1024 / 1024
    ok_latencies = [r.latency_ms for r in results if r.ok]
    errors = [r.error for r in results if not r.ok]

    print(
        {
            "count": len(results),
            "ok": len(ok_latencies),
            "error_count": len(errors),
            "error_types": sorted(set(e for e in errors if e)),
            "total_s": round(total_s, 2),
            "requests_per_s": round(len(results) / total_s, 2) if total_s else 0,
            "p50_ms": round(statistics.median(ok_latencies), 2) if ok_latencies else 0,
            "p95_ms": round(percentile(ok_latencies, 0.95), 2),
            "p99_ms": round(percentile(ok_latencies, 0.99), 2),
            "avg_ms": round(statistics.mean(ok_latencies), 2) if ok_latencies else 0,
            "avg_output_chars": round(statistics.mean([r.output_chars for r in results if r.ok]), 2)
            if ok_latencies
            else 0,
            "client_rss_before_mb": round(rss_before_mb, 2),
            "client_rss_after_mb": round(rss_after_mb, 2),
        }
    )

Note: this script measures the memory of the benchmark client, not the VRAM of the model server. On a GPU, also record nvidia-smi output or metrics from the runtime.

6. Memory Measurement

CPU/RAM

ps -o pid,rss,comm -p <PID>
top -p <PID>

NVIDIA GPU

nvidia-smi --query-gpu=timestamp,name,memory.used,memory.total,utilization.gpu --format=csv -l 1

Ollama

ollama ps

Record memory at three points in time:

Before sending any request.
After warmup.
During the benchmark at the highest concurrency.
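If you want those three readings in a script rather than by eye, the nvidia-smi query can be wrapped in Python. This is a sketch assuming an NVIDIA GPU with nvidia-smi on PATH; the helper names are hypothetical, and the parser is kept separate so it can be exercised without a GPU.

```python
import subprocess


def parse_memory_used_mib(csv_output: str) -> list[int]:
    """Parse `--format=csv,noheader,nounits` output: one integer (MiB) per GPU line."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def read_gpu_memory_used_mib() -> list[int]:
    """Query used memory per GPU. Raises if nvidia-smi is missing or fails."""
    completed = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used_mib(completed.stdout)


def peak_per_gpu(samples: list[list[int]]) -> list[int]:
    """Given repeated readings, return the maximum seen for each GPU."""
    return [max(values) for values in zip(*samples)]
```

A typical use would be to call `read_gpu_memory_used_mib()` before the first request, after warmup, and repeatedly during the high-concurrency run, then pass the collected readings to `peak_per_gpu`.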

7. Quality Evaluation Checklist

The golden set should include:

Realistic Vietnamese prompts.
Long prompts close to the context limit.
Structured JSON output.
Questions that require citations if you use RAG.
Cases where the model must refuse if policy requires it.
Cases that are prone to hallucination.
Domain-specific cases.

Metrics:

Exact/schema validity for JSON.
Task accuracy for classification/extraction.
Human rating for long-form answers.
Citation correctness for RAG.
Regression count between FP16/BF16 and the quantized model.
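Two of these metrics are easy to automate. The sketch below checks JSON validity against a set of required keys and counts regressions between a baseline and a quantized run. It is a simplified assumption of what "schema validity" means (valid JSON object with the required top-level keys); a real product would likely validate against a full schema. The function names are hypothetical.

```python
import json


def is_valid_output(raw: str, required_keys: set[str]) -> bool:
    """True if raw parses as a JSON object containing every required top-level key."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def format_accuracy(outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of outputs that pass the validity check."""
    if not outputs:
        return 0.0
    return sum(is_valid_output(o, required_keys) for o in outputs) / len(outputs)


def regression_count(baseline_ok: list[bool], quantized_ok: list[bool]) -> int:
    """Number of cases that pass on the baseline but fail on the quantized model."""
    return sum(b and not q for b, q in zip(baseline_ok, quantized_ok))


keys = {"risks", "metrics", "rollback"}
outputs = [
    '{"risks": [], "metrics": [], "rollback": "pin previous version"}',  # valid
    '{"risks": [], "metrics": []}',                                      # missing key
    "Sure! Here is the JSON you asked for",                               # not JSON
]
print(format_accuracy(outputs, keys))
print(regression_count([True, True, False], [True, False, True]))
```

Counting regressions case by case is stricter than comparing averages: a quantized model can match the baseline score overall while breaking cases that used to work.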

Example decision rule:

Accept INT4 if:
- p95 latency drops or the model clearly fits in memory,
- format accuracy drops by no more than 1%,
- task accuracy drops by no more than 2%,
- critical hallucinations do not increase,
- error rate < 0.5% under the target benchmark load.

Real thresholds must follow the domain. For medical/legal/finance, regression thresholds must be much stricter and usually require human review.
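
The example rule can be written down as a check that reports which conditions fail. This is only a sketch: the `EvalSummary` fields and the `accept_int4` name are hypothetical, accuracies are fractions in [0, 1], and the 1% and 2% limits are read as percentage points, which is an assumption about the rule's intent.

```python
from dataclasses import dataclass

EPS = 1e-9  # tolerance so float rounding does not flip a borderline comparison


@dataclass
class EvalSummary:
    """Aggregated numbers for one model variant (hypothetical field names)."""
    p95_ms: float
    fits_memory: bool
    format_accuracy: float        # fraction in [0, 1]
    task_accuracy: float          # fraction in [0, 1]
    critical_hallucinations: int
    error_rate: float             # fraction in [0, 1]


def accept_int4(baseline: EvalSummary, int4: EvalSummary) -> list[str]:
    """Return the names of violated conditions; an empty list means 'accept'."""
    failures = []
    if not (int4.p95_ms < baseline.p95_ms or int4.fits_memory):
        failures.append("latency_or_memory")
    if baseline.format_accuracy - int4.format_accuracy > 0.01 + EPS:
        failures.append("format_accuracy")
    if baseline.task_accuracy - int4.task_accuracy > 0.02 + EPS:
        failures.append("task_accuracy")
    if int4.critical_hallucinations > baseline.critical_hallucinations:
        failures.append("critical_hallucination")
    if int4.error_rate >= 0.005:
        failures.append("error_rate")
    return failures


# Made-up numbers for illustration only.
fp16 = EvalSummary(900.0, False, 0.99, 0.90, 0, 0.001)
q4_good = EvalSummary(700.0, True, 0.985, 0.89, 0, 0.002)
q4_bad = EvalSummary(700.0, True, 0.95, 0.89, 1, 0.002)

print(accept_int4(fp16, q4_good))  # []
print(accept_int4(fp16, q4_bad))   # ['format_accuracy', 'critical_hallucination']
```

Returning the failed conditions rather than a bare boolean makes the result easy to paste into a decision document.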

8. Production Checklist

9. The Production Answer

Is it usable in production? Yes, if the local model API is operated as a real production service: a clear contract, benchmark results, quality evals, known capacity, defined timeouts, monitoring, a verified license, and a rollback path.

It is not usable in production if all you have is a quantized model that runs on a personal machine. "It runs" is different from "it withstands real traffic, fails in a controlled way, has measured quality, and can be rolled back".

Exercise

Objectives

You will build a FastAPI gateway for a local model, run latency/memory benchmarks, and write a production readiness decision.

Deliverables:

app.py: the FastAPI gateway.
benchmark_local_api.py: the benchmark script.
results.md: a table of latency and memory results, plus quality notes.
production_decision.md: answers "Is it usable in production? If so, under what conditions?"

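Part 1: Prepare the Runtime

Choose one of the following runtimes:

Option A: Ollama

ollama serve          # skip if Ollama is already running as a background service
ollama pull llama3.2  # run in a second terminal; pull needs a running server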

Gateway config:

export LOCAL_LLM_BASE_URL=http://localhost:11434/v1
export LOCAL_LLM_API_KEY=local
export LOCAL_LLM_MODEL=llama3.2
export LOCAL_LLM_RUNTIME=ollama

Option B: llama.cpp server

Conceptual example:

./llama-server -m model.gguf --host 0.0.0.0 --port 8080 --ctx-size 4096

Gateway config:

export LOCAL_LLM_BASE_URL=http://localhost:8080/v1
export LOCAL_LLM_API_KEY=local
export LOCAL_LLM_MODEL=local-gguf
export LOCAL_LLM_RUNTIME=llama.cpp

Option C: vLLM OpenAI server

Conceptual example:

python -m vllm.entrypoints.openai.api_server \
  --model <model-id> \
  --host 0.0.0.0 \
  --port 8000

Gateway config:

export LOCAL_LLM_BASE_URL=http://localhost:8000/v1
export LOCAL_LLM_API_KEY=local
export LOCAL_LLM_MODEL=<model-id>
export LOCAL_LLM_RUNTIME=vllm
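
On the gateway side, the four variables can be read once at startup and validated before the first request. A minimal sketch follows; the `RuntimeConfig` class, the `load_runtime_config` helper, and the fallback values `local` and `unknown` are assumptions for illustration, not part of the template in document.md.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RuntimeConfig:
    base_url: str
    api_key: str
    model: str
    runtime: str

    @property
    def chat_completions_url(self) -> str:
        # All three options above serve an OpenAI-compatible API under /v1.
        return self.base_url.rstrip("/") + "/chat/completions"


def load_runtime_config(env: Mapping[str, str] = os.environ) -> RuntimeConfig:
    """Read the LOCAL_LLM_* variables; fail fast if a required one is missing."""
    missing = [k for k in ("LOCAL_LLM_BASE_URL", "LOCAL_LLM_MODEL") if not env.get(k)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return RuntimeConfig(
        base_url=env["LOCAL_LLM_BASE_URL"],
        api_key=env.get("LOCAL_LLM_API_KEY", "local"),   # assumed default
        model=env["LOCAL_LLM_MODEL"],
        runtime=env.get("LOCAL_LLM_RUNTIME", "unknown"),  # assumed default
    )


# Passing a plain dict keeps the example independent of the real environment.
cfg = load_runtime_config({
    "LOCAL_LLM_BASE_URL": "http://localhost:11434/v1",
    "LOCAL_LLM_MODEL": "llama3.2",
    "LOCAL_LLM_RUNTIME": "ollama",
})
print(cfg.chat_completions_url)
```

Because only the environment changes between Option A, B, and C, the same gateway code can be benchmarked against each runtime.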

Part 2: Build the FastAPI Gateway

Install dependencies:

pip install -U fastapi uvicorn httpx pydantic psutil

Create app.py based on the template in document.md.

Mandatory requirements:

POST /chat accepts message, system, temperature, max_tokens.
The response contains answer, model, runtime, latency_ms, memory_rss_mb, trace_id.
GET /health returns process health.
GET /ready checks runtime/model readiness.
A timeout is enforced.
A concurrency limit is enforced.
Max input length and max output tokens are enforced.
Logs are structured and never contain the raw prompt.
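
The timeout and concurrency-limit requirements can be prototyped without FastAPI, using only asyncio. The sketch below is one possible shape, not the template from document.md: `guarded_call`, the two exception classes, and the constant values are hypothetical, and the suggested HTTP status codes are only a common convention.

```python
import asyncio
from typing import Awaitable, Callable

MAX_CONCURRENCY = 2      # assumed value; tune to your hardware
REQUEST_TIMEOUT_S = 0.5  # assumed value; real model calls need far longer


class GatewayBusy(Exception):
    """All slots are taken (a gateway might map this to HTTP 429 or 503)."""


class GatewayTimeout(Exception):
    """The runtime call exceeded its deadline (a gateway might map this to HTTP 504)."""


async def guarded_call(
    sem: asyncio.Semaphore,
    call: Callable[[], Awaitable[str]],
    timeout_s: float = REQUEST_TIMEOUT_S,
) -> str:
    """Run one runtime call under a concurrency limit and a deadline."""
    if sem.locked():
        # Reject instead of queueing so latency stays bounded under overload.
        raise GatewayBusy("too many in-flight requests")
    async with sem:
        try:
            return await asyncio.wait_for(call(), timeout=timeout_s)
        except asyncio.TimeoutError as exc:
            raise GatewayTimeout(f"runtime call exceeded {timeout_s}s") from exc


async def demo() -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def fast() -> str:  # stands in for a quick model call
        await asyncio.sleep(0.01)
        return "ok"

    async def slow() -> str:  # stands in for a stuck model call
        await asyncio.sleep(5)
        return "never returned"

    # Two calls fill both slots; the third is rejected immediately.
    results = await asyncio.gather(
        guarded_call(sem, fast), guarded_call(sem, fast), guarded_call(sem, fast),
        return_exceptions=True,
    )
    outcomes = [r if isinstance(r, str) else type(r).__name__ for r in results]
    try:
        await guarded_call(sem, slow, timeout_s=0.05)
    except GatewayTimeout:
        outcomes.append("GatewayTimeout")
    return outcomes


print(asyncio.run(demo()))  # ['ok', 'ok', 'GatewayBusy', 'GatewayTimeout']
```

Rejecting when the semaphore is full is a design choice; queueing with a bounded wait is an alternative if short bursts should be absorbed rather than refused.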

Run:

uvicorn app:app --host 0.0.0.0 --port 9000

Smoke test:

curl http://localhost:9000/health
curl http://localhost:9000/ready
curl -X POST http://localhost:9000/chat \
  -H 'content-type: application/json' \
  -d '{"message":"Giải thích INT4 trong 3 bullet.","max_tokens":200}'

Part 3: Benchmark Latency

Create benchmark_local_api.py based on the template in document.md.

Run three configurations:

python benchmark_local_api.py --concurrency 1 --repeat 5 --max-tokens 200
python benchmark_local_api.py --concurrency 4 --repeat 5 --max-tokens 300
python benchmark_local_api.py --concurrency 8 --repeat 5 --max-tokens 300

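If your machine is weak, lower the concurrency levels to 1, 2, 4.

Record in results.md:

| Config | p50 ms | p95 ms | p99 ms | req/s | error rate | RAM/VRAM peak | Note |
| --- | --- | --- | --- | --- | --- | --- | --- |
| concurrency=1 | | | | | | | |
| concurrency=4 | | | | | | | |
| concurrency=8 | | | | | | | |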
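
When reading the p95 and p99 columns, keep the sample size in mind. The sketch below uses a nearest-rank percentile, which may differ from the `percentile` helper in the benchmark template, and made-up latencies; it shows that with only a handful of samples both tail percentiles collapse to the slowest request.

```python
import math


def nearest_rank_percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile: the value at 1-based rank ceil(q * n) after sorting."""
    if not values:
        raise ValueError("no samples")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered))
    return ordered[rank - 1]


latencies_ms = [820.0, 790.0, 905.0, 1310.0, 845.0]  # invented sample of 5 requests

p50 = nearest_rank_percentile(latencies_ms, 0.50)
p95 = nearest_rank_percentile(latencies_ms, 0.95)
p99 = nearest_rank_percentile(latencies_ms, 0.99)
print(p50, p95, p99)  # 845.0 1310.0 1310.0
```

If a configuration produces very few successful requests, consider raising `--repeat` before drawing conclusions from p99.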

Part 4: Benchmark Memory

Before the benchmark:

ollama ps
nvidia-smi

During the benchmark:

nvidia-smi --query-gpu=timestamp,name,memory.used,memory.total,utilization.gpu --format=csv -l 1

If you have no GPU:

ps -o pid,rss,comm -p <PID>
top -p <PID>

Record:

RAM before warmup.
RAM after warmup.
Peak RAM/VRAM at the highest concurrency.
Whether any OOM or timeout occurred.
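
For the CPU-only path, the `ps` reading can be captured from a script so the three RAM checkpoints end up in results.md with consistent units. This is a sketch under assumptions: it uses `ps -o rss= -p <PID>` (the trailing `=` suppresses the header), treats the value as kilobytes, and the helper names are hypothetical.

```python
import subprocess


def parse_ps_rss_mb(ps_output: str) -> float:
    """Convert the output of `ps -o rss= -p <PID>` (kilobytes) to megabytes."""
    return int(ps_output.strip()) / 1024


def read_rss_mb(pid: int) -> float:
    """Ask ps for the resident set size of one process (Linux/macOS)."""
    out = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps_rss_mb(out)


# Parsing demo with an invented ps output; call read_rss_mb(<PID>) with the
# model server's PID before warmup, after warmup, and at peak concurrency.
print(parse_ps_rss_mb("  123456\n"))  # 120.5625
```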

Part 5: Quality Regression Mini Eval

Create 10 realistic prompts for your use case:

3 Vietnamese question-answering prompts.
2 prompts that require JSON.
2 long prompts near your real context length.
1 domain-specific prompt.
1 prompt prone to hallucination.
1 prompt that requires a refusal when information is missing.

If you have two models/formats, for example FP16 vs INT4 or Q4 vs Q5, run the same 10 prompts against both and grade:

| Prompt | Baseline pass/fail | Quantized pass/fail | Error if failed |
| --- | --- | --- | --- |
| 1 | | | |
| 2 | | | |

The quality note must answer:

Does the JSON parse?
Do the Vietnamese answers read naturally?
Are there any serious hallucinations?
Does the model break instructions or violate policy?
Is the regression acceptable for the use case?
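
Two of these questions can be checked mechanically: whether JSON output parses, and which prompts pass on the baseline but fail on the quantized model. A minimal sketch, with hypothetical helper names and invented grading data:

```python
import json


def json_parses(text: str) -> bool:
    """True if the model output is valid JSON with nothing around it."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def regressions(baseline: dict[int, bool], quantized: dict[int, bool]) -> list[int]:
    """Prompt ids that pass on the baseline but fail (or are missing) on the quantized model."""
    return sorted(pid for pid, ok in baseline.items() if ok and not quantized.get(pid, False))


# Invented pass/fail grades keyed by prompt number.
baseline_grades = {1: True, 2: True, 3: False, 4: True}
quantized_grades = {1: True, 2: False, 3: False, 4: True}

print(json_parses('{"label": "positive"}'))          # True
print(json_parses('Sure! {"label": "positive"}'))    # False
print(regressions(baseline_grades, quantized_grades))  # [2]
```

Naturalness of Vietnamese answers and hallucination severity still need a human grader; only the pass/fail bookkeeping is automated here.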

Part 6: Production Decision

Create production_decision.md from this template:

# Production Decision

## Context

- Use case:
- Traffic target:
- SLA:
- Hardware:
- Runtime:
- Model:
- Quantization:

## Benchmark Summary

- p50:
- p95:
- p99:
- req/s:
- RAM/VRAM peak:
- error rate:

## Quality Summary

- Golden set size:
- Pass rate baseline:
- Pass rate quantized:
- Regression:

## Decision

Is it usable in production?

## Conditions

- Condition 1:
- Condition 2:
- Condition 3:

## Rollback/Fallback

- On timeout:
- On OOM:
- On quality regression:

Decision hints:

If p95 misses the SLA: not production-ready; you need a smaller model, a different quantization, a different runtime, or a better GPU.
If memory is close to the limit: not production-ready for real traffic; reduce context/concurrency or add hardware.
If quality regression is high: not production-ready; you need a better model, INT8/Q5/Q6 instead of INT4, or a fallback.
If only observability/auth/rate limiting is missing: an internal canary is acceptable, but not public production.
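
The hints above form an ordered chain of checks, which can be sketched as a small function. Everything here is illustrative: the function name, the 0.9 memory threshold, and the zero-regression tolerance are assumptions, and the final "candidate" outcome is not a guarantee of readiness.

```python
def production_hint(
    p95_ms: float,
    sla_p95_ms: float,
    peak_memory_fraction: float,        # peak RAM/VRAM divided by capacity
    regression_count: int,
    has_observability_auth_rate_limit: bool,
    max_memory_fraction: float = 0.9,   # assumed headroom threshold
    max_regressions: int = 0,           # assumed tolerance
) -> str:
    """Map benchmark and eval numbers to one of the decision hints."""
    if p95_ms > sla_p95_ms:
        return "not production: p95 misses SLA"
    if peak_memory_fraction > max_memory_fraction:
        return "not production: memory too close to the limit"
    if regression_count > max_regressions:
        return "not production: quality regression too high"
    if not has_observability_auth_rate_limit:
        return "internal canary only"
    return "candidate for production"


# Invented numbers for illustration.
print(production_hint(1800.0, 1500.0, 0.70, 0, True))
print(production_hint(1200.0, 1500.0, 0.70, 0, False))
```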
