Day 27: LoRA/QLoRA Hands-on

Goals

After this lesson, you should be able to:

Understand PEFT, LoRA, and QLoRA well enough to make engineering decisions.
Fine-tune a small causal language model with Hugging Face transformers, datasets, peft, trl, accelerate, and bitsandbytes.
Choose r, lora_alpha, target_modules, lora_dropout, batch size, gradient accumulation, and max_length based on VRAM, cost, and performance.
Save an adapter, load it back, run an inference sanity check, and merge the LoRA weights when a single artifact is needed.
Decide when to use LoRA, QLoRA, full fine-tuning, prompt engineering, or RAG.
Answer clearly whether the result can go to production and, if so, under what conditions.

TL;DR

LoRA does not retrain the whole model. It freezes the base model and trains only small low-rank adapters attached to selected linear layers. QLoRA goes one step further: the base model is loaded with 4-bit quantization to cut VRAM, while the LoRA adapters are still trained in a higher precision such as bf16/fp16.

For production, LoRA/QLoRA fits when you want the model to be more consistent in format, tone, workflow, or domain behavior. Do not fine-tune to inject knowledge that changes often; RAG or tool calling usually suits that case better.

1. Where This Lesson Sits in Phase 4

Day 25: decide when to fine-tune and when to use RAG
Day 26: prepare the instruction-tuning dataset
Day 27: run LoRA/QLoRA hands-on
Day 28: evaluate before/after fine-tuning
Day 29-30: local LLM and deployment

Day 27 is not just about "getting training to run". The real goal is a training pipeline you can control: a dataset with a clear schema, a training config with a seed, an adapter artifact with metadata, a working inference test, and a clear view of the trade-offs before you merge or deploy.

2. Problem Framing

The hands-on problem:

Input: a user instruction in the customer support domain
Output: a short answer in valid JSON, with the right tone and a next action

Example of the desired output:

{
  "category": "billing",
  "priority": "high",
  "answer": "I have noted the double-charge issue. Please share the transaction ID so I can check it and issue a refund if an error occurred."
}

Before training, settle these questions:

Which behavior is the fine-tune meant to fix: JSON format, tone, classification labels, policy wording, or workflow?
How does the current baseline fail, and how often on the golden set?
Has the dataset been split into train/validation/test?
Does the output follow a schema that can be parsed?
Do the facts change often? If so, use RAG or tools instead of trying to train them in.
Is there enough GPU for the model size, sequence length, and batch size?
Are there license, commercial-use, or PII constraints?

3. What PEFT Is

PEFT stands for Parameter-Efficient Fine-Tuning: instead of updating every parameter of the model, you update only a very small subset. LoRA is a popular PEFT technique for LLMs.

Full fine-tuning:

W_base -> update most or all weights

LoRA:

W_base frozen
W_runtime = W_base + delta_adapter
delta_adapter = B @ A, with low rank r (B is d_out x r, A is r x d_in; scaled by lora_alpha / r, see lora_alpha below)

What this means in practice:

The base model stays unchanged, so rollback is easy.
The adapter is far smaller than the base model, so it is easy to version and upload.
Training is faster and cheaper than full fine-tuning.
Capacity is bounded by the adapter, so it cannot change the model as deeply as full fine-tuning.
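
To make the size difference concrete, here is a minimal sketch that counts the trainable parameters LoRA adds to one linear layer and compares them with the full weight matrix. The 4096 x 4096 layer shape is an assumed, illustrative value and is not taken from any specific model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix itself (bias ignored)."""
    return d_in * d_out


if __name__ == "__main__":
    d = 4096  # hypothetical hidden size, for illustration only
    for r in (8, 16, 64):
        lora = lora_param_count(d, d, r)
        full = full_param_count(d, d)
        print(f"r={r}: {lora:,} LoRA params vs {full:,} full ({lora / full:.2%})")
```

The count grows linearly with r, which is why doubling the rank roughly doubles the adapter size while the frozen matrix stays the same.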

4. LoRA Config Step by Step

`r`: LoRA rank

r is the rank of the low-rank adapter. A higher rank gives the adapter more capacity but also more trainable parameters.

Practical starting points:

Context	Suggested `r`	Why
Small dataset, simple format/tone	4-8	Less overfitting, saves VRAM
Medium dataset, narrow support or code-writing task	16	A good default to start from
Complex task, many styles/domains	32-64	More capacity, needs careful eval

Do not pick a high r just because it "sounds stronger". If validation loss gets worse, outputs drift, or the adapter becomes too heavy, lower r.

`lora_alpha`

lora_alpha is the scaling factor for the adapter. A simple way to think about it:

adapter_effect ~= lora_alpha / r

Commonly used defaults:

r=8, lora_alpha=16
r=16, lora_alpha=32
r=32, lora_alpha=64

All three keep lora_alpha / r = 2, so the adapter scale stays the same as r changes. If outputs change too strongly, the style becomes too rigid, or the model loses generality, lower the learning rate first, then consider lowering alpha.

`target_modules`

target_modules decides which layers get a LoRA adapter.

Common choices:

["q_proj", "v_proj"]: few parameters, fast, usually enough for light tasks.
["q_proj", "k_proj", "v_proj", "o_proj"]: adds the key and attention output projections, more capacity.
["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]: stronger for LLaMA/Qwen-style models, uses more VRAM.
"all-linear": convenient for experiments, but check the model architecture and VRAM first.

Best solution by context: start with the attention modules (q_proj and v_proj, or all attention projections), and extend to the MLP projections or all-linear only when eval shows the adapter is not learning the behavior well enough.
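
The sketch below shows how a list of target_modules selects layers by name. It assumes suffix matching on dotted module names, which is my understanding of how PEFT treats a list of targets; the module names are hypothetical and written in the style of a LLaMA/Qwen decoder layer. For a real model, take the names from model.named_modules() instead.

```python
def matches_target(module_name: str, target_modules: list[str]) -> bool:
    """True if the module name equals a target or ends with '.<target>'."""
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in target_modules
    )


def select_modules(names: list[str], target_modules: list[str]) -> list[str]:
    """Return the module names that would receive a LoRA adapter."""
    return [name for name in names if matches_target(name, target_modules)]


# Hypothetical names for one decoder layer plus the output head.
MODULE_NAMES = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]

if __name__ == "__main__":
    for targets in (["q_proj", "v_proj"], ["q_proj", "k_proj", "v_proj", "o_proj"]):
        print(targets, "->", select_modules(MODULE_NAMES, targets))
```

Printing the selected names before training is a cheap way to catch a target list that matches nothing on a model with different layer names.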

`lora_dropout`

lora_dropout adds regularization, which matters most on small datasets.

Suggested values:

0.0: large, clean dataset where you want maximum signal.
0.05: a good default for most hands-on runs.
0.1: small dataset or signs of overfitting.

Too much dropout can slow learning and make the output format unstable.

`bias`

For many causal LM LoRA use cases, bias="none" is a good choice: it keeps the trainable parameter count down and reduces the risk of unintended changes.

5. QLoRA and 4-bit Quantization

QLoRA uses quantization to load the base model in 4-bit, usually with NF4 and double quantization:

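import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # store the base weights in 4-bit
    bnb_4bit_quant_type="nf4",  # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the actual matmuls
)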

Mental model:

Base model: 4-bit, frozen, saves VRAM
LoRA adapter: trainable, higher precision
Optimizer: state is kept only for the adapter parameters, so it needs far less memory than full fine-tuning

Trade-offs:

Option	Pros	Cons
LoRA fp16/bf16	Faster, less complexity	Needs more VRAM than QLoRA
QLoRA 4-bit	Fits a larger model on a small GPU	Depends on CUDA/bitsandbytes, can be slower
Full fine-tuning	Highest capacity	Expensive, hard to roll back, needs more data/GPU

Production note: QLoRA is usually a training-time optimization. At serving time you can keep the adapter separate or merge it into the base model at a suitable precision, depending on the serving stack.
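
A rough, weights-only estimate shows why 4-bit loading matters. The sketch below assumes a 7B-parameter model as an example and ignores activations, LoRA adapters, optimizer state, quantization constants, and CUDA overhead, so real usage will be higher than these numbers.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    num_params = 7e9  # assumed model size, for illustration only
    for label, bits in (("fp32", 32), ("bf16/fp16", 16), ("4-bit", 4)):
        print(f"{label:>9}: ~{weight_memory_gb(num_params, bits):.1f} GB of weights")
```

Treat the result as a lower bound when checking whether a model can fit on a given GPU; sequence length and batch size add to it during training.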

6. Choosing a Model and Hardware

Model size	Suggested hardware	Notes
0.5B-1B	CPU (very slow) or a small GPU	Good for learning the pipeline
1B-3B	Colab T4/A10, 12-24GB GPU	Good for LoRA/QLoRA hands-on
7B-8B	16-24GB GPU with QLoRA	Needs a small batch and gradient accumulation
13B+	Large GPU or multi-GPU	Not a fit for a short lesson without existing infra

Suggested models for this lesson:

Qwen/Qwen2.5-0.5B-Instruct: fast, good for testing the pipeline.
Qwen/Qwen2.5-1.5B-Instruct: a better balance if your GPU can handle it.
LLaMA-compatible 7B/8B model: only when license, GPU, and time allow.

Always read the model card before use: license, intended use, language support, safety, commercial use, trust_remote_code, context length, and tokenizer behavior.

7. Training Pipeline

Standard pipeline:

JSONL dataset
  -> validate schema
  -> train/validation split
  -> load tokenizer
  -> load base model
  -> optional 4-bit quantization
  -> prepare_model_for_kbit_training if QLoRA
  -> attach LoRA adapter
  -> train with SFTTrainer
  -> save adapter + tokenizer + metadata
  -> inference sanity check
  -> optional merge adapter
  -> compare before/after on Day 28

The dataset should use conversational messages so it matches the chat model:

{"messages":[{"role":"user","content":"A customer was charged twice. How should we reply?"},{"role":"assistant","content":"{\"category\":\"billing\",\"priority\":\"high\",\"answer\":\"I have noted the double-charge issue. Please share the transaction ID so I can check it and issue a refund if an error occurred.\"}"}]}

Minimum validation:

Each line is a JSON object.
It has a messages key.
messages is a non-empty list.
Each message has role and content.
role is one of system, user, assistant.
There is at least one user and one assistant message.
The assistant output parses as JSON if downstream requires JSON.
No raw PII such as email, phone, token, or card number unless it has been approved.

8. Colab Path and Local GPU Path

Colab

A good fit when you do not have a local GPU yet.

pip install -U torch transformers datasets accelerate peft trl bitsandbytes

Colab checklist:

Select a GPU runtime.
Run nvidia-smi to check the available VRAM.
Start with a 0.5B-1.5B model.
Mount Google Drive if you need to keep artifacts long term.
Do not upload datasets containing PII to a personal notebook without permission.

Local GPU

A good fit when you need control over data/privacy or plan to train repeatedly.

python -m venv .venv
source .venv/bin/activate
pip install -U torch transformers datasets accelerate peft trl bitsandbytes
accelerate config

Local checklist:

Driver/CUDA versions are compatible with PyTorch and bitsandbytes.
nvidia-smi shows the GPU.
There is enough disk for the base model cache and artifacts.
Log package versions so the run can be reproduced.

9. Performance, VRAM, and Cost

Variables with the largest impact:

Model size: 7B costs far more than 1.5B.
max_length: 2048 usually costs noticeably more than 1024.
Batch size: a larger batch uses more VRAM.
Gradient accumulation: raises the effective batch size with little extra VRAM, at the cost of more forward/backward passes per optimizer step.
target_modules: more trainable modules cost more.
QLoRA: lowers VRAM, can be slower than LoRA bf16/fp16.
Packing: raises throughput on short sequences, but you need to understand its effect on loss masking and your data.

Effective batch size:

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
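
The formula above can be sketched as a small helper, together with an approximate number of optimizer steps per epoch. The example values (batch size 1, accumulation 8, one GPU, 54 training examples) are illustrative assumptions, and the exact handling of the last partial batch depends on the trainer version.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Examples that contribute to one optimizer step."""
    return per_device * grad_accum * num_gpus


def optimizer_steps_per_epoch(
    num_examples: int, per_device: int, grad_accum: int, num_gpus: int = 1
) -> int:
    """Approximate optimizer steps per epoch, rounding the last partial step up."""
    return math.ceil(num_examples / effective_batch_size(per_device, grad_accum, num_gpus))


if __name__ == "__main__":
    print("effective batch:", effective_batch_size(1, 8, 1))
    print("steps per epoch:", optimizer_steps_per_epoch(54, 1, 8, 1))
```

On a tiny dataset this number is small, so settings such as eval_steps and logging_steps should be checked against it or they may never trigger.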

Cost rules of thumb:

If prompting or RAG solves the problem at acceptable latency/cost, you do not need to fine-tune yet.
If traffic is high and the task is narrow, fine-tuning a small model can lower inference cost.
If the dataset is not clean, cheap GPUs will not rescue quality.

10. Merge the Weights or Keep the Adapter Separate

Serving option	Use when	Trade-off
Keep the adapter separate	You need fast rollback, A/B tests, or multi-domain adapters	The serving stack must support adapters
Merge the adapter into the base	You need a single artifact, a simple runtime, and no adapter overhead	Larger artifact, no flexible adapter swapping

Merging with merge_and_unload() produces a plain model with no PEFT wrapper. After merging, always rerun the sanity check and eval, because the artifact now goes through a different load path than base plus adapter.
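
To see what merging means numerically, here is a toy example in plain Python. It is a sketch of the idea and not PEFT's implementation: the matrices, rank, and alpha are made-up values, and it checks that folding the scaled B @ A into W gives the same output as running the adapter path separately.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a: Matrix, s: float) -> Matrix:
    return [[x * s for x in row] for row in a]


def adapter_forward(W: Matrix, A: Matrix, B: Matrix, scaling: float, x: Matrix) -> Matrix:
    """Unmerged path: W x + scaling * B (A x)."""
    return add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))


def merged_weight(W: Matrix, A: Matrix, B: Matrix, scaling: float) -> Matrix:
    """Merged path: a single matrix W + scaling * B A."""
    return add(W, scale(matmul(B, A), scaling))


# Made-up toy values: d_out = d_in = 2, rank r = 1, lora_alpha = 2.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -1.0]]      # shape (r, d_in)
B = [[2.0], [1.0]]     # shape (d_out, r)
SCALING = 2 / 1        # lora_alpha / r
x = [[1.0], [2.0]]     # one input column vector

if __name__ == "__main__":
    print("adapter path:", adapter_forward(W, A, B, SCALING, x))
    print("merged path: ", matmul(merged_weight(W, A, B, SCALING), x))
```

With real models the two paths can differ slightly because of dtype and quantization, which is one reason the sanity check and eval should be rerun on the merged artifact.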

11. Can It Be Used in Production?

Yes, LoRA/QLoRA can be used in production if the following conditions are met:

A baseline and a golden eval set exist before training.
The dataset is clean, you have the right to use it, PII has been handled, and it has a train/validation/test split.
The fine-tuning goal is behavior/format/tone/workflow, not facts that keep changing.
The artifact is fully versioned: base model id, revision, tokenizer, LoRA config, seed, package versions, dataset version, training command, hardware.
There is an inference sanity check, regression eval, safety eval, and a rollback plan.
The licenses of the base model, dataset, and adapter allow production/commercial use.
The serving path has been benchmarked for latency, throughput, VRAM/RAM, and cost.
Monitoring is in place after deploy: format accuracy, refusal/safety, hallucination proxy, user feedback, error rate, and drift.

Do not ship to production if all you have is a falling train loss, with no independent eval or no license/privacy check.
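
One simple way to record a "dataset version" is a content hash of the training file. The sketch below is an assumption about how you might do it, not something the lesson's scripts already do; the helper name file_sha256 is hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    data_path = Path("data/day27_support_sft.jsonl")
    if data_path.exists():
        print(f"dataset_sha256={file_sha256(data_path)}")
    else:
        print(f"{data_path} not found, nothing to hash")
```

Storing this digest next to the seed and package versions makes it possible to tell later whether two adapters were trained on byte-identical data.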

12. End-of-Lesson Checklist

Reference Material

1. Recommended Project Layout

day27/
  data/
    support_sft.jsonl
  scripts/
    train_lora_sft.py
    infer_adapter.py
    merge_adapter.py
  artifacts/
    support-lora-v1/
      adapter_config.json
      adapter_model.safetensors
      tokenizer.json
      training_metadata.json

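In this learning repo you can put the scripts wherever you like. What matters is that the artifact carries enough information for someone else to reproduce it. Note that the scripts in this lesson default to data/day27_support_sft.jsonl and artifacts/day27_support_lora_v1; set DATA_PATH and OUT_DIR if you follow the layout above.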

2. Dataset Contract

Each JSONL line:

{"messages":[{"role":"user","content":"A customer wants a refund after being charged twice."},{"role":"assistant","content":"{\"category\":\"billing\",\"priority\":\"high\",\"answer\":\"I have noted the refund request for the double charge. Please share the transaction ID so I can check it and take it forward.\"}"}]}

Requirements:

The file is UTF-8 JSONL, one example per line.
Each example has messages.
Valid roles: system, user, assistant.
There is at least one user and one assistant message.
If the assistant must return JSON, content must parse with json.loads.
Split off the validation set with a fixed seed.
Do not mix test examples into train.
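
The last two requirements can be checked mechanically. The sketch below is a minimal, assumed approach that treats two examples as the same only when their messages are exactly identical; the helper names are hypothetical, and near-duplicates (rephrased prompts) would need a fuzzier comparison.

```python
import json
from typing import Any

Row = dict[str, Any]


def example_key(row: Row) -> str:
    """Canonical string for one example, used to spot exact duplicates."""
    return json.dumps(row["messages"], sort_keys=True, ensure_ascii=False)


def deduplicate(rows: list[Row]) -> list[Row]:
    """Keep the first occurrence of each example, preserving order."""
    seen: set[str] = set()
    unique: list[Row] = []
    for row in rows:
        key = example_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def find_leaks(train_rows: list[Row], test_rows: list[Row]) -> list[Row]:
    """Return the test examples that also appear verbatim in train."""
    train_keys = {example_key(row) for row in train_rows}
    return [row for row in test_rows if example_key(row) in train_keys]
```

Running deduplicate before the split and find_leaks after it keeps exact copies from landing on both sides, which would otherwise make eval loss look better than it is.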

3. Near-Production Training Script

The script below prioritizes clarity and reproducibility. For a large dataset, split it into real .py files, add logging/MLflow/W&B, and keep a separate evaluation script for Day 28.

from __future__ import annotations

import json
import os
import random
import re
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import torch
from datasets import Dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed
from trl import SFTConfig, SFTTrainer


@dataclass(frozen=True)
class TrainConfig:
    model_id: str = os.getenv("MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
    data_path: str = os.getenv("DATA_PATH", "data/day27_support_sft.jsonl")
    output_dir: str = os.getenv("OUT_DIR", "artifacts/day27_support_lora_v1")
    seed: int = int(os.getenv("SEED", "42"))
    use_qlora: bool = os.getenv("USE_QLORA", "1") == "1"
    max_length: int = int(os.getenv("MAX_LENGTH", "1024"))
    eval_ratio: float = float(os.getenv("EVAL_RATIO", "0.1"))
    num_train_epochs: float = float(os.getenv("EPOCHS", "1"))
    learning_rate: float = float(os.getenv("LR", "0.0002"))
    per_device_train_batch_size: int = int(os.getenv("BATCH_SIZE", "1"))
    gradient_accumulation_steps: int = int(os.getenv("GRAD_ACCUM", "8"))
    lora_r: int = int(os.getenv("LORA_R", "16"))
    lora_alpha: int = int(os.getenv("LORA_ALPHA", "32"))
    lora_dropout: float = float(os.getenv("LORA_DROPOUT", "0.05"))


EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"(\+?\d[\d .-]{8,}\d)")


def sample_rows() -> list[dict[str, Any]]:
    base = [
        {
            "messages": [
                {"role": "user", "content": "A customer was charged twice."},
                {
                    "role": "assistant",
                    "content": json.dumps(
                        {
                            "category": "billing",
                            "priority": "high",
                            "answer": "I have noted the double-charge issue. Please share the transaction ID so I can check it and issue a refund if an error occurred.",
                        },
                        ensure_ascii=False,
                    ),
                },
            ]
        },
        {
            "messages": [
                {"role": "user", "content": "A customer asks when a failed delivery will be redelivered."},
                {
                    "role": "assistant",
                    "content": json.dumps(
                        {
                            "category": "shipping",
                            "priority": "medium",
                            "answer": "I have noted the failed delivery. Please share the order ID so I can check the earliest redelivery slot.",
                        },
                        ensure_ascii=False,
                    ),
                },
            ]
        },
    ]
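    # Exact copies are only enough to smoke-test the pipeline: the same rows land in
    # both train and eval, so eval loss on this fallback data is not meaningful.
    return base * 30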


def load_jsonl(path: Path) -> list[dict[str, Any]]:
    if not path.exists():
        return sample_rows()

    rows: list[dict[str, Any]] = []
    for line_no, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
        if not line.strip():
            continue
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"{path}:{line_no} is not valid JSON") from exc
    return rows


def validate_row(row: dict[str, Any], index: int) -> None:
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError(f"row {index}: messages must be a non-empty list")

    roles = []
    for message in messages:
        if not isinstance(message, dict):
            raise ValueError(f"row {index}: message must be an object")
        role = message.get("role")
        content = message.get("content")
        if role not in {"system", "user", "assistant"}:
            raise ValueError(f"row {index}: invalid role {role!r}")
        if not isinstance(content, str) or not content.strip():
            raise ValueError(f"row {index}: content must be a non-empty string")
        if EMAIL_RE.search(content) or PHONE_RE.search(content):
            raise ValueError(f"row {index}: possible PII detected")
        roles.append(role)

    if "user" not in roles or "assistant" not in roles:
        raise ValueError(f"row {index}: must include at least one user and one assistant message")

    assistant_content = next(message["content"] for message in reversed(messages) if message["role"] == "assistant")
    try:
        parsed = json.loads(assistant_content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"row {index}: assistant content must be JSON") from exc

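    required = {"category", "priority", "answer"}
    # json.loads can also return a list, string, or number, so check the type first.
    if not isinstance(parsed, dict) or set(parsed) != required:
        raise ValueError(f"row {index}: assistant JSON keys must be {sorted(required)}")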


def split_rows(rows: list[dict[str, Any]], eval_ratio: float, seed: int) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    if len(rows) < 10:
        raise ValueError("Need at least 10 examples for a meaningful train/eval split")
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    eval_size = max(1, int(len(shuffled) * eval_ratio))
    return shuffled[eval_size:], shuffled[:eval_size]


def load_model_and_tokenizer(config: TrainConfig):
    tokenizer = AutoTokenizer.from_pretrained(config.model_id, use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    quantization_config = None
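    if config.use_qlora:
        # Fail early with a clear message instead of a bitsandbytes error on CPU-only machines.
        if not torch.cuda.is_available():
            raise RuntimeError("USE_QLORA=1 expects a CUDA GPU; set USE_QLORA=0 to run plain LoRA")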
        compute_dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=compute_dtype,
        )

    model = AutoModelForCausalLM.from_pretrained(
        config.model_id,
        quantization_config=quantization_config,
        device_map="auto" if torch.cuda.is_available() else None,
        torch_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float32,
    )

    if config.use_qlora:
        model = prepare_model_for_kbit_training(model)

    return model, tokenizer


def write_metadata(config: TrainConfig, train_size: int, eval_size: int, output_dir: Path) -> None:
    metadata = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "config": asdict(config),
        "dataset": {
            "train_size": train_size,
            "eval_size": eval_size,
            "data_path": config.data_path,
        },
        "environment": {
            "torch": torch.__version__,
            "cuda_available": torch.cuda.is_available(),
            "cuda_device": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
        },
        "notes": [
            "This directory stores a LoRA adapter, not a full merged model.",
            "Load with the same base model id/revision before inference.",
            "Run regression evaluation before merge or deploy.",
        ],
    }
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / "training_metadata.json").write_text(json.dumps(metadata, ensure_ascii=False, indent=2), encoding="utf-8")


def main() -> None:
    config = TrainConfig()
    set_seed(config.seed)

    rows = load_jsonl(Path(config.data_path))
    for index, row in enumerate(rows):
        validate_row(row, index)
    train_rows, eval_rows = split_rows(rows, config.eval_ratio, config.seed)

    train_ds = Dataset.from_list(train_rows)
    eval_ds = Dataset.from_list(eval_rows)

    model, tokenizer = load_model_and_tokenizer(config)

    peft_config = LoraConfig(
        r=config.lora_r,
        lora_alpha=config.lora_alpha,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_dropout=config.lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
    )

    training_args = SFTConfig(
        output_dir=config.output_dir,
        seed=config.seed,
        data_seed=config.seed,
        num_train_epochs=config.num_train_epochs,
        per_device_train_batch_size=config.per_device_train_batch_size,
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        learning_rate=config.learning_rate,
        max_length=config.max_length,
        packing=False,
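        # assistant_only_loss relies on a chat template that marks assistant turns with
        # {% generation %}; if TRL reports no assistant tokens, set this to False.
        assistant_only_loss=True,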
        logging_steps=5,
        eval_strategy="steps",
        eval_steps=20,
        save_strategy="epoch",
        save_total_limit=2,
        bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
        fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),
        report_to="none",
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        peft_config=peft_config,
        processing_class=tokenizer,
    )

    trainer.model.print_trainable_parameters()
    trainer.train()
    trainer.save_model(config.output_dir)
    tokenizer.save_pretrained(config.output_dir)
    write_metadata(config, len(train_rows), len(eval_rows), Path(config.output_dir))
    print(f"saved_adapter={config.output_dir}")


if __name__ == "__main__":
    main()

4. Inference Sanity Check

The sanity check does not replace the evaluation on Day 28. It only confirms that the artifact loads and the output has the right shape.

from __future__ import annotations

import json

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


BASE_MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
ADAPTER_DIR = "artifacts/day27_support_lora_v1"


def main() -> None:
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID, use_fast=True)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL_ID,
        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
        device_map="auto" if torch.cuda.is_available() else None,
    )
    model = PeftModel.from_pretrained(base, ADAPTER_DIR)
    model.eval()

    messages = [{"role": "user", "content": "A customer reports being charged but the order was never created."}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=160,
            do_sample=False,
            temperature=None,
            top_p=None,
        )

    generated = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    print(generated)

    try:
        parsed = json.loads(generated)
    except json.JSONDecodeError:
        raise SystemExit("Sanity check failed: output is not valid JSON")

    missing = {"category", "priority", "answer"} - set(parsed)
    if missing:
        raise SystemExit(f"Sanity check failed: missing keys {sorted(missing)}")


if __name__ == "__main__":
    main()

5. Merge Adapter Notes

Merge when the serving stack needs a single model artifact or you want to reduce runtime complexity.

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-0.5B-Instruct"
adapter_dir = "artifacts/day27_support_lora_v1"
merged_dir = "artifacts/day27_support_merged_v1"

tokenizer = AutoTokenizer.from_pretrained(base_id, use_fast=True)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None,
)
model = PeftModel.from_pretrained(base, adapter_dir)
merged = model.merge_and_unload()
merged.save_pretrained(merged_dir, safe_serialization=True)
tokenizer.save_pretrained(merged_dir)

After merging:

Rerun the inference sanity check on the merged model.
Run the regression eval before/after.
Record which base model and which adapter the merged artifact was built from.
Do not delete the original adapter unless a rollback plan is in place.

6. Best Solution by Context

Context	Best first move	When to upgrade
JSON output often breaks format	Prompt + schema validation + retry	Fine-tune if failures keep recurring and the prompt gets too long/expensive
Missing new facts/private docs	RAG/tool	Fine-tune only to standardize wording/workflow
Inconsistent support tone	Small LoRA	Increase `r` or data when eval shows underfitting
Limited GPU, want to train 7B	QLoRA	Move to LoRA bf16 if you have enough GPU and need speed
Multi-domain adapters	Keep adapters separate	Merge when the domain is fixed and serving does not support adapters
High traffic, narrow task	Fine-tune/distill a small model	Deploy only when latency/cost beat the baseline
7. Production Readiness Checklist

8. References

Hugging Face PEFT docs: LoraConfig, PeftModel, merge_and_unload.
Hugging Face TRL docs: SFTTrainer, SFTConfig, conversational messages.
Hugging Face Transformers docs: BitsAndBytesConfig, 4-bit quantization.
LoRA paper: Low-Rank Adaptation of Large Language Models.
QLoRA paper: Efficient Finetuning of Quantized LLMs.

Exercises

Exercise 1: Prepare the Dataset

Create the file data/day27_support_sft.jsonl with at least 30 examples in this format:

{"messages":[{"role":"user","content":"A customer wants to change the delivery address."},{"role":"assistant","content":"{\"category\":\"shipping\",\"priority\":\"medium\",\"answer\":\"I have noted the request to change the delivery address. Please share the order ID and the new address so I can check whether it can be updated.\"}"}]}

Requirements:

At least 3 categories: billing, shipping, account.
At least 3 priorities: low, medium, high.
No real emails, phone numbers, card numbers, or tokens.
The assistant output must be a JSON string that parses.

Exercise 2: Run LoRA or QLoRA

Run training on a small model:

MODEL_ID=Qwen/Qwen2.5-0.5B-Instruct \
DATA_PATH=data/day27_support_sft.jsonl \
OUT_DIR=artifacts/day27_support_lora_v1 \
USE_QLORA=1 \
SEED=42 \
LORA_R=16 \
LORA_ALPHA=32 \
LORA_DROPOUT=0.05 \
python scripts/train_lora_sft.py

Record:

GPU name and VRAM.
max_length.
Batch size and gradient accumulation.
Trainable parameter percentage.
Final train loss/eval loss.
Training time.
Artifact size.

Exercise 3: Try Three LoRA Configs

Run at least 3 variants. The training script fixes target_modules inside LoraConfig, so edit that list for runs A and C:

Run	`r`	`alpha`	`target_modules`	Expectation
A	8	16	`q_proj,v_proj`	Cheap, low capacity
B	16	32	`q_proj,k_proj,v_proj,o_proj`	Balanced default
C	32	64	attention + MLP projections	Stronger, more expensive

Compare:

Peak VRAM.
Training time.
Format accuracy on 20 prompts.
Whether the output keeps the tone.
Whether there are signs of overfitting.

Exercise 4: Inference Sanity Check

Write 5 prompts that do not appear in the train set, for example:

A customer reports being charged but the order was never created.
A customer wants to change the delivery address.
A customer forgot their password and did not receive the recovery email.
A customer asks why a discount code cannot be applied.
A customer wants to cancel an order already handed over to the carrier.

Pass criteria:

The output parses as JSON.
It has category, priority, and answer.
answer does not invent specific policy when the input does not provide it.
The tone is consistent, polite, and concise.
There is a clear next action.
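
The first two pass criteria can be scored automatically. The sketch below is one assumed way to do it: it labels each raw model output and reports format accuracy plus a breakdown of failures. The function names are hypothetical, and the remaining criteria (no invented policy, tone, next action) still need human or rubric-based review.

```python
import json
from collections import Counter

REQUIRED_KEYS = {"category", "priority", "answer"}


def classify_output(text: str) -> str:
    """Label one model output as ok, not_json, not_object, or missing_keys."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return "not_json"
    if not isinstance(parsed, dict):
        return "not_object"
    if REQUIRED_KEYS - set(parsed):
        return "missing_keys"
    return "ok"


def format_report(outputs: list[str]) -> dict:
    """Summarize format accuracy and failure reasons over a batch of outputs."""
    counts = Counter(classify_output(output) for output in outputs)
    total = len(outputs)
    return {
        "total": total,
        "format_accuracy": counts["ok"] / total if total else 0.0,
        "failures": {label: n for label, n in counts.items() if label != "ok"},
    }
```

The same report can be reused for the "format accuracy on 20 prompts" comparison across the three LoRA configs.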

Exercise 5: Merge or Not

Answer with a short decision record:

Question: keep the adapter separate or merge?
Context:
Options:
Trade-off:
Decision:
Rollback: