Day 26: Dataset Preparation for Instruction Tuning

Objectives

After this lesson, you should be able to:

Explain what an instruction tuning dataset is and how it differs from a classification dataset.
Choose the right format among Alpaca, ShareGPT, and ChatML/messages for the context.
Design a schema with metadata, source, quality score, PII status, and split.
Perform cleaning, normalization, deduplication, and a train/validation/test split.
Use synthetic data in a controlled way, without turning the dataset into noise.
Answer the question: can this dataset be used in production, and under what conditions?
Prepare a 500-example dataset for Day 27 (LoRA/QLoRA).

TL;DR

An instruction tuning dataset is a set of examples describing the behavior you want the model to learn. Each record tells the model: when a user asks this kind of question, the assistant should answer with this format, tone, policy, and level of detail. Bad data produces bad behavior, even when the training code is correct.

The most important rule: quality > quantity. For small fine-tuning runs, 300-1,000 good, diverse, policy-compliant examples are usually more useful than 50,000 raw log examples containing PII, duplicates, wrong answers, or inconsistent style.

1. What is Instruction Tuning?

Instruction tuning teaches the model to map an instruction and its context to the desired response:

system policy + user instruction + optional history/input -> assistant response

Unlike pretraining, instruction tuning does not aim to cram new knowledge into the model. It is better suited to teaching the model how to behave:

Answer in a stable format: JSON, markdown, checklist, code review comment.
Keep tone and style consistent: concise, polite, professional, no blaming.
Follow a process: ask for clarification, escalate, refuse when permission is missing.
Learn domain behavior: how to reply to a billing ticket, review a PR, or write internal documentation.
Shorten prompts when the same instruction repeats across many requests.

Do not use instruction tuning as a replacement for RAG when knowledge changes frequently or the latest documents must be cited. When you need both custom behavior and internal knowledge, the best solution is usually RAG plus light fine-tuning: fine-tuning for format/style/policy, RAG for knowledge.

2. How does it differ from a Classification Dataset?

Criterion	Classification	Instruction tuning
Input	Text or features	Instruction, context, history, system policy
Output	Short label	Long text/code/JSON
Metric	Accuracy, F1, precision, recall	Format accuracy, task success, human preference, regression eval
Debug	Wrong labels are easy to spot	Harder, because many answers can be correct
Risk	Label noise	Hallucination, unsafe policy, PII memorization, style drift
Production concern	Threshold, class imbalance	Serving format, prompt compatibility, eval leakage, rollback

Classification example:

{"text":"Tôi bị tính phí 2 lần sau khi nâng cấp.", "label":"billing_issue"}

Instruction tuning example:

{"messages":[{"role":"user","content":"Tôi bị tính phí 2 lần sau khi nâng cấp."},{"role":"assistant","content":"Mình xin lỗi vì bất tiện này. Mình sẽ kiểm tra giao dịch nâng cấp và chuyển ticket sang billing nếu có dấu hiệu tính phí trùng. Bạn vui lòng gửi mã hóa đơn hoặc thời điểm giao dịch, không gửi số thẻ."}]}

Classification only teaches the model to assign a label. Instruction tuning teaches it to produce a complete response.

3. Choosing a Dataset Format

No format is best for every case. A good format is the one closest to how you will train and serve the model.

3.1. Alpaca Format

The Alpaca format suits simple single-turn tasks; it is easy to inspect by eye and easy to convert.

{"instruction":"Tóm tắt ticket sau thành 3 bullet.","input":"Khách báo bị tính phí 2 lần sau khi upgrade.","output":"- Khách bị tính phí 2 lần.\n- Sự cố xảy ra sau khi upgrade.\n- Cần kiểm tra billing và refund nếu đúng chính sách."}

Pros:

Easy to create with a spreadsheet or a script.
Easy for newcomers to understand.
Fits summarization, rewriting, classification expressed in natural language, and extraction.

Cons:

Roles such as system/user/assistant are not explicit.
Unnatural for multi-turn chat.
If production uses a chat API, you still have to convert to messages.

Use it when: the task is single-turn, no complex system prompt is needed yet, and the dataset is small and needs a quick review.

3.2. ShareGPT Format

The ShareGPT format is commonly used for older multi-turn conversation data, with the roles human and gpt.

{"conversations":[{"from":"human","value":"Khách bị tính phí 2 lần, nên trả lời sao?"},{"from":"gpt","value":"Bạn nên xin lỗi ngắn gọn, xác nhận sẽ kiểm tra billing, hỏi thông tin cần thiết và không hứa refund khi chưa có kết quả."}]}

Pros:

Fits when the source data is already conversational.
Preserves multiple turns.
Supported by many older fine-tuning tools.

Cons:

Role names do not match modern chat APIs.
No system role unless you extend the schema.
Conversion needs care so that roles are not swapped and turns are not lost.

Use it when: you have conversation logs that are already redacted and you want to keep multi-turn context.

3.3. ChatML / Messages Format

The messages format is the closest to a production chat API. Each record holds a list of messages, each with a role and content.

{"messages":[{"role":"system","content":"You are a support assistant. Answer in Vietnamese, concise, no refund promise."},{"role":"user","content":"Khách bị tính phí 2 lần sau khi upgrade."},{"role":"assistant","content":"Mình xin lỗi vì bất tiện này. Mình sẽ giúp kiểm tra giao dịch upgrade và chuyển ticket sang billing nếu có dấu hiệu tính phí trùng. Bạn vui lòng gửi mã hóa đơn hoặc thời điểm giao dịch, không gửi số thẻ."}]}

Pros:

Closest to production serving.
Explicit roles: system, user, assistant.
Fits multi-turn, policy, refusal, and tool-style behavior.
Easy to mask the loss so that only assistant responses are trained on, if the trainer supports it.

Cons:

Longer than Alpaca.
Role order must be validated strictly.
Some trainers need the chat template applied before training.

Use it when: production is a chat assistant, a system instruction is needed, multi-turn is needed, or you want to reduce the mismatch between training and inference.
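To make "apply the chat template" concrete, here is a minimal sketch that renders a messages list into a ChatML-style string. It assumes the common ChatML markers `<|im_start|>` and `<|im_end|>`; a given model may use different special tokens, so for real training prefer the template shipped with the model's tokenizer. The function name `render_chatml` is illustrative only.

```python
from typing import Dict, List

# Assumption: ChatML-style markers. Check your model's tokenizer before relying on them.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render role/content messages into one ChatML-style string."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant turn.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "Answer in Vietnamese, concise."},
    {"role": "user", "content": "Khách bị tính phí 2 lần sau khi upgrade."},
]
print(render_chatml(example, add_generation_prompt=True))
```

The rendered text is longer than the raw content because every turn gains marker tokens, which is the token overhead this format trades for explicit roles.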

3.4. Decision Rule

Need	Recommended format
Single-turn, narrow task, easy to review	Alpaca
Multi-turn conversation history	ShareGPT or messages
Production uses a chat API	ChatML/messages
A clear system instruction is needed	ChatML/messages
Conversion across many tools	Alpaca or messages
Tool calling later on	Messages with separate metadata/tool fields

Best solution for most production chat assistants: store the canonical dataset in messages format, then write converters to Alpaca/ShareGPT if the trainer requires them.
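The converters mentioned above can be sketched as follows. This is an illustration under stated assumptions, not a standard library: the function names are hypothetical, the Alpaca input is joined to the instruction with a blank line (one common choice), and ShareGPT records are assumed to use only the human and gpt roles shown earlier.

```python
from typing import Any, Dict

# Assumption: only the two ShareGPT roles used in this lesson.
SHAREGPT_ROLE_MAP = {"human": "user", "gpt": "assistant"}


def alpaca_to_messages(record: Dict[str, str]) -> Dict[str, Any]:
    """Convert one Alpaca record (instruction/input/output) to messages."""
    instruction = record["instruction"].strip()
    extra = record.get("input", "").strip()
    user_content = f"{instruction}\n\n{extra}" if extra else instruction
    return {"messages": [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["output"]},
    ]}


def sharegpt_to_messages(record: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one ShareGPT record to messages, failing loudly on unknown roles."""
    messages = []
    for turn in record["conversations"]:
        speaker = turn["from"]
        if speaker not in SHAREGPT_ROLE_MAP:
            raise ValueError(f"unknown ShareGPT role: {speaker!r}")
        messages.append({"role": SHAREGPT_ROLE_MAP[speaker], "content": turn["value"]})
    return {"messages": messages}


def messages_to_alpaca(record: Dict[str, Any]) -> Dict[str, str]:
    """Convert a single-turn messages record to Alpaca.

    The system message, if any, is dropped, so this direction is lossy.
    """
    turns = [m for m in record["messages"] if m["role"] != "system"]
    if [m["role"] for m in turns] != ["user", "assistant"]:
        raise ValueError("only single-turn records can be converted to Alpaca")
    return {"instruction": turns[0]["content"], "input": "", "output": turns[1]["content"]}
```

Because the messages-to-Alpaca direction loses the system prompt and rejects multi-turn records, keeping messages as the canonical store and converting outward is the safer direction.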

4. Recommended Schema

A production-friendly record should carry both the training payload and audit metadata.

{"id":"support_000001","domain":"customer_support","task":"billing_reply","source":"human_seed","quality_score":5,"pii_status":"redacted","license":"internal_training_allowed","group_id":"ticket_family_2026_001","messages":[{"role":"system","content":"You are a support assistant for a SaaS product. Answer in Vietnamese, concise, and do not promise refunds without billing verification."},{"role":"user","content":"Tôi bị tính phí 2 lần sau khi upgrade, email của tôi là [EMAIL]."},{"role":"assistant","content":"Mình xin lỗi vì bất tiện này. Mình sẽ kiểm tra giao dịch upgrade và chuyển ticket sang billing nếu có dấu hiệu tính phí trùng. Bạn không cần gửi thêm thông tin thẻ qua chat."}]}

Recommended fields:

Field	Required	Meaning
`id`	Yes	Stable id for tracing, review, and rollback
`domain`	Yes	Domain such as `customer_support`, `code_review`, `technical_writing`
`task`	Yes	Specific task type such as `billing_reply`, `bug_summary`
`source`	Yes	`human_seed`, `synthetic_reviewed`, `production_redacted`
`quality_score`	Yes	Score from 1 to 5; train only on records that meet the threshold
`pii_status`	Yes	`clean`, `redacted`, `needs_review`
`license`	Yes	Right to use the data for training
`group_id`	Recommended (required by the preparation script in this lesson)	Used to split without leakage across ticket/document/customer groups
`messages`	Yes, for chat data	List of role/content items
`split`	After splitting	`train`, `validation`, `test`

Do not store only text already rendered into a single prompt. You would lose the ability to check roles, redact individual parts, convert formats, and audit sources.

5. Data Guideline before creating data

A dataset guideline is like an API contract. If the guideline is vague, reviewers and synthetic generators will produce inconsistent output.

Example guideline for customer support:

Answer in Vietnamese with diacritics, concisely and politely.
Open by acknowledging the problem; do not blame the user.
Do not promise a refund, an account unlock, data deletion, or a billing change without a tool result.
If information is missing, ask for exactly the one or two pieces of information needed.
Do not ask the user to send a password, OTP, card number, secret, or token.
For billing/security/legal risk, escalate to a human or the responsible team.
Do not invent policy, SLA, root causes, or check results.
If the user shares PII, do not repeat it in the response.
For out-of-scope requests, refuse briefly and point to a safe way forward.
The response should contain a clear next step.

Good example:

{"messages":[{"role":"user","content":"Tài khoản của tôi bị khóa sau khi đổi mật khẩu."},{"role":"assistant","content":"Mình sẽ giúp bạn kiểm tra. Vì đây là vấn đề bảo mật, mình không thể mở khóa trực tiếp qua chat khi chưa xác minh tài khoản. Mình sẽ tạo ticket cho nhóm support và hướng dẫn bạn các bước xác minh tiếp theo."}]}

Bad example:

{"messages":[{"role":"user","content":"Tài khoản của tôi bị khóa sau khi đổi mật khẩu."},{"role":"assistant","content":"Chắc hệ thống lỗi, mình sẽ mở khóa ngay cho bạn."}]}

Why it is bad: it invents a cause, promises an action without permission, and skips security verification.

6. Cleaning and Normalization

Minimum cleaning:

Every JSONL line parses.
No empty instruction, user message, or assistant response.
Normalize whitespace, line endings, and control characters.
Remove HTML/script/boilerplate unless it is a signal the model should learn.
Drop or fix records with the wrong role order.
Drop records that exceed the context budget.
Check for language mismatch if the dataset requires Vietnamese.
Redact PII: email, phone, API key, token, card number, access token, sensitive customer id.
Set pii_status after redaction: clean, redacted, or needs_review.
Drop responses that contain wrong policy or serious hallucination.

Do not over-clean. If production users often abbreviate, misspell, or mix in English, you can keep some of that in user messages so the model learns to be robust. Assistant responses, however, should be clean, policy-compliant, and written with diacritics.
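The "assistant responses should carry diacritics" check can be approximated with a small heuristic. The sketch below measures the share of letters that carry a diacritic; the 0.1 threshold and the function names are assumptions for illustration, and the heuristic is only a flag for human review, not a language detector.

```python
import unicodedata
from typing import Any, Dict


def diacritic_ratio(text: str) -> float:
    """Share of letters that carry a diacritic (or are the letter đ/Đ)."""
    # Normalize to NFC first so a letter and its marks form one code point.
    letters = [ch for ch in unicodedata.normalize("NFC", text) if ch.isalpha()]
    if not letters:
        return 0.0
    marked = 0
    for ch in letters:
        decomposed = unicodedata.normalize("NFD", ch)
        # đ has no decomposition, so it is checked explicitly.
        if ch in "đĐ" or any(unicodedata.combining(c) for c in decomposed):
            marked += 1
    return marked / len(letters)


def assistant_lacks_diacritics(record: Dict[str, Any], min_ratio: float = 0.1) -> bool:
    """Flag a record if any assistant reply falls below the assumed ratio."""
    replies = [m["content"] for m in record["messages"] if m["role"] == "assistant"]
    return any(diacritic_ratio(reply) < min_ratio for reply in replies)
```

A reply written in English or in unaccented Vietnamese scores close to zero, so both kinds of mismatch end up in the same review queue.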

7. Deduplication

Duplicates make the model overfit and inflate eval results. Run dedup before the split, so the same example cannot land in both train and test.

Levels of dedup:

Exact duplicate: hash the normalized text.
Near duplicate: similarity by n-gram, MinHash, or embeddings when the dataset is large.
Group duplicate: same group_id, same ticket family, same document version.
Template duplicate: synthetic data that changes only a few words while the response stays nearly identical.

For a 500-example dataset, exact dedup plus a manual review of near-duplicates is usually enough. For larger datasets, add MinHash or embedding clustering.

Trade-off: dedup that is too aggressive can remove useful variants; dedup that is too weak lets the model learn repeated patterns and leaks examples into validation/test.
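For a dataset of a few hundred rows, near-duplicate candidates for manual review can be listed with word n-gram Jaccard similarity. This is a minimal sketch: the trigram size and the 0.8 threshold are assumptions to tune, and the pairwise loop is quadratic, so it does not replace MinHash on large datasets.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of word n-grams; short texts collapse to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(
    texts: Dict[str, str], threshold: float = 0.8, n: int = 3
) -> List[Tuple[str, str, float]]:
    """Return (id_a, id_b, score) for every pair at or above the threshold."""
    sets = {row_id: shingles(text, n) for row_id, text in texts.items()}
    pairs = []
    for (id_a, set_a), (id_b, set_b) in combinations(sets.items(), 2):
        score = jaccard(set_a, set_b)
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 3)))
    return pairs
```

The output is a review list, not an automatic delete list: a reviewer decides which variant of each pair is worth keeping.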

8. Train/Validation/Test Split

Suggested split:

Split	Ratio	Purpose
Train	80-90%	Train the adapter/model
Validation	5-10%	Choose checkpoint and hyperparameters, early stopping
Test/Golden	5-10%	Report before/after; never tune against it

Production rule: split by group/source when available. Do not let the same ticket, customer thread, document paragraph, or synthetic template appear in both train and test.

Bad example:

train: ticket_123 turn 1
test:  ticket_123 turn 2

The metric will be inflated because the model has already seen the same context.

Good example:

train: ticket_family_001..400
validation: ticket_family_401..450
test: ticket_family_451..500
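A quick leakage check after splitting is to confirm that no group appears in more than one split. The sketch below assumes each row carries the group_id and split fields from the schema above; the function name is illustrative.

```python
from collections import defaultdict
from typing import Any, Dict, List


def find_group_leakage(rows: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each leaking group_id to the sorted list of splits it appears in."""
    splits_by_group = defaultdict(set)
    for row in rows:
        splits_by_group[row["group_id"]].add(row["split"])
    return {
        group_id: sorted(splits)
        for group_id, splits in splits_by_group.items()
        if len(splits) > 1
    }


rows = [
    {"id": "r1", "group_id": "ticket_123", "split": "train"},
    {"id": "r2", "group_id": "ticket_123", "split": "test"},
    {"id": "r3", "group_id": "ticket_456", "split": "train"},
]
print(find_group_leakage(rows))
```

An empty result is the condition to enforce before training; any entry means a whole group should be moved into a single split.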

9. Quality > Quantity

A good dataset has these properties:

Every example complies with policy.
Tasks are diverse but share the same target behavior.
Responses have a stable format.
Edge cases are covered: missing information, angry users, unsafe requests, PII, escalation.
There are enough negative/refusal examples.
Metadata is clear enough to audit.
There is a test set that is never used for tuning.

A poor dataset often has:

Many duplicates.
Long, rambling responses.
Repeated synthetic patterns.
Raw logs containing PII.
Mixed style: sometimes casual, sometimes overly formal, sometimes English, sometimes Vietnamese without diacritics.
Output that exceeds its authority: promising refunds, promising data deletion, fabricating check results.

10. Synthetic Data

Synthetic data helps when you do not yet have enough examples, but it needs guardrails.

Recommended workflow:

Clear guideline
  -> 30-50 high-quality human seed examples
  -> generate variants by task/domain/edge case
  -> validate schema
  -> redact PII, whether fake or real
  -> dedup
  -> human review of a sample
  -> tag source = synthetic_reviewed
  -> split by group/template

Use synthetic data for:

Format variants.
Rare edge cases.
Tone/style examples.
Refusal behavior.
Domain templates with a clear source.

Do not use synthetic data when:

There are no human seed examples.
The generator invents facts, policy, or product behavior.
There is no human review.
The synthetic output comes from the same model that will be used for evaluation.
The dataset is dominated by a single template.

Best practice: for the first 500 examples, start from 30-50 human seeds and fill the remaining 450-470 slots with synthetic data. Generate more candidates than you need, because validation, dedup, and review will discard some, then keep only the best 500 records. If you have production data that is already redacted and cleared for training, mix it in to make the dataset more realistic.
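One way to notice a dataset dominated by a single template is to measure the share of the most frequent grouping value. The sketch below uses group_id as the grouping key by default; the function name and any alert threshold you apply to the result are assumptions for illustration.

```python
from collections import Counter
from typing import Any, Dict, List, Optional, Tuple


def dominant_share(rows: List[Dict[str, Any]], key: str = "group_id") -> Tuple[Optional[str], float]:
    """Return the most frequent value of `key` and its share of all rows."""
    counts = Counter(row[key] for row in rows)
    if not counts:
        return None, 0.0
    value, count = counts.most_common(1)[0]
    return value, count / len(rows)


rows = [
    {"group_id": "tpl_a", "source": "synthetic_reviewed"},
    {"group_id": "tpl_a", "source": "synthetic_reviewed"},
    {"group_id": "tpl_a", "source": "synthetic_reviewed"},
    {"group_id": "tpl_b", "source": "human_seed"},
]
print(dominant_share(rows))
```

The same function can be pointed at task or source to see whether one task type or one data source crowds out the rest.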

11. Privacy, License, and Data Ownership

Fine-tuning carries a memorization risk. If a string must never appear in model output, do not put it in the train set.

Privacy checklist:

Is there consent or a legal right to train on the data?
Is there PII to redact: email, phone, address, full name, IP, account id?
Are there secrets, tokens, API keys, or internal logs?
Is there confidential customer data?
Is there a retention/deletion policy for the dataset and the model artifact?
May the adapter/model be shared publicly?
Does the public dataset's license allow commercial training?
Can a model version be traced back to a dataset version?

Production rule: records with pii_status = needs_review must not enter the train set. Public data is not cleared for commercial fine-tuning by default.

12. Performance and Cost Concerns

Dataset preparation directly affects training cost:

Long sequences sharply increase VRAM use and training time. Track p50/p95 token length.
Short, focused responses are usually better than long, rambling ones.
Duplicates waste compute and increase overfitting.
Packing several short examples into one sequence can raise throughput if the trainer supports it.
A 500-example dataset is suitable for learning a narrow format/style, not broad knowledge.
For QLoRA on Day 27, keep the practical max length around 512-2,048 tokens, depending on the GPU and model.
Choosing the messages format may require applying a chat template, which increases token length compared with Alpaca.

Trade-off: truncating data lowers cost but may drop important context; keeping long context helps multi-turn learning but costs VRAM and makes it easier to learn noise.
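The p50/p95 tracking above can be sketched without any tokenizer by counting whitespace-separated words. That count is only a rough proxy and is an assumption of this sketch: a real tokenizer usually produces more tokens than words, so use the tokenizer of the model you will train for the final numbers. Percentiles use the nearest-rank method.

```python
from typing import Any, Dict, List


def percentile(values: List[int], pct: int) -> int:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Ceiling division in integers avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def approx_length(record: Dict[str, Any]) -> int:
    """Whitespace word count over all messages (a proxy, not a token count)."""
    return sum(len(m["content"].split()) for m in record["messages"])


def length_report(records: List[Dict[str, Any]], max_len: int) -> Dict[str, int]:
    """Summarize lengths and count records over the assumed budget."""
    lengths = [approx_length(record) for record in records]
    return {
        "p50": percentile(lengths, 50),
        "p95": percentile(lengths, 95),
        "over_budget": sum(length > max_len for length in lengths),
    }
```

A large gap between p50 and p95 suggests a few long records are driving memory use, which makes them the first candidates for trimming or removal.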

13. Can it be used in production?

Yes. The dataset from this lesson can serve as input to production fine-tuning if all of the following hold:

You have the right to use the data for training and to deploy the model/adapters.
There is no PII or secret, or it has been redacted and reviewed.
The schema is validated automatically, with no record that has a wrong role, order, or field.
Dedup runs before the split, and the split is by group to reduce leakage.
There is a dataset card, metadata, a version, a changelog, and a mapping to the model artifact.
There is a separate golden test set that is not used for tuning.
A representative sample has been reviewed by humans, especially synthetic data and edge cases.
There is a before/after fine-tune eval on Day 28: format accuracy, task success, safety regression, latency/cost.
There is a rollback plan in case the adapter produces bad behavior.

If these conditions are not met, the dataset can still be used for a lab or prototype, but should not yet be used to train a model that goes to production.

14. Quick checklist

15. Self-check quiz

How does an instruction tuning dataset differ from a classification dataset?
When should you choose Alpaca over the messages format?
Why should the canonical dataset for a chat assistant keep the system, user, and assistant roles?
Why must dedup run before the split?
How do near-duplicates in synthetic data cause harm?
Why should you not train on raw, unredacted customer logs?
What should quality_score be used for?
When is a grouped split better than a random split?
Should fine-tuning be used to cram in internal documents? Why?
Under what conditions can a 500-example dataset be production-ready?

Materials

1. Recommended directory structure

instruction_dataset/
  raw/
    human_seed.jsonl
    synthetic_candidates.jsonl
  processed/
    dataset_clean.jsonl
    dataset_split.jsonl
  reports/
    validation_report.md
    review_sample.md
  dataset_card.md
  metadata.json
  prepare_dataset.py

Principles:

raw/ holds source data that you are allowed to process, but never commit PII/secrets.
processed/ holds data that has been normalized, redacted, deduplicated, and split.
reports/ stores validation results, rejected records, and samples that need human review.
dataset_card.md explains the purpose, sources, license, privacy, and known limitations.
metadata.json ties the dataset version to the model/adapters from Day 27.
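One way to tie a dataset version to an adapter is to record a content hash of the processed file next to the adapter name. This is a minimal sketch; the helper names and the keys of the returned dictionary are assumptions for illustration, not fields the preparation script writes.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable


def sha256_of_chunks(chunks: Iterable[bytes]) -> str:
    """Hash a stream of byte chunks with SHA-256."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets are not loaded at once."""
    with path.open("rb") as handle:
        return sha256_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def link_dataset_to_adapter(dataset_path: Path, dataset_version: str, adapter_name: str) -> Dict[str, str]:
    """Build a small traceability record (hypothetical key names)."""
    return {
        "dataset_version": dataset_version,
        "dataset_sha256": file_sha256(dataset_path),
        "adapter_name": adapter_name,
    }
```

If the hash stored with an adapter no longer matches the file on disk, the dataset changed after training and the adapter can no longer be traced to it with confidence.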

2. Canonical JSONL schema

Each line is one JSON object.

{"id":"support_000001","domain":"customer_support","task":"billing_reply","source":"human_seed","quality_score":5,"pii_status":"clean","license":"internal_training_allowed","group_id":"ticket_family_000001","messages":[{"role":"system","content":"You are a support assistant for a SaaS product. Answer in Vietnamese, concise, and do not promise refunds without billing verification."},{"role":"user","content":"Tôi bị tính phí 2 lần sau khi upgrade."},{"role":"assistant","content":"Mình xin lỗi vì bất tiện này. Mình sẽ kiểm tra giao dịch upgrade và chuyển ticket sang billing nếu có dấu hiệu tính phí trùng. Bạn vui lòng gửi mã hóa đơn hoặc thời điểm giao dịch, không gửi số thẻ."}]}

Required fields:

id: unique stable id.
domain: the dataset's domain.
task: the specific task.
source: human_seed, synthetic_reviewed, production_redacted.
quality_score: integer 1-5.
pii_status: clean, redacted, needs_review.
license: the right to use the data.
group_id: used for the grouped split.
messages: list of messages with valid roles.

Allowed roles: system, user, assistant.

Valid role order:

An optional system message first.
Then strictly alternating user -> assistant pairs.
The record must end with an assistant message.
No empty assistant message.

3. Near-production dataset preparation script

The script below uses only the Python standard library so it is easy to run. It validates the schema, normalizes records, redacts common PII, deduplicates, splits by group_id, and writes a basic dataset card and metadata. All outputs are written to the directory given by --out-dir.

Save it as instruction_dataset/prepare_dataset.py, put the input at raw/input.jsonl, then run:

python3 prepare_dataset.py --input raw/input.jsonl --out-dir processed --dataset-name support_instruction_v1

from __future__ import annotations

import argparse
import hashlib
import json
import random
import re
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

ALLOWED_ROLES = {"system", "user", "assistant"}
ALLOWED_PII_STATUS = {"clean", "redacted", "needs_review"}
ALLOWED_SOURCES = {"human_seed", "synthetic_reviewed", "production_redacted"}
ALLOWED_LICENSES = {"internal_training_allowed", "public_commercial_allowed", "research_only"}

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
PHONE_RE = re.compile(r"\b(?:\+?\d[\d .-]{8,}\d)\b")
API_KEY_RE = re.compile(r"\b(?:sk|pk|api|token|key)[-_]?[A-Za-z0-9]{16,}\b", re.IGNORECASE)
CREDIT_CARD_RE = re.compile(r"\b(?:\d[ -]*?){13,19}\b")
CONTROL_CHARS_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
WHITESPACE_RE = re.compile(r"[ \t]+")


@dataclass
class PreparedDataset:
    rows: list[dict[str, Any]]
    errors: list[str]
    warnings: list[str]
    counters: Counter[str]


def load_jsonl(path: Path) -> tuple[list[dict[str, Any]], list[str]]:
    rows: list[dict[str, Any]] = []
    errors: list[str] = []
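    # Split on "\n" only: str.splitlines() also breaks on characters such as U+2028,
    # which may legally appear inside a JSON string and would cut a record in two.
    for line_no, line in enumerate(path.read_text(encoding="utf-8").split("\n"), start=1):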
        if not line.strip():
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {line_no}: invalid JSON: {exc.msg}")
            continue
        if not isinstance(value, dict):
            errors.append(f"line {line_no}: row must be an object")
            continue
        value["_line_no"] = line_no
        rows.append(value)
    return rows, errors


def normalize_text(text: str) -> str:
    text = CONTROL_CHARS_RE.sub("", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(WHITESPACE_RE.sub(" ", part).strip() for part in text.split("\n"))
    return text.strip()


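def redact_pii(text: str) -> tuple[str, bool]:
    original = text
    # Order matters. Secrets go first so the digits inside a key are not
    # picked up by the phone pattern.
    text = API_KEY_RE.sub("[SECRET]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    # Cards go before phones: PHONE_RE also matches 13-19 digit card numbers,
    # so running it first would label them [PHONE] instead of [CARD].
    text = CREDIT_CARD_RE.sub("[CARD]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text, text != original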


def validate_messages(row: dict[str, Any], row_id: str) -> list[str]:
    errors: list[str] = []
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return [f"{row_id}: messages must be a non-empty list"]

    roles: list[str] = []
    for idx, message in enumerate(messages):
        if not isinstance(message, dict):
            errors.append(f"{row_id}: messages[{idx}] must be an object")
            continue
        role = message.get("role")
        content = message.get("content")
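        # The isinstance check avoids a TypeError when role is an unhashable JSON value.
        if not isinstance(role, str) or role not in ALLOWED_ROLES: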
            errors.append(f"{row_id}: messages[{idx}].role is invalid: {role!r}")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"{row_id}: messages[{idx}].content is empty")
        roles.append(role)

    start = 1 if roles and roles[0] == "system" else 0
    expected = "user"
    for idx in range(start, len(roles)):
        if roles[idx] != expected:
            errors.append(f"{row_id}: expected role {expected!r} at messages[{idx}], got {roles[idx]!r}")
            break
        expected = "assistant" if expected == "user" else "user"
    if roles and roles[-1] != "assistant":
        errors.append(f"{row_id}: last message must be assistant")
    return errors


def validate_required_fields(row: dict[str, Any]) -> list[str]:
    row_id = str(row.get("id") or f"line_{row.get('_line_no', 'unknown')}")
    errors: list[str] = []
    required_string_fields = ["id", "domain", "task", "source", "pii_status", "license", "group_id"]
    for field in required_string_fields:
        if not isinstance(row.get(field), str) or not row[field].strip():
            errors.append(f"{row_id}: {field} must be a non-empty string")

    quality_score = row.get("quality_score")
    # bool is a subclass of int, so reject JSON true/false explicitly.
    if isinstance(quality_score, bool) or not isinstance(quality_score, int) or not 1 <= quality_score <= 5:
        errors.append(f"{row_id}: quality_score must be an integer from 1 to 5")
    allowed_values = {"source": ALLOWED_SOURCES, "pii_status": ALLOWED_PII_STATUS, "license": ALLOWED_LICENSES}
    for field, allowed in allowed_values.items():
        value = row.get(field)
        # Non-strings were already reported above; testing them against a set
        # would raise TypeError for unhashable JSON values such as lists.
        if isinstance(value, str) and value not in allowed:
            errors.append(f"{row_id}: {field} must be one of {sorted(allowed)}")
    errors.extend(validate_messages(row, row_id))
    return errors


def canonical_training_text(row: dict[str, Any]) -> str:
    parts = []
    for message in row["messages"]:
        parts.append(f"{message['role']}:{normalize_text(message['content']).lower()}")
    return "\n".join(parts)


def normalize_and_redact_row(row: dict[str, Any]) -> tuple[dict[str, Any], bool]:
    clean = {key: value for key, value in row.items() if not key.startswith("_")}
    redacted_any = False
    for key in ["id", "domain", "task", "source", "pii_status", "license", "group_id"]:
        if isinstance(clean.get(key), str):
            clean[key] = normalize_text(clean[key])
    normalized_messages = []
    for message in clean["messages"]:
        content = normalize_text(message["content"])
        content, redacted = redact_pii(content)
        redacted_any = redacted_any or redacted
        normalized_messages.append({"role": message["role"], "content": content})
    clean["messages"] = normalized_messages
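    if redacted_any and clean.get("pii_status") != "needs_review":
        # Regex redaction alone must not clear a record flagged needs_review;
        # otherwise it would slip past the needs_review filter in prepare().
        clean["pii_status"] = "redacted"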
    return clean, redacted_any


def split_by_group(rows: list[dict[str, Any]], seed: int) -> list[dict[str, Any]]:
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for row in rows:
        grouped[row["group_id"]].append(row)

    groups = list(grouped)
    random.Random(seed).shuffle(groups)
    total_rows = len(rows)
    train_limit = int(total_rows * 0.8)
    validation_limit = int(total_rows * 0.9)

    output: list[dict[str, Any]] = []
    seen_count = 0
    for group_id in groups:
        if seen_count < train_limit:
            split = "train"
        elif seen_count < validation_limit:
            split = "validation"
        else:
            split = "test"
        for row in grouped[group_id]:
            row["split"] = split
            output.append(row)
        seen_count += len(grouped[group_id])
    return output


def prepare(rows: list[dict[str, Any]], parse_errors: list[str], seed: int, min_quality: int) -> PreparedDataset:
    errors = list(parse_errors)
    warnings: list[str] = []
    counters: Counter[str] = Counter()
    valid_rows: list[dict[str, Any]] = []
    seen_ids: set[str] = set()
    seen_hashes: set[str] = set()

    for raw in rows:
        row_id = str(raw.get("id") or f"line_{raw.get('_line_no', 'unknown')}")
        field_errors = validate_required_fields(raw)
        if field_errors:
            errors.extend(field_errors)
            counters["invalid_schema"] += 1
            continue
        if raw["id"] in seen_ids:
            errors.append(f"{row_id}: duplicate id")
            counters["duplicate_id"] += 1
            continue
        seen_ids.add(raw["id"])

        row, redacted = normalize_and_redact_row(raw)
        if row["quality_score"] < min_quality:
            warnings.append(f"{row_id}: dropped because quality_score < {min_quality}")
            counters["dropped_low_quality"] += 1
            continue
        if row["pii_status"] == "needs_review":
            warnings.append(f"{row_id}: dropped because pii_status is needs_review")
            counters["dropped_needs_review"] += 1
            continue
        if row["license"] == "research_only":
            warnings.append(f"{row_id}: dropped because license is research_only")
            counters["dropped_license"] += 1
            continue
        if redacted:
            counters["redacted"] += 1

        digest = hashlib.sha256(canonical_training_text(row).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            warnings.append(f"{row_id}: dropped exact duplicate")
            counters["dropped_duplicate"] += 1
            continue
        seen_hashes.add(digest)
        valid_rows.append(row)

    split_rows = split_by_group(valid_rows, seed)
    counters.update(Counter(row["split"] for row in split_rows))
    return PreparedDataset(rows=split_rows, errors=errors, warnings=warnings, counters=counters)


def write_jsonl(path: Path, rows: list[dict[str, Any]]) -> None:
    path.write_text(
        "\n".join(json.dumps(row, ensure_ascii=False, sort_keys=True) for row in rows) + ("\n" if rows else ""),
        encoding="utf-8",
    )


def write_report(path: Path, result: PreparedDataset) -> None:
    lines = ["# Validation Report", ""]
    lines.append("## Counters")
    for key, value in sorted(result.counters.items()):
        lines.append(f"- {key}: {value}")
    lines.extend(["", "## Errors"])
    lines.extend(f"- {error}" for error in result.errors[:200])
    if len(result.errors) > 200:
        lines.append(f"- ... truncated {len(result.errors) - 200} more errors")
    lines.extend(["", "## Warnings"])
    lines.extend(f"- {warning}" for warning in result.warnings[:200])
    if len(result.warnings) > 200:
        lines.append(f"- ... truncated {len(result.warnings) - 200} more warnings")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def write_dataset_card(path: Path, dataset_name: str, result: PreparedDataset) -> None:
    split_counts = Counter(row["split"] for row in result.rows)
    source_counts = Counter(row["source"] for row in result.rows)
    task_counts = Counter(row["task"] for row in result.rows)
    lines = [
        f"# Dataset Card: {dataset_name}",
        "",
        "## Purpose",
        "Instruction tuning dataset for a domain assistant. Replace this section with the exact behavior goal before training.",
        "",
        "## Format",
        "JSONL messages format with role/content pairs and metadata fields.",
        "",
        "## Size",
        f"- total: {len(result.rows)}",
        f"- train: {split_counts.get('train', 0)}",
        f"- validation: {split_counts.get('validation', 0)}",
        f"- test: {split_counts.get('test', 0)}",
        "",
        "## Sources",
    ]
    lines.extend(f"- {source}: {count}" for source, count in sorted(source_counts.items()))
    lines.extend(["", "## Tasks"])
    lines.extend(f"- {task}: {count}" for task, count in sorted(task_counts.items()))
    lines.extend(
        [
            "",
            "## Privacy",
            "PII is either absent or redacted by the preparation script. Records marked needs_review are excluded.",
            "",
            "## License",
            "Only rows with training-allowed licenses are included. Verify this before production use.",
            "",
            "## Known Limitations",
            "- Regex redaction does not catch every possible PII type.",
            "- Near-duplicate detection still requires human review or a stronger similarity pipeline for large datasets.",
            "- Synthetic data quality depends on seed examples and review process.",
        ]
    )
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def write_metadata(path: Path, dataset_name: str, input_path: Path, result: PreparedDataset, seed: int) -> None:
    metadata = {
        "dataset_name": dataset_name,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "input_path": str(input_path),
        "format": "messages_jsonl",
        "seed": seed,
        "rows": len(result.rows),
        "counters": dict(result.counters),
        "production_ready": len(result.errors) == 0 and len(result.rows) > 0,
    }
    path.write_text(json.dumps(metadata, ensure_ascii=False, indent=2, sort_keys=True) + "\n", encoding="utf-8")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, type=Path)
    parser.add_argument("--out-dir", required=True, type=Path)
    parser.add_argument("--dataset-name", required=True)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--min-quality", type=int, default=4)
    args = parser.parse_args()

    args.out_dir.mkdir(parents=True, exist_ok=True)
    rows, parse_errors = load_jsonl(args.input)
    result = prepare(rows, parse_errors, seed=args.seed, min_quality=args.min_quality)

    write_jsonl(args.out_dir / "dataset_split.jsonl", result.rows)
    write_report(args.out_dir / "validation_report.md", result)
    write_dataset_card(args.out_dir / "dataset_card.md", args.dataset_name, result)
    write_metadata(args.out_dir / "metadata.json", args.dataset_name, args.input, result, args.seed)

    print(f"rows_in: {len(rows)}")
    print(f"rows_out: {len(result.rows)}")
    print(f"errors: {len(result.errors)}")
    print(f"warnings: {len(result.warnings)}")
    for key, value in sorted(result.counters.items()):
        print(f"{key}: {value}")


if __name__ == "__main__":
    main()

4. Dataset card template

# Dataset Card: support_instruction_v1

## Purpose
Instruction tuning for a support assistant that handles billing/account/how-to requests in Vietnamese.

## Behavior Goal
- Answer concisely and politely, with a next step.
- Do not promise a refund, an unlock, data deletion, or a billing change without a tool result.
- Escalate billing/security/legal risk.

## Format
JSONL messages format.

## Size
- Total: 500
- Train: 400
- Validation: 50
- Test: 50

## Sources
- Human seed: 50
- Synthetic reviewed: 350
- Production redacted: 100

## Privacy
- PII redacted.
- No password, OTP, API key, access token or raw customer identifier.
- Records marked needs_review are excluded.

## License and Ownership
Internal training allowed. Adapter is not approved for public release unless legal review approves.

## Quality Process
- Schema validation passed.
- Exact dedup completed before split.
- 50 random examples manually reviewed.
- 20 golden examples reserved for Day 28 eval.

## Known Limitations
- Regex redaction may miss rare PII patterns.
- Near-duplicate synthetic variants require manual review.
- Dataset teaches response behavior, not product knowledge.

5. Metadata template

{
  "dataset_name": "support_instruction_v1",
  "format": "messages_jsonl",
  "version": "2026-05-10",
  "owner": "ai-engineer-course",
  "intended_model_family": "Qwen/LLaMA-compatible chat model",
  "rows": 500,
  "splits": {
    "train": 400,
    "validation": 50,
    "test": 50
  },
  "sources": {
    "human_seed": 50,
    "synthetic_reviewed": 350,
    "production_redacted": 100
  },
  "privacy": {
    "pii_redacted": true,
    "needs_review_excluded": true,
    "secret_scan_required": true
  },
  "training_allowed": true,
  "notes": "Use for behavior/style fine-tuning, not as knowledge source."
}
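
The counts in the metadata template should agree with each other: the split counts and the source counts both add up to `rows` (400 + 50 + 50 = 500 and 50 + 350 + 100 = 500). A minimal sketch of that arithmetic check follows. It assumes the field names from the template above; `metadata_problems` is a hypothetical helper, not part of the prepare script.

```python
def metadata_problems(meta):
    """Return a list of count mismatches; an empty list means the counts agree."""
    problems = []
    rows = meta["rows"]
    # Both breakdowns should account for every row exactly once.
    for key in ("splits", "sources"):
        total = sum(meta[key].values())
        if total != rows:
            problems.append(f"{key} sum to {total}, expected {rows}")
    return problems
```

Loading `processed/metadata.json` with `json.loads` and passing the result to this function before training is a cheap way to notice a card or metadata file that was edited by hand and drifted from the data.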

6. Review checklist for production

7. Common trade-offs

Decision	Benefit	Risk	Best solution by context
Messages format	Close to production chat APIs, explicit roles	More tokens, requires a chat template	Use as the canonical format for chat assistants
Alpaca format	Simple, easy to inspect	Weak for multi-turn and policy roles	Use for single-turn tasks or as converter output
Heavy synthetic data	Scales quickly, covers edge cases	Repeated patterns, hallucinated policy	Keep synthetic data under control, tag the source, review samples
Raw production logs	Very realistic	PII, consent, noise, unsafe responses	Use only after redaction, license check, and quality filtering
Random split	Fast	Leakage across tickets/documents	Use only when there are no groups and the examples are independent
Grouped split	More realistic eval	Split ratios can drift	Use for production datasets
Low max length cutoff	Lower VRAM/cost	Loses multi-turn context	Measure p95 token length, then choose the limit
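
The last row of the table recommends measuring p95 length before choosing a limit. A minimal sketch of that measurement follows. It is only an approximation: `approx_token_count` counts whitespace-separated words as a stand-in for real tokens, so replace it with the tokenizer of the target model before fixing an actual max length. All function names here are hypothetical.

```python
import math


def approx_token_count(messages):
    """Rough proxy: whitespace-separated words across all message contents."""
    return sum(len(m["content"].split()) for m in messages)


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def length_report(rows, pct=95):
    """Summarize per-record lengths so a limit can be chosen from data."""
    counts = [approx_token_count(row["messages"]) for row in rows]
    return {"rows": len(counts), "max": max(counts), f"p{pct}": percentile(counts, pct)}
```

If p95 sits far below the maximum, a limit near p95 truncates only a few long records. If the two are close, a low limit would cut into many multi-turn examples.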

8. Production answer

Can it be used in production? Yes, if the dataset passes schema validation, privacy/license review, dedup, grouped split, human review, and evaluation before and after fine-tuning. A dataset generated quickly from unreviewed synthetic data is suitable only for a prototype or lab.

Exercise

Goal

Create a 500-example dataset for one domain, ready to use in Day 27 LoRA/QLoRA and Day 28 evaluation.

Choose one of four domains:

Customer support.
Code review.
Technical writing.
Internal policy Q&A.

Final output:

instruction_dataset/
  raw/input.jsonl
  processed/dataset_split.jsonl
  processed/dataset_card.md
  processed/metadata.json
  processed/validation_report.md

Part 1: Write a behavior guideline

Write a 10-line guideline for the assistant. Template:

# Behavior Guideline

Domain: customer_support
Language: Vietnamese with English technical terms when needed

1. ...
2. ...
10. ...

Checklist:

Covers tone/style.
Covers response format.
States when to ask for clarification.
States when to escalate.
States what is forbidden: do not invent facts, do not ask for passwords/OTPs/secrets.
Includes a privacy rule.

Part 2: Create 30-50 human seed examples

Each example uses the messages format:

{"id":"support_seed_0001","domain":"customer_support","task":"billing_reply","source":"human_seed","quality_score":5,"pii_status":"clean","license":"internal_training_allowed","group_id":"support_seed_0001","messages":[{"role":"system","content":"You are a support assistant for a SaaS product. Answer in Vietnamese, concise, and do not promise refunds without billing verification."},{"role":"user","content":"Tôi bị tính phí 2 lần sau khi upgrade."},{"role":"assistant","content":"Mình xin lỗi vì bất tiện này. Mình sẽ kiểm tra giao dịch upgrade và chuyển ticket sang billing nếu có dấu hiệu tính phí trùng. Bạn vui lòng gửi mã hóa đơn hoặc thời điểm giao dịch, không gửi số thẻ."}]}

Suggested distribution for customer support:

Task	Number of examples
billing_reply	10
account_access	8
how_to	8
bug_report	6
cancellation	5
security_escalation	5
refusal_or_out_of_scope	3
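
To see how far a seed file is from the suggested distribution, the counts in the table can be compared with the `task` field of each record. The sketch below assumes the rows are already loaded as dictionaries; `TARGET_COUNTS` copies the table above, and `task_gaps` is a hypothetical helper.

```python
from collections import Counter

# Suggested seed counts per task, copied from the table above (45 in total).
TARGET_COUNTS = {
    "billing_reply": 10,
    "account_access": 8,
    "how_to": 8,
    "bug_report": 6,
    "cancellation": 5,
    "security_escalation": 5,
    "refusal_or_out_of_scope": 3,
}


def task_gaps(rows, target=TARGET_COUNTS):
    """Return {task: number of examples still missing} for under-covered tasks."""
    counts = Counter(row.get("task") for row in rows)
    return {
        task: wanted - counts[task]
        for task, wanted in target.items()
        if counts[task] < wanted
    }
```

An empty result means every task has reached its suggested count. The numbers are only a starting point for the customer support domain.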

Part 3: Generate synthetic candidates

Use the guideline and seed examples to generate more candidates. Do not keep every candidate right away. Each candidate you keep must meet these conditions:

source = synthetic_reviewed only after the record has been reviewed.
quality_score >= 4.
group_id assigned by template or scenario family.
No real PII.
No invented product policy.

Suggested prompt for generating synthetic data:

You are creating instruction tuning data for a Vietnamese SaaS support assistant.

Use this behavior guideline:
<paste guideline>

Generate 20 JSONL records in messages format.
Requirements:
- Vietnamese with proper diacritics.
- Include metadata fields: id, domain, task, source, quality_score, pii_status, license, group_id, messages.
- source must be synthetic_reviewed.
- pii_status must be clean.
- license must be internal_training_allowed.
- Vary user wording, emotion, missing information, and escalation cases.
- Assistant must not promise refund, account unlock, data deletion, or billing changes without verification.
- Output valid JSONL only, one object per line.

Review after generation:

Drop records that obviously repeat a template.
Drop responses that are long and rambling.
Drop responses that invent policy, SLAs, or root causes.
Drop records that fail to escalate when there is a security/billing/legal risk.
Drop records that contain real PII.
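
Two of the review criteria above, rambling responses and obvious template repetition, can be pre-screened automatically so that manual review starts with the most suspicious records. The sketch below is a heuristic only and does not replace human review. The thresholds (`max_words`, `max_same_opening`, `opening_words`) are assumed values, and `flag_candidates` is a hypothetical helper.

```python
from collections import Counter


def last_assistant_text(row):
    """Return the content of the final assistant message, or an empty string."""
    for message in reversed(row["messages"]):
        if message["role"] == "assistant":
            return message["content"]
    return ""


def flag_candidates(rows, max_words=120, max_same_opening=3, opening_words=5):
    """Map record id -> list of reasons the record deserves a closer look."""
    # Count how many responses start with the same few words.
    opening_by_id = {}
    opening_counts = Counter()
    for row in rows:
        words = last_assistant_text(row).lower().split()
        opening = " ".join(words[:opening_words])
        opening_by_id[row["id"]] = opening
        opening_counts[opening] += 1

    flags = {}
    for row in rows:
        text = last_assistant_text(row)
        opening = opening_by_id[row["id"]]
        reasons = []
        if not text.strip():
            reasons.append("missing_assistant_response")
        if len(text.split()) > max_words:
            reasons.append("too_long")
        if opening and opening_counts[opening] > max_same_opening:
            reasons.append("repeated_opening")
        if reasons:
            flags[row["id"]] = reasons
    return flags
```

Invented policy, missing escalation, and real PII are not covered by this sketch and still need a reviewer or a dedicated check.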

Part 4: Validate, redact, dedup, and split

Create the directories:

mkdir -p instruction_dataset/raw instruction_dataset/processed

Save all JSONL records to:

instruction_dataset/raw/input.jsonl

Save the script from document.md as instruction_dataset/prepare_dataset.py, then run:

cd instruction_dataset
python3 prepare_dataset.py --input raw/input.jsonl --out-dir processed --dataset-name support_instruction_v1

Check the output:

wc -l processed/dataset_split.jsonl
sed -n '1,120p' processed/validation_report.md
sed -n '1,120p' processed/dataset_card.md

The dataset passes only when:

About 500 rows remain after dedup/filter.
errors: 0.
No record is marked needs_review.
Train, validation, and test splits are all present.
No group_id appears in more than one split.
The dataset card has no important placeholders left.
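
Several of these pass conditions can be checked mechanically against `processed/dataset_split.jsonl`. The sketch below makes assumptions about the output schema that are not confirmed here: each row carries a `split` field, a `group_id` field, and a `pii_status` field whose value may be `needs_review`. The tolerance for "about 500" is also an assumed value. Adjust the names to whatever the prepare script actually writes.

```python
import json
from collections import defaultdict
from pathlib import Path

REQUIRED_SPLITS = {"train", "validation", "test"}


def load_rows(path):
    """Read one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def acceptance_problems(rows, expected_rows=500, tolerance=25):
    """Return a list of human-readable problems; an empty list means pass."""
    problems = []

    # "About 500 rows": the tolerance is an assumed value, adjust as needed.
    if abs(len(rows) - expected_rows) > tolerance:
        problems.append(f"row count {len(rows)} is not within {tolerance} of {expected_rows}")

    # All three splits must be present.
    missing = REQUIRED_SPLITS - {row["split"] for row in rows}
    if missing:
        problems.append(f"missing splits: {sorted(missing)}")

    # A group_id must live in exactly one split, otherwise the eval leaks.
    splits_by_group = defaultdict(set)
    for row in rows:
        splits_by_group[row["group_id"]].add(row["split"])
    leaked = sorted(g for g, s in splits_by_group.items() if len(s) > 1)
    if leaked:
        problems.append(f"group_id in more than one split: {leaked[:5]}")

    # Assumed field: records still waiting for privacy review must not ship.
    pending = sum(1 for row in rows if row.get("pii_status") == "needs_review")
    if pending:
        problems.append(f"{pending} records still marked needs_review")

    return problems
```

A typical call would be `acceptance_problems(load_rows("processed/dataset_split.jsonl"))`. The `errors: 0` condition and the dataset card placeholders still have to be read from the script output and the card itself.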

Part 5: Create a golden set for Day 28

Select 20-50 examples from the test split as the golden set. Do not reuse the golden set to iterate on prompts or tune hyperparameters; once it guides those decisions, it no longer measures generalization.

The golden set should include:

Normal cases.
Cases with missing information.
Cases with an angry user.
Billing/security/legal cases that require escalation.
Cases with redacted PII.
Cases where the assistant must refuse safely.
Cases that require a fixed output format.
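
One way to draw a first draft of the golden set is a seeded, round-robin sample across tasks in the test split, so no single task dominates. The sketch below assumes each row has `id`, `task`, and `split` fields; `pick_golden` is a hypothetical helper. It balances by task only, so the case types listed above (angry user, missing information, fixed output format, and so on) still need a manual pass.

```python
import random
from collections import defaultdict


def pick_golden(rows, size=20, seed=42):
    """Pick up to `size` test-split rows, cycling through tasks in sorted order."""
    by_task = defaultdict(list)
    for row in rows:
        if row.get("split") == "test":
            by_task[row.get("task", "unknown")].append(row)

    rng = random.Random(seed)
    tasks = sorted(by_task)
    for task in tasks:
        # Sort first so the result does not depend on input order.
        by_task[task].sort(key=lambda row: row["id"])
        rng.shuffle(by_task[task])

    picked = []
    while len(picked) < size and any(by_task[task] for task in tasks):
        for task in tasks:
            if by_task[task] and len(picked) < size:
                picked.append(by_task[task].pop())
    return picked
```

Keeping the seed in the dataset metadata makes the selection reproducible when the golden set has to be regenerated.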

Part 6: Required questions

Answer in validation_report.md or in a separate file:

What behavior does this dataset teach, and what knowledge does it not teach?
Why did you choose Alpaca, ShareGPT, or the messages format?
How many rows come from human seed, synthetic reviewed, and production redacted sources?
What did you do to keep PII and secrets out?
Did you dedup before or after the split? Why?
Is the split grouped by group_id?
Can this dataset be used in production? If so, under what conditions?
What is the largest remaining risk?

Self-grading rubric

Category	Points
Clear, domain-appropriate guideline with a safety policy	15
Schema with complete metadata	15
500 diverse examples with reasonable quality_score	20
Cleaning/redaction/dedup/split runs successfully	20
Complete dataset card and metadata	10
Golden set well prepared for Day 28	10
Clear answers on production readiness and trade-offs	10

Total: 100 points.

Short sample answer for production readiness

The dataset can be used in production if it has train/deploy rights, contains no remaining PII or secrets, has passed schema validation, was deduplicated before the split, is split by group, has reviewed synthetic data, ships with a dataset card and metadata, and passes evaluation before and after fine-tuning. If it is only quickly generated synthetic data that has not been reviewed, or it still contains needs_review records, use it for the lab only.