Day 3: ML Fundamentals

Objectives

After Day 1, you know when to use rules, ML, RAG, or an LLM. After Day 2, you have intuition for vectors, matrices, gradients, and probability. Day 3 connects the two into a practical ML workflow: turning data into a model that can be evaluated, compared, and considered for deployment.

After this lesson, you should be able to:

Distinguish supervised, unsupervised, and reinforcement learning from an application/system point of view.
Recognize regression, classification, ranking, and clustering problems.
Design a train/validation/test split that does not cause data leakage.
Explain what cross-validation is for and when not to use it.
Explain overfitting, underfitting, and the bias-variance trade-off in production terms.
Choose a baseline model before trying complex models.
Answer the question: Can this be used in production? If so, under what conditions?

TL;DR

Machine learning builds a function from data instead of hand-writing every rule. As a Senior SE, treat a model as a dependency with a probabilistic contract: identical inputs may produce stable outputs, but quality depends on training data, features, metrics, split strategy, and the data distribution seen in production.

Production ML does not start with a fancy model. It starts with:

Defining the business objective.
Choosing the right metric.
Building a simple baseline.
Splitting data with discipline.
Comparing models through repeatable experiments.
Checking leakage, latency, drift, and fallback before deploying.

1. Mapping ML to a Senior SE Mindset

In a traditional backend:

input -> code/rules written by engineers -> output

In ML:

historical data + labels + algorithm -> training -> model artifact
new input -> model artifact -> prediction

Quick comparison:

Software engineering	Machine learning
Code logic	Model parameters learned from data
Config	Hyperparameters
Unit/integration test	Offline evaluation
Staging	Validation/test set
Runtime API	Inference service
Logging/metrics	Prediction, confidence, drift, latency
Regression bug	Quality regression or data drift

The key difference: a model is not absolutely right or wrong the way a rule is. A model has an error rate, so release decisions must be based on metrics, thresholds, and risk tolerance.

2. Types of ML Problems

2.1. Supervised Learning

Supervised learning uses labeled data:

features -> label

Examples:

Fraud detection: whether a transaction is fraudulent.
Customer churn prediction: whether a customer will leave within the next 30 days.
Lead scoring: whether a lead is likely to buy.
Ticket classification: whether a ticket belongs to billing, technical, or account.
ETA prediction: the expected delivery time.

Two common branches:

Type	Output	Examples	Common metrics
Regression	Continuous number	House price, ETA, revenue	MAE, RMSE, R2
Classification	Discrete label or probability	Churn, fraud, spam	Precision, recall, F1, ROC-AUC, PR-AUC

If the output is a probability, the problem is usually still classification. For example, the model outputs P(churn = 1) = 0.82, and a threshold then decides the action.
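To make the probability-plus-threshold idea concrete, here is a minimal sketch. The function name `decide_action`, the action labels, and the default threshold of 0.5 are illustrative assumptions, not part of any library.

```python
def decide_action(p_churn: float, threshold: float = 0.5) -> str:
    """Map a predicted churn probability to an action (hypothetical labels)."""
    if not 0.0 <= p_churn <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    # The model only supplies the probability; the threshold encodes the decision rule.
    return "send_retention_offer" if p_churn >= threshold else "no_action"


print(decide_action(0.82))                 # default threshold
print(decide_action(0.82, threshold=0.9))  # stricter threshold, same probability
```

The same probability leads to different actions under different thresholds, which is why the threshold deserves its own review.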

2.2. Unsupervised Learning

Unsupervised learning has no explicit labels. The model looks for structure or patterns in the data:

Customer segmentation.
Anomaly detection.
Topic clustering.
Embedding visualization.
Duplicate detection.

The biggest production challenge is that evaluation is harder than in supervised learning. Without labels you cannot simply say "accuracy 95%". You need a proxy metric, human review, a downstream metric, or an A/B test.

2.3. Reinforcement Learning

Reinforcement learning involves an agent, actions, an environment, and a reward:

state -> action -> reward -> policy update

In typical production applications, RL is rarely the first choice because exploration can affect real users. Before using RL you usually need offline simulation, guardrails, A/B testing, or a bandit strategy. For this course, you only need enough intuition to avoid confusing RL with supervised learning.

3. How to Identify the Problem Type

Start with questions about the output and the decision it supports:

Question	Likely problem type
Is the output a continuous number?	Regression
Is the output a discrete class?	Classification
Is the output the probability of an event?	Classification + thresholding
Is the output a sorted list?	Ranking/recommendation
Is the output a group with no predefined label?	Clustering
Is the output long text, reasoning, or synthesis?	LLM/generation
Do you need to find relevant documents and then answer?	RAG

The best solution depends on context. For example, customer support routing can use rules if there are only 5 clear categories. Once tickets become diverse and the rules grow large and hard to maintain, supervised classification is the more reasonable choice. If users ask free-form questions over a knowledge base, RAG or an LLM app is the better fit.

4. Train, Validation, Test Split

In ML, splitting data plays the role that test discipline plays in software. If the split is wrong, the metrics look good but production fails.

train set      -> model learns parameters
validation set -> select model, tune hyperparameters, tune threshold
test set       -> final evaluation before release
production     -> real data, with drift and new edge cases

4.1. The role of each set

Train set: used to fit the model.
Validation set: used to choose the model family, hyperparameters, feature set, and threshold.
Test set: used only a few times at the end to estimate quality on unseen data.

If you tune against the test set repeatedly, it has effectively become a validation set, and the test metric is no longer trustworthy.

4.2. Random split

A random split is appropriate when samples are reasonably independent and there is no meaningful timeline. One example is product image classification, provided the images have been deduplicated well.

For imbalanced classification, use a stratified split so the class ratio stays close between train and test.
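A short sketch of a stratified split with scikit-learn's `train_test_split`. The synthetic labels (90 negatives, 10 positives) are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 90 negatives and 10 positives (illustrative only).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the positive ratio close to 10% in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train positive ratio:", y_train.mean())
print("test positive ratio:", y_test.mean())
```

Without `stratify`, a small test set can end up with very few positives, or none, which makes recall and precision unstable.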

4.3. Time-based split

For time-dependent problems, production usually predicts the future from the past. The split should mimic that runtime behavior:

train:      months 01-03
validation: month 04
test:       month 05
production: month 06 onward

Churn, fraud, demand forecasting, lead scoring, and credit risk are typical cases where a time-based split should be considered. A random split can let the model see future patterns indirectly.
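The month-based layout above can be sketched with plain Python. The helper `time_based_split`, the `event_date` field, and the cut-off dates are hypothetical names chosen for this example.

```python
from datetime import date


def time_based_split(rows, train_end, valid_end):
    """Split rows by event date: train < train_end <= validation < valid_end <= test."""
    train = [r for r in rows if r["event_date"] < train_end]
    valid = [r for r in rows if train_end <= r["event_date"] < valid_end]
    test = [r for r in rows if r["event_date"] >= valid_end]
    return train, valid, test


# One made-up row per month, January to May 2024.
rows = [{"event_date": date(2024, month, 15), "label": month % 2} for month in range(1, 6)]

train, valid, test = time_based_split(
    rows, train_end=date(2024, 4, 1), valid_end=date(2024, 5, 1)
)
print(len(train), len(valid), len(test))
```

The important property is that every training row is strictly older than every validation row, and every validation row is older than every test row.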

4.4. Group-based split

If the same user, order, or session appears in many rows, the same group must not land in both train and test. Otherwise the model can memorize user-specific traits instead of learning a general pattern.

Examples:

Multiple transactions from the same customer.
Multiple events from the same device.
Multiple reviews of the same product.
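One way to enforce this in scikit-learn is `GroupShuffleSplit`, which splits by group rather than by row. The user IDs and labels below are synthetic.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Five hypothetical users, two rows each.
groups = np.array(["u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4", "u5", "u5"])
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 1] * 5)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])
print("train users:", sorted(train_groups))
print("test users:", sorted(test_groups))
```

For cross-validation with groups, `GroupKFold` follows the same idea.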

5. Data Leakage

Data leakage happens when training or evaluation uses information that production does not have at prediction time.

Examples:

Predicting churn today while using cancelled_at as a feature.
Predicting fraud before approving a transaction while using the chargeback result that arrives 30 days later.
Scaling, imputing, or selecting features on the whole dataset before splitting.
Duplicate users or sessions present in both train and test.

Leakage audit checklist:

Does this feature exist at prediction time?
Is this feature computed from future data?
Is any preprocessing fitted on the test set?
Are there duplicates or near-duplicates between train and test?
Is the label indirectly encoded in a feature?

In scikit-learn, using Pipeline reduces leakage because preprocessing such as StandardScaler is fitted only on the training folds during cross-validation and then applied to the validation/test fold.
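A minimal sketch of that pattern, using the breast cancer dataset that ships with scikit-learn. The choice of 5 folds and F1 as the metric is an assumption for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so each CV run fits it on the training folds only.
pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")
print("F1 per fold:", scores.round(4))
print("mean:", scores.mean().round(4), "std:", scores.std().round(4))
```

The leaky alternative would be calling `StandardScaler().fit_transform(X)` on the full dataset first and then cross-validating the model alone.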

6. Cross-Validation

Cross-validation splits the data into several folds, trains and evaluates multiple times, then reports the mean and standard deviation of the metric.

Example with 5 folds:

run 1: fold 1 test, fold 2-5 train
run 2: fold 2 test, fold 1,3,4,5 train
...
run 5: fold 5 test, fold 1-4 train

Use it when:

The dataset is small or medium.
You need to compare traditional models.
You want a metric that depends less on a single split.
You need to detect variance across folds.

Do not overuse it when:

The dataset is very large and training is expensive.
The problem is a time series that requires a time-ordered split.
Deep learning or LLM fine-tuning makes each run compute-heavy.
The dataset has group leakage but you use plain K-fold anyway.

Trade-off: cross-validation gives a more stable estimate, but training time grows almost linearly with the number of folds. 5-fold means roughly 5 fits.
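The fold rotation can be sketched in plain Python. The generator `k_fold_indices` is a hypothetical helper written for illustration; it does not shuffle, and in practice scikit-learn's `KFold` or `StratifiedKFold` would be used instead.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for run, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"run {run}: test={test_idx} train={train_idx}")
```

Each sample is used for testing exactly once and for training in the other runs, which is where the roughly k-times training cost comes from.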

7. Overfitting, Underfitting, Bias-Variance

7.1. Underfitting

Underfitting happens when the model is too simple or the features are too weak:

train score low
validation score low

How to address it:

Add features that carry signal.
Use a more expressive model.
Reduce regularization if it is too strong.
Check for label noise or poor problem framing.

7.2. Overfitting

Overfitting happens when the model learns the noise and exceptions of the training data:

train score high
validation score clearly lower

How to address it:

Add data.
Reduce model complexity.
Increase regularization.
Prune trees, limit depth, reduce the number of features.
Use cross-validation.
Do error analysis to tell overfitting apart from leakage.

7.3. Bias-Variance Trade-off

Situation	Train score	Validation score	Problem	How to address
High bias	Low	Low	Underfitting	Add features, stronger model
High variance	High	Low	Overfitting	Regularization, more data, lower complexity
Good fit	Reasonably good	Comparably good	OK	Error analysis, threshold tuning
Suspicious fit	Near 100%	Near 100%	Possible leakage	Audit features/split

Note: a score that looks "too good" is not always good news. On real-world data, a near-perfect F1 or ROC-AUC usually calls for a leakage audit before celebrating.
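The table can be turned into a rough triage helper. The function `diagnose_fit` and all of its cut-offs (`low`, `gap`, `suspicious`) are arbitrary assumptions for illustration; sensible values depend on the metric and the problem.

```python
def diagnose_fit(
    train_score: float,
    valid_score: float,
    low: float = 0.7,
    gap: float = 0.1,
    suspicious: float = 0.995,
) -> str:
    """Rough triage of train/validation scores (thresholds are illustrative)."""
    if train_score >= suspicious and valid_score >= suspicious:
        return "suspicious: audit features and split for leakage"
    if train_score < low and valid_score < low:
        return "high bias: likely underfitting"
    if train_score - valid_score > gap:
        return "high variance: likely overfitting"
    return "reasonable fit: continue with error analysis"


for train, valid in [(0.60, 0.58), (0.98, 0.80), (0.999, 0.998), (0.88, 0.86)]:
    print(train, valid, "->", diagnose_fit(train, valid))
```

A helper like this only flags where to look first; it does not replace error analysis or a leakage audit.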

8. Baseline-First Mindset

A baseline is the minimum bar that tells you whether a model adds value.

Common baselines:

Imbalanced classification: predict the majority class, or use DummyClassifier.
Regression: predict the mean or median.
Tabular classification: Logistic Regression.
Non-linear tabular data: Random Forest or Gradient Boosting.
Text classification: TF-IDF + Logistic Regression.
Retrieval/RAG: BM25 before embeddings/reranking.

Production rule:

Do not deploy a model that has not beaten the baseline on a metric tied to the business objective.

Consider fraud detection with 1% fraud. An accuracy of 99% may simply come from a model that always predicts "not fraud". That baseline is useless to the business because its fraud recall is 0.
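The arithmetic behind that example, as a small sketch on synthetic labels (1,000 transactions, 10 of them fraud):

```python
# Synthetic labels: 1 = fraud, 0 = not fraud (1% fraud rate).
y_true = [1] * 10 + [0] * 990
# Majority-class baseline: always predict "not fraud".
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy looks excellent while every single fraud case is missed, which is why the baseline has to be judged on the metric the business cares about.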

9. Metrics Must Be Tied to the Business

No single metric is right for every problem.

Problem	Costlier mistake	Metrics to prioritize
Fraud detection	Missing fraud	Recall, PR-AUC, cost-weighted metric
Spam detection	Blocking legitimate email	Precision for the spam class
Churn campaign	Sending offers to the wrong people	Precision/recall within the campaign budget
Medical screening	Missing a case	Recall/sensitivity, calibration
Lead scoring	Sales calling too many wrong leads	Precision@K, lift

The threshold is a business decision, not only a model decision. The model can output a probability, while the threshold depends on the cost of false positives and false negatives, operational capacity, and risk tolerance.
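A minimal sketch of choosing a threshold by cost. The helpers `total_cost` and `pick_threshold`, the candidate thresholds, the toy scores, and the cost values are all assumptions for illustration.

```python
def total_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Sum the cost of false positives and false negatives at a given threshold."""
    cost = 0.0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 0:
            cost += cost_fp
        elif predicted == 0 and label == 1:
            cost += cost_fn
    return cost


def pick_threshold(y_true, y_prob, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda th: total_cost(y_true, y_prob, th, cost_fp, cost_fn))


# Toy validation data (made up).
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95]
candidates = [0.3, 0.5, 0.7]

# Missing a positive is 10x worse than a false alarm, then the reverse.
print(pick_threshold(y_true, y_prob, candidates, cost_fp=1, cost_fn=10))
print(pick_threshold(y_true, y_prob, candidates, cost_fp=10, cost_fn=1))
```

The same model and the same scores yield different thresholds once the cost ratio changes. The threshold should be chosen on validation data, not on the test set.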

10. A Near-Production Experiment Workflow

A minimal experiment should include:

Problem statement: what is predicted, at what point in time, and which decision it supports.
Dataset version or a clearly identified source.
Split strategy: random, stratified, time-based, or group-based.
Baseline: dummy/majority/mean.
Candidate models: at least one simple model and one stronger model.
Metrics: business metric + technical metric.
Reproducibility: random_state, dependency versions, params.
Latency/fit time: enough to reason about production.
Error analysis: which segments the model gets wrong.
Release decision: deploy, do not deploy, or collect more data.

11. Production Concerns

Train-serving skew

The training pipeline and the inference pipeline must process features the same way. If the notebook encodes categories one way and the service encodes them another way, offline metrics become meaningless.

Solution: package preprocessing and the model in the same Pipeline, or in the same versioned feature pipeline.

Reproducibility

Store:

Code version.
Dataset snapshot.
Feature definitions.
Model params/hyperparams.
Metrics.
Random seed.
Artifact version.

Monitoring

After deploying, track:

Latency p50/p95/p99.
Service error rate.
Prediction distribution.
Feature distribution.
Confidence/probability distribution.
Downstream business KPIs.
Data drift and concept drift.

Fallback

A model service needs explicit fallbacks:

If the model times out, do we use the old rule?
If a feature is missing, do we reject the request or degrade?
If confidence is low, do we route to human review?
If drift is strong, which model version do we roll back to?
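The first three questions can be sketched as one decision function. Everything here is hypothetical: the function `predict_with_fallback`, the result labels, the confidence band (0.4 to 0.6), and the choice to degrade to a rule when features are missing are assumptions, not a prescribed design.

```python
def predict_with_fallback(features, model_predict, rule_predict, required, low=0.4, high=0.6):
    """Return a decision plus the path that produced it (all names are illustrative)."""
    missing = [name for name in required if features.get(name) is None]
    if missing:
        # Assumption: degrade to the rule instead of rejecting the request.
        return {"decision": rule_predict(features), "source": "rule_missing_features"}
    try:
        probability = model_predict(features)
    except TimeoutError:
        return {"decision": rule_predict(features), "source": "rule_timeout"}
    if low < probability < high:
        # The model is unsure, so defer to a person.
        return {"decision": "human_review", "source": "low_confidence"}
    decision = "positive" if probability >= high else "negative"
    return {"decision": decision, "source": "model"}


def legacy_rule(features):
    """Stand-in for an existing rule-based decision."""
    return "negative"


def slow_model(features):
    """Stand-in for a model call that exceeds its time budget."""
    raise TimeoutError("model exceeded its time budget")


request = {"amount": 120.0, "country": "VN"}
print(predict_with_fallback(request, lambda f: 0.9, legacy_rule, ["amount", "country"]))
print(predict_with_fallback(request, slow_model, legacy_rule, ["amount", "country"]))
print(predict_with_fallback({"amount": 120.0}, lambda f: 0.9, legacy_rule, ["amount", "country"]))
```

Logging the `source` field makes it possible to monitor how often each fallback path is taken.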

12. Can This Be Used in Production?

Yes, but not with just a notebook and one good-looking metric.

Day 3 is a sufficient foundation for production if these conditions hold:

Clear problem framing: what the model predicts, at what point in time, and which decision it supports.
The split strategy mimics production, especially for time-dependent data.
There is a baseline, and the candidate model beats it on the business metric.
Preprocessing lives in a Pipeline or a versioned feature pipeline to avoid train-serving skew.
The test set is independent and is not used for repeated tuning.
A data leakage audit has been done.
There is a latency/memory estimate for the inference path.
Logging, monitoring, rollback, and fallback are in place.
There is a retraining process, or at least a periodic drift review.

If these conditions are missing, the material is still useful for prototypes and offline analysis, but it is not enough to deploy to production with real users.

13. Self-Check Checklist

14. Connecting to Day 4

Day 4 covers the Python ML stack: NumPy, Pandas, scikit-learn, notebook workflow, and how to implement a basic pipeline. Day 3 gives you the "evaluation discipline"; Day 4 gives you the tools to apply that discipline in code.

Before moving on to Day 4, complete exercise.md. The exercise uses scikit-learn to compare a baseline, Logistic Regression, Random Forest, and HistGradientBoosting with explicit splits, metrics, and timing.

Reference

Use this reference as a checklist when choosing a model for a foundational ML problem. You do not need to study the formulas in depth today; the goal is to understand which model to try first, what the trade-offs are, and where the production concerns lie.

1. Model Selection Principles

Practical order of priority:

Start with a simple baseline.
Choose the metric according to business cost.
Choose the simplest model that beats the baseline by a sufficient margin.
Increase complexity only when the metric or a requirement demands it.
Account for latency, explainability, operating cost, and debuggability.

There is no absolute "best model". The best solution depends on data size, feature types, latency budget, explainability, team skill, and the cost of mistakes.

2. Models for Regression

Linear Regression

Idea: learn an approximately linear relationship between features and target.

Use it when:

You need a quick baseline.
You need explainability.
Features are already well processed.
The relationship between input and output is fairly linear.

Avoid it when:

The pattern is strongly non-linear.
There are complex interactions.
Outliers have a large effect and have not been handled.

Production note:

Inference is very fast.
Easy to monitor because the behavior is simple.
Check feature scaling, outliers, and drift.

Tree-based Regression

Decision Tree, Random Forest, and Gradient Boosting can learn non-linear patterns better than a linear model.

Trade-off:

Good for tabular data.
Little need for scaling.
Prone to overfitting if trees are too deep.
Larger artifact and higher latency than a linear model.

3. Models for Classification

DummyClassifier

A baseline that learns no real signal, for example always predicting the most frequent class.

Purpose:

Measure whether the real model beats "systematic dumb guessing".
Expose misleading metrics, especially under class imbalance.

If a complex model does not beat DummyClassifier on the business metric, it should not be deployed yet.

Logistic Regression

Logistic Regression is a strong classification baseline, especially with clean tabular features or sparse text features such as TF-IDF.

Use it when:

You need a fast, stable baseline.
You need probabilities that are relatively easy to calibrate.
You need better explainability than a tree ensemble.
You need low latency.
The dataset is not too non-linear.

Avoid it when:

Non-linear patterns or interactions are very strong.
Feature engineering is insufficient.
The boundary between classes is complex.

Production note:

Often a good v1 choice for high-traffic APIs.
Numerical features need scaling in many cases.
Regularization helps reduce overfitting.
Decision Tree

A Decision Tree splits data with if/else style rules.

Use it when:

You need a visual, intuitive explanation.
You want to understand the initial feature splits.
The dataset is small or medium.

Avoid it when:

You need strong generalization.
The data is noisy.
Depth and min samples are not controlled.

Production note:

A single tree overfits easily.
A shallow tree can work well as a rule engine learned from data.

Random Forest

Random Forest trains many decision trees and ensembles their results.

Use it when:

Tabular data has non-linear patterns.
You want a stronger model than Logistic Regression with less tuning than boosting.
The dataset is medium-sized and latency is not too tight.

Avoid it when:

Inference latency must be extremely low.
Model artifact size is constrained.
Very high explainability is required.

Production note:

Cost grows with n_estimators * depth.
Set random_state for reproducibility.
n_jobs=-1 can be used for offline training, but keep CI/job runner resources under control.

Gradient Boosting, HistGradientBoosting, XGBoost, LightGBM

Boosting trains models sequentially, with each model correcting the errors of the previous one. On tabular data, boosting is often very strong.

Use it when:

You need high quality on tabular data.
You have enough validation discipline.
You have time to tune hyperparameters.

Avoid it when:

The team has no clear baseline or evaluation yet.
Data is scarce and noisy.
Latency or operational complexity does not fit.

Production note:

Prone to overfitting if you tune against validation too many times.
Version the params and the dataset.
Monitor calibration if the output probability drives cost-sensitive decisions.

SVM

SVM is useful for small or medium datasets, especially with high-dimensional sparse features.

Trade-off:

Can be strong on sparse text features.
Kernel SVM scales poorly to large data.
Probability output is less natural than with Logistic Regression.

KNN

KNN predicts based on the nearest points.

Use it when:

Prototyping similarity.
The dataset is small.
You want an intuitive baseline.

Avoid it when:

You need real-time inference on a large dataset.
Feature dimensionality is high and there is no good index.

Production note:

Naive inference is roughly O(n) in the number of training samples.
At large scale, move to an ANN/vector index or a different model.
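A brute-force sketch shows where the per-query cost comes from: every prediction computes a distance to every stored training point. The helper `knn_predict` and the toy points are assumptions for illustration; scikit-learn's `KNeighborsClassifier` would be the usual choice in practice.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Brute-force KNN: one distance computation per training point, per query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]


train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (0.2, 0.1)))
print(knn_predict(train_points, train_labels, (5.5, 5.5)))
```

There is no training step to speak of; the whole dataset is the model, so both memory and query time grow with the number of stored samples.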

4. Model Selection by Context

Context	Try first	Why
Tabular classification v1	DummyClassifier + Logistic Regression	Clear baseline, fast, easy to debug
Tabular non-linear	Random Forest or HistGradientBoosting	Stronger than linear, suits tabular data
Short text classification	TF-IDF + Logistic Regression	Strong baseline before Transformers
Imbalanced fraud/churn	Logistic Regression/boosting + PR-AUC/recall	Accuracy is easily misleading
Extremely low latency	Logistic Regression	Lightweight inference
High explainability	Logistic Regression, shallow tree	Easier to explain than a large ensemble
Very little data	Rule or simple model	Complex models overfit easily
Language generation	LLM	An ML classifier does not generate free text
Internal document search	BM25/RAG	Classification does not solve retrieval

5. Split Strategy Reference

Situation	Recommended split	Reason
Reasonably IID data	Random split + stratify for classification	Simple, fast
Class imbalance	Stratified split	Preserves class ratio
Data with a timeline	Time-based split	Mimics production
Many rows per user/session	Group split	Keeps the same entity out of both train and test
Small dataset	Cross-validation	More stable metric
Very expensive training	Holdout validation	Less compute

6. Metric Reference

Classification

Metric	Meaning	When to use
Accuracy	Share of correct predictions	Balanced classes, similar error costs
Precision	Of the positive predictions, how many are correct	False positives are costly
Recall	Of the actual positives, how many are caught	False negatives are costly
F1	Harmonic mean of precision and recall	Both sides need balancing
ROC-AUC	Ability to rank positives above negatives	General binary classification
PR-AUC	Precision/recall across many thresholds	Class imbalance
Log loss	Penalizes wrong and overconfident probabilities	Probability quality matters

Regression

Metric	Meaning	When to use
MAE	Mean absolute error	Easy to interpret, less sensitive to outliers than RMSE
RMSE	Penalizes large errors more heavily	Large errors are very costly
R2	Share of variance explained	Quick comparison, not sufficient on its own
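A small numeric sketch of the MAE versus RMSE difference, using made-up predictions:

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [10, 10, 10, 10]
small_errors = [11, 9, 11, 9]       # every prediction is off by 1
one_large_error = [11, 9, 11, 20]   # one prediction is off by 10

print("small errors:    MAE =", mae(y_true, small_errors), "RMSE =", rmse(y_true, small_errors))
print("one large error: MAE =", mae(y_true, one_large_error), "RMSE =", round(rmse(y_true, one_large_error), 3))
```

When all errors have the same size, the two metrics agree. A single large miss pulls RMSE up much more than MAE, which is the behavior you want when large errors are disproportionately expensive.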

7. Production Checklist for Model Selection

What is the baseline and what does it score?
On which metric does the candidate model beat the baseline?
Is that metric tied to business cost?
Does the split reflect the production data flow?
Is there data leakage?
How large is the model artifact?
What is the expected p95/p99 inference latency?
Is feature computation more expensive than model inference?
Is there a fallback when the model times out or confidence is low?
Is there monitoring for prediction distribution and drift?
Is there a retraining or rollback plan?

8. Anti-patterns

Choosing a model because it "sounds powerful" before having a baseline.
Using accuracy on imbalanced fraud/churn data and concluding the model is good.
Scaling, encoding, or selecting features before splitting.
Tuning on the test set repeatedly.
Using a random split for a time-dependent problem.
Deploying notebook logic that differs from the inference service.
Not storing dataset/code/model versions.
Not measuring latency before putting the model behind an API.

9. Conclusion

Early in an AI Engineer's career, the important skill is not memorizing every algorithm. It is choosing a model that is simple enough, evaluating it correctly, avoiding leakage, and knowing when complexity adds real value. The best model in production is usually the simplest one that meets the business requirement at an acceptable operating cost.

Exercise

Practice objectives

This exercise has you run an ML experiment that is closer to production than a toy example:

A baseline using DummyClassifier.
A reproducible split with random_state and stratify.
A Pipeline to avoid preprocessing leakage.
Multiple metrics: accuracy, precision, recall, F1, ROC-AUC.
Measured training time and prediction time.
Cross-validation to check whether metrics are stable.
A production decision question at the end.

The dataset is load_breast_cancer because it ships with scikit-learn, runs quickly, and is sufficient for practicing binary classification. In a real project, replace it with a churn, fraud, or internal ticket dataset but keep the same discipline.

1. Setup

python3 -m venv .venv
source .venv/bin/activate
pip install numpy pandas scikit-learn

Check the version:

python3 -c "import sklearn; print(sklearn.__version__)"

2. Experiment Script

Create a scratch file, for example day03_experiment.py, and run the script below. The script does not save an artifact, to keep the exercise compact; in production you would save the model in a suitable format along with metadata about the dataset, code, params, and metrics.

from __future__ import annotations

import time
from dataclasses import dataclass

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


RANDOM_STATE = 42
TEST_SIZE = 0.2


@dataclass(frozen=True)
class ModelSpec:
    name: str
    estimator: object
    has_probability: bool = True


def build_models() -> list[ModelSpec]:
    return [
        ModelSpec(
            name="dummy_majority",
            estimator=DummyClassifier(strategy="most_frequent"),
            has_probability=True,
        ),
        ModelSpec(
            name="logistic_regression",
            estimator=Pipeline(
                steps=[
                    ("scaler", StandardScaler()),
                    (
                        "model",
                        LogisticRegression(
                            max_iter=1000,
                            class_weight=None,
                            random_state=RANDOM_STATE,
                        ),
                    ),
                ]
            ),
        ),
        ModelSpec(
            name="random_forest",
            estimator=RandomForestClassifier(
                n_estimators=300,
                max_depth=6,
                min_samples_leaf=3,
                random_state=RANDOM_STATE,
                n_jobs=-1,
            ),
        ),
        ModelSpec(
            name="hist_gradient_boosting",
            estimator=HistGradientBoostingClassifier(
                max_iter=200,
                learning_rate=0.05,
                max_leaf_nodes=15,
                l2_regularization=0.1,
                random_state=RANDOM_STATE,
            ),
        ),
    ]


def predict_score(estimator, X_test: pd.DataFrame) -> np.ndarray:
    if hasattr(estimator, "predict_proba"):
        return estimator.predict_proba(X_test)[:, 1]
    if hasattr(estimator, "decision_function"):
        return estimator.decision_function(X_test)
    raise ValueError("Estimator must expose predict_proba or decision_function.")


def evaluate_holdout(
    name: str,
    estimator,
    X_train: pd.DataFrame,
    X_test: pd.DataFrame,
    y_train: pd.Series,
    y_test: pd.Series,
) -> dict[str, float | str]:
    train_start = time.perf_counter()
    estimator.fit(X_train, y_train)
    train_ms = (time.perf_counter() - train_start) * 1000

    predict_start = time.perf_counter()
    y_pred = estimator.predict(X_test)
    predict_ms = (time.perf_counter() - predict_start) * 1000

    y_score = predict_score(estimator, X_test)

    return {
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_score),
        "train_ms": train_ms,
        "predict_ms_per_1000_rows": predict_ms / len(X_test) * 1000,
    }


def evaluate_cross_validation(name: str, estimator, X: pd.DataFrame, y: pd.Series) -> dict[str, float | str]:
    scoring = {
        "accuracy": "accuracy",
        "precision": "precision",
        "recall": "recall",
        "f1": "f1",
        "roc_auc": "roc_auc",
    }
    result = cross_validate(
        estimator,
        X,
        y,
        cv=5,
        scoring=scoring,
        return_train_score=True,
        n_jobs=None,
    )

    row: dict[str, float | str] = {"model": name}
    for metric in scoring:
        test_values = result[f"test_{metric}"]
        train_values = result[f"train_{metric}"]
        row[f"cv_test_{metric}_mean"] = float(np.mean(test_values))
        row[f"cv_test_{metric}_std"] = float(np.std(test_values))
        row[f"cv_train_{metric}_mean"] = float(np.mean(train_values))
    row["cv_fit_time_ms_mean"] = float(np.mean(result["fit_time"]) * 1000)
    return row


def main() -> None:
    dataset = load_breast_cancer(as_frame=True)
    X = dataset.data
    y = dataset.target

    print("Dataset shape:", X.shape)
    print("Class distribution:")
    print(y.value_counts(normalize=True).sort_index().rename("ratio"))

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=TEST_SIZE,
        random_state=RANDOM_STATE,
        stratify=y,
    )

    holdout_rows = []
    cv_rows = []
    for spec in build_models():
        holdout_rows.append(
            evaluate_holdout(spec.name, spec.estimator, X_train, X_test, y_train, y_test)
        )
        cv_rows.append(evaluate_cross_validation(spec.name, spec.estimator, X, y))

    holdout_df = pd.DataFrame(holdout_rows).sort_values("f1", ascending=False)
    cv_df = pd.DataFrame(cv_rows).sort_values("cv_test_f1_mean", ascending=False)

    print("\nHoldout metrics:")
    print(holdout_df.round(4).to_string(index=False))

    print("\nCross-validation metrics:")
    selected_columns = [
        "model",
        "cv_test_f1_mean",
        "cv_test_f1_std",
        "cv_train_f1_mean",
        "cv_test_roc_auc_mean",
        "cv_fit_time_ms_mean",
    ]
    print(cv_df[selected_columns].round(4).to_string(index=False))


if __name__ == "__main__":
    main()

3. Why Is This Script Closer to Production?

A clear baseline

DummyClassifier(strategy="most_frequent") shows what the metrics look like if you only guess the most frequent class. A candidate model must beat this baseline on the metric that matters.

A disciplined split

train_test_split(..., stratify=y, random_state=RANDOM_STATE) helps to:

Preserve the class ratio between train and test.
Make results reproducible.
Avoid reaching a different conclusion on every run.

For problems with a real timeline, replace the random split with a time-based split.

Pipeline reduces leakage

Logistic Regression uses:

Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(...)),
    ]
)

The scaler is fitted together with the model on the training fold only, never on the whole dataset beforehand. This pattern is important for avoiding preprocessing leakage.

Metrics beyond accuracy

The script prints accuracy, precision, recall, F1, and ROC-AUC. Under class imbalance, accuracy is often misleading. In production, choose the metric according to cost:

False negatives are costly: prioritize recall.
False positives are costly: prioritize precision.
You need to rank risky cases: look at ROC-AUC/PR-AUC.
You need to act within a budget: look at precision@K or recall@K.
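Precision@K is not printed by the script, so here is a minimal sketch of how it could be computed. The helper `precision_at_k` and the toy scores are assumptions for illustration.

```python
def precision_at_k(y_true, y_score, k):
    """Share of true positives among the k highest-scored cases."""
    if k <= 0 or k > len(y_true):
        raise ValueError("k must be between 1 and the number of samples")
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    return sum(label for _, label in top_k) / k


# Toy data: the budget allows acting on only the top 3 cases.
y_true = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

print(precision_at_k(y_true, y_score, k=3))
```

Here K stands for operational capacity, for example how many leads the sales team can call per day.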

Timing is measured

train_ms and predict_ms_per_1000_rows do not replace a real load test, but they help you start thinking about performance. A model that is 0.5% better on F1 but 20 times slower at inference is not necessarily the best solution.

4. Tasks

Task 1: Read the baseline results

Record:

What accuracy does dummy_majority reach?
Do precision, recall, and F1 tell the same story?
If you only looked at accuracy, would you be misled?

Task 2: Compare models

Build a decision table:

Model	F1	Recall	ROC-AUC	Predict time	Notes
dummy_majority
logistic_regression
random_forest
hist_gradient_boosting

Answer:

Which model is best by F1?
Which model is best by recall?
Which model has the best latency/quality trade-off?
If the API needs a low p99, would you pick the model with the top metric?

Task 3: Compare train vs validation in cross-validation

Compare:

cv_train_f1_mean vs cv_test_f1_mean

If train is much higher than test, the model may be overfitting. If both are low, the model may be underfitting or the features may lack signal.

Task 4: Change the constraints

Re-run with:

RandomForestClassifier(max_depth=None).
RandomForestClassifier(max_depth=3).
HistGradientBoostingClassifier(max_iter=50).
LogisticRegression(C=0.1) and C=10.

Record which model overfits more, which trains faster, and which is more stable across folds.

Task 5: Production decision

Write a short decision memo:

Recommendation:
- Deploy / do not deploy / need more data.

Reason:
- Baseline:
- Best candidate:
- Metric chosen:
- Latency concern:
- Leakage risk:
- Monitoring needed:
- Fallback:

5. Required Question

Can this be used in production? If so, under what conditions?

This script is not ready for production as is, but its workflow can serve as a production foundation if you add:

A real dataset with versioning and a data contract.
A split strategy that matches production, for example time-based for churn/fraud.
A feature pipeline shared between training and inference.
An independent test set that is not tuned against repeatedly.
A threshold chosen by business cost, not the default 0.5.
A model artifact versioned together with params, code hash, and metrics.
A load test for the inference service, not just timing inside the script.
Monitoring for drift, prediction distribution, latency, and downstream KPIs.
Fallback/rollback when the model fails or quality degrades.

6. Self-Check Questions

Why should you not fit StandardScaler on the whole dataset before splitting?
Why is stratify=y useful in classification?
When is a random split not trustworthy?
Why can a majority-class baseline have high accuracy and still be useless?
What does cross-validation offer compared with a single holdout split?
If model A has an F1 that is 0.01 higher than model B but 10 times the latency, which do you choose?
If the test score is near 100%, what do you check first?
In fraud detection, how do false positives and false negatives affect the business differently?

7. Deliverable

After finishing, you should have:

The metrics output from the script.
A model comparison table.
A short decision memo.
A list of risks to address before production.

This is a small artifact, but it reflects the spirit of the course: just enough concepts, practical hands-on work, clear trade-offs, and a concrete production decision.