Day 6: Model Evaluation Metrics

Objectives

Day 3 covered train/validation/test splits and overfitting. Days 4-5 introduced the scikit-learn stack and feature engineering. Day 6 answers the next question: once a model is trained, how do you evaluate it so that you do not pick a model that looks good in a notebook but causes damage in production?

By the end of this lesson, you should be able to:

Explain accuracy, precision, recall, F1, ROC-AUC, PR-AUC/average precision and the confusion matrix in business language.
Choose the right metric for imbalanced classification, especially fraud detection.
Understand and correctly use regression metrics: MAE, MSE, RMSE, MAPE.
Understand ranking metrics: MRR, NDCG, Recall@k, and why they matter for recommendation/search/RAG.
Separate ML metrics from business metrics, then connect them through cost, profit, SLA, capacity and risk.
Design an evaluation workflow close to production: fixed test set, threshold sweep, segment metrics, monitoring, drift and calibration.
Answer clearly: "Can this be used in production? If so, under what conditions?"

TL;DR

Evaluation metrics are the test suite of an ML system, but unlike unit tests, the output is usually a probability and the decision depends on business context. There is no "best metric" for every problem. Accuracy is only trustworthy when classes are reasonably balanced and the costs of the different error types are similar. When the positive class is rare, as in fraud, PR-AUC, recall and precision at a specific threshold, and the confusion matrix usually matter more than accuracy.

Best default when evaluating a tabular classification model:

Define the positive class and the action taken after prediction
-> look at the class distribution
-> train a simple baseline
-> report ROC-AUC and PR-AUC on the scores
-> sweep the threshold
-> choose the threshold by cost/capacity/guardrail
-> check the confusion matrix per segment
-> log metrics offline, and online after deploy

1. A Metric Is More Than a Formula

A metric is how you turn a vague goal like "a better model" into a decision criterion. For production ML, a metric must be able to answer:

Which kinds of errors does the model make?
What damage does each kind of error cause?
Who, or which system, will act on the prediction?
Does that action have capacity, latency or compliance constraints?
Does a model that looks good offline actually improve the business KPI online?

Fraud detection example:

Question	Technical meaning	Business meaning
What is the positive class?	`fraud = 1`	Transaction that should be blocked or reviewed
What does the model return?	Probability/score	Degree of fraud suspicion
What is the action?	Threshold decision	Allow, manual review, or block
What is a false positive?	Legit transaction gets flagged	Real customer is inconvenienced, conversion drops
What is a false negative?	Fraud slips through	Lost money, chargebacks, compliance risk
What is capacity?	Number of alerts that can be handled	Analysts can only review N cases/day

If you have not defined the action and the cost, you have not really chosen a metric.

2. Confusion Matrix

The confusion matrix is the best starting point for classification because it shows the types of correct and incorrect predictions directly.

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (`TP`)	False Negative (`FN`)
Actual Negative	False Positive (`FP`)	True Negative (`TN`)

In fraud detection:

TP: fraud that is detected.
FP: a legitimate transaction that is flagged.
FN: fraud that slips through.
TN: a legitimate transaction that is allowed.

Total:

total = TP + FP + FN + TN

Key point: the confusion matrix depends on the threshold. The same model probabilities can produce many different confusion matrices as the threshold changes.
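To make the threshold dependence concrete, here is a minimal sketch in plain Python. The labels and scores are hypothetical values chosen for illustration, and `confusion_counts` is a helper name introduced here, not part of any library.

```python
def confusion_counts(y_true, y_score, threshold):
    """Return (tp, fp, fn, tn) when scores >= threshold are flagged positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical labels and model scores, for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.92, 0.15, 0.55, 0.62, 0.05, 0.30, 0.48, 0.71]

low = confusion_counts(y_true, y_score, threshold=0.25)
high = confusion_counts(y_true, y_score, threshold=0.60)
print("threshold 0.25 ->", low)   # (3, 3, 0, 2): every fraud caught, 3 false alarms
print("threshold 0.60 ->", high)  # (1, 2, 2, 3): fewer alerts, 2 frauds missed
```

The scores never change between the two calls; only the cut-off does, which is why a confusion matrix should always be reported together with its threshold.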

3. Accuracy and the Imbalanced Dataset Trap

Accuracy measures the fraction of correct predictions:

accuracy = (TP + TN) / (TP + FP + FN + TN)

Accuracy is useful when:

Classes are reasonably balanced.
The costs of FP and FN are similar.
You do not care only about one rare class.
The test dataset reflects production traffic.

Accuracy becomes harmful when the positive class is rare. If the fraud rate is 1%, a model that always predicts "not fraud" still reaches 99% accuracy but catches 0 fraud:

TP = 0
FP = 0
FN = all fraud
TN = all legitimate transactions
accuracy is very high, business value is close to 0
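The arithmetic behind this trap can be checked in a few lines. This sketch assumes 100,000 transactions at a 1% fraud rate; the volume is an illustrative choice.

```python
n_transactions = 100_000
fraud_rate = 0.01

n_fraud = round(n_transactions * fraud_rate)  # 1,000 fraudulent transactions
n_legit = n_transactions - n_fraud            # 99,000 legitimate transactions

# A model that always predicts "not fraud" never raises an alert.
tp, fp = 0, 0
fn, tn = n_fraud, n_legit

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2%}")  # 99.00%
print(f"recall   = {recall:.2%}")    # 0.00%
```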

A mental model for senior software engineers: accuracy is like a system-wide aggregate uptime. The overall number can look good while the most important endpoint is still failing.

4. Precision, Recall and F1

Precision

Precision answers: of the cases the model flags as positive, how many are truly positive?

precision = TP / (TP + FP)

Prioritize precision when the positive action is expensive or harmful:

Automatically blocking a payment.
Banning an account.
Sending a security alert to a customer.
Creating a ticket for an operations team with low capacity.

Trade-off: raising precision usually lowers recall. The model becomes more cautious and raises fewer false alarms, but misses more positives.

Recall

Recall answers: of the true positives, how many does the model catch?

recall = TP / (TP + FN)

Prioritize recall when a miss is dangerous:

Fraud that loses real money.
Medical triage.
Security incidents.
PII leaks.
Serious abuse/spam.

Trade-off: raising recall usually lowers precision. The model flags more cases and catches more positives, but creates more false positives and more workload.

F1

F1 is the harmonic mean of precision and recall:

F1 = 2 * precision * recall / (precision + recall)

F1 is useful when:

You need a single summary number to compare many models quickly.
Precision and recall matter roughly equally.
You do not yet have a reliable cost model.

F1 is not enough when costs are strongly skewed. If a missed fraud (FN) costs 500 USD but a false positive costs only 5 USD of review, choosing the threshold by F1 may not be optimal for the business.
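A small sketch of how F1 and cost can disagree. The two candidate operating points below are hypothetical confusion counts invented for illustration; the 500 USD and 5 USD costs are the ones quoted above.

```python
FN_COST_USD = 500.0  # a missed fraud
FP_COST_USD = 5.0    # reviewing a false alarm


def f1_from_counts(tp, fp, fn):
    """F1 computed from raw counts, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def error_cost(fp, fn):
    """Total cost of the mistakes made at one operating point."""
    return fn * FN_COST_USD + fp * FP_COST_USD


# Hypothetical operating points: A is strict, B is permissive.
candidates = {
    "A": {"tp": 60, "fp": 20, "fn": 40},   # precision 0.75, recall 0.60
    "B": {"tp": 90, "fp": 210, "fn": 10},  # precision 0.30, recall 0.90
}

best_by_f1 = max(candidates, key=lambda name: f1_from_counts(**candidates[name]))
best_by_cost = min(
    candidates,
    key=lambda name: error_cost(candidates[name]["fp"], candidates[name]["fn"]),
)

for name, c in candidates.items():
    print(name, round(f1_from_counts(**c), 3), error_cost(c["fp"], c["fn"]))
print("best by F1:", best_by_f1, "| best by cost:", best_by_cost)
```

Under these assumed counts, F1 prefers the strict point A while the cost model prefers the permissive point B, because each missed fraud is 100 times more expensive than a false alarm.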

5. ROC-AUC, PR-AUC and Average Precision

Many models return a score/probability rather than a hard label. In that case you need metrics that evaluate ranking ability before a threshold is chosen.

ROC-AUC

The ROC curve plots the relationship between:

TPR = recall = TP / (TP + FN)
FPR = FP / (FP + TN)

ROC-AUC measures the probability that the model ranks a random positive higher than a random negative. Values closer to 1 are better; 0.5 is equivalent to random.

Advantages:

Does not depend on one fixed threshold.
Good for comparing overall ranking quality between models.
Less affected by a temporarily poor threshold choice.

Disadvantages:

Can look overly optimistic on imbalanced datasets.
A small FPR looks good, but because the number of negatives is very large it can still produce a lot of false positives.
Does not say directly which threshold is usable in production.
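The pairwise interpretation of ROC-AUC can be written out literally. This is an illustrative O(P x N) sketch on hypothetical data, useful for intuition only; `roc_auc_pairwise` is a name introduced here, and in practice you would call `sklearn.metrics.roc_auc_score`.

```python
def roc_auc_pairwise(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores, for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.92, 0.15, 0.55, 0.62, 0.05, 0.30, 0.48, 0.71]

auc = roc_auc_pairwise(y_true, y_score)
print(f"ROC-AUC = {auc:.4f}")  # 10 of 15 pairs are ordered correctly
```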

PR-AUC and Average Precision

The precision-recall curve plots the relationship between precision and recall as the threshold changes. PR-AUC focuses on the positive class. In scikit-learn, the commonly used metric is average_precision_score, which summarizes the precision-recall curve.

PR-AUC is more useful than ROC-AUC when the positive class is rare:

Fraud.
Low churn rate.
Rare disease.
Anomaly detection.
Phishing/spam that is rare but important.

The intuitive baseline for PR-AUC is close to the positive rate. If the fraud rate is 1%, a random model has an average precision of about 0.01. A model with PR-AUC 0.25 may already be far better than random, even though the number does not sound as "high" as ROC-AUC 0.95.
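The sketch below computes average precision by hand and then checks the "random baseline is near the positive rate" intuition on simulated data. It assumes scores have no ties, and both datasets are made up for illustration; `average_precision` is a helper written here, not the scikit-learn function.

```python
import random


def average_precision(y_true, y_score):
    """Mean of precision@rank over the ranks where a positive appears.

    Assumes scores have no ties; tied scores would need grouping by threshold.
    """
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


# Hypothetical small example: positives land at ranks 1, 4 and 6.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.92, 0.15, 0.55, 0.62, 0.05, 0.30, 0.48, 0.71]
ap_small = average_precision(y_true, y_score)
print(f"AP (small example) = {ap_small:.4f}")  # (1/1 + 2/4 + 3/6) / 3

# Simulated random scorer on 1% positives: AP should stay near the positive rate.
rng = random.Random(0)
n_samples = 20_000
labels = [1 if i < 200 else 0 for i in range(n_samples)]
random_scores = [rng.random() for _ in range(n_samples)]
ap_random = average_precision(labels, random_scores)
print(f"AP (random scores, 1% positives) = {ap_random:.4f}")
```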

Practical rule:

Context	Metric to look at first	Reason
Balanced classes, roughly equal costs	Accuracy, ROC-AUC, F1	The aggregate is not too misleading
Rare positives	PR-AUC, recall/precision at threshold	Focuses on the class you need to catch
Positive action is very expensive	Precision at threshold, FP count	Avoid too many false alarms
A miss is very expensive	Recall at threshold, FN cost	Avoid letting dangerous cases through
Fixed capacity	Alerts per day, precision, cost	The metric must match operations

6. The Threshold Is a Business Decision

A model usually returns a probability:

p(fraud) = 0.73

You need a threshold to turn the score into a decision:

if p(fraud) >= threshold:
    flag_for_review()
else:
    approve()

The default of 0.5 is rarely optimal. The threshold should be chosen using one of these strategies:

Maximize F1 when precision and recall matter roughly equally.
Reach recall >= target, then pick the best precision/cost.
Reach precision >= target, then pick the best recall.
Maximize expected profit or minimize expected cost.
Keep the number of alerts under operational capacity.
Use multiple thresholds for multiple actions: allow, review, block.

Example of three decision zones in fraud:

Score	Action	Reason
`< 0.20`	Allow	Low risk
`>= 0.20 and < 0.85`	Manual review	Needs a human check
`>= 0.85`	Auto block	Precision is high enough to block automatically

This is usually a better design than a single threshold because the cost of a manual review differs from the cost of an auto block.
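The three-zone policy can be sketched as a small function. The cut-offs 0.20 and 0.85 are the illustrative values from the table, not recommendations, and the action names are hypothetical labels.

```python
REVIEW_FROM = 0.20  # illustrative lower cut-off from the table above
BLOCK_FROM = 0.85   # illustrative upper cut-off from the table above


def decide(score: float) -> str:
    """Map a fraud score to one of three actions using two thresholds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= BLOCK_FROM:
        return "auto_block"
    if score >= REVIEW_FROM:
        return "manual_review"
    return "allow"


for s in (0.05, 0.20, 0.73, 0.85):
    print(s, "->", decide(s))
```

Keeping the two cut-offs as named constants makes it easy to tune them independently, since each one is tied to a different cost (review effort versus wrongly blocking a customer).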

7. Regression Metrics: MAE, MSE, RMSE, MAPE

Regression predicts continuous values: house price, demand, latency, revenue, delivery time. Classification metrics do not apply directly.

Metric	Intuitive formula	When to use	Trade-off
`MAE`	Mean of `abs(y_true - y_pred)`	Needs to be easy to explain; outliers should not dominate	Does not penalize large errors as strongly as RMSE
`MSE`	Mean of squared errors	Common training objective, penalizes large errors	Units are squared, hard to interpret
`RMSE`	Square root of MSE	You want to penalize large errors but stay in the target's units	Sensitive to outliers
`MAPE`	Mean percentage error	Forecasting that must be explained in %	Very dangerous when `y_true` is close to 0

Examples of choosing a metric:

Predicting delivery time: MAE makes it easy to say "off by 4.2 minutes on average".
Predicting warehouse demand: RMSE if large errors cause serious stockouts.
Predicting revenue per store: MAPE is useful if every store's revenue is far enough from 0; if some stores have revenue close to 0, use MAE, SMAPE or a custom metric.

Do not look at a single metric. For production regression, report at least:

MAE + RMSE + percentile absolute error + segment error

For example, p95 absolute error tells you about tail risk, similar to p95 latency in a backend.
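A minimal sketch of that report in plain Python. The delivery-time values are hypothetical, the p95 uses the nearest-rank definition (one of several percentile conventions), and MAPE here assumes no true value is zero.

```python
import math


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE, MAPE and nearest-rank p95 absolute error."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = sorted(abs(e) for e in errors)
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    # MAPE divides by the true value, so it assumes y_true contains no zeros.
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    p95_index = max(0, math.ceil(0.95 * n) - 1)  # nearest-rank percentile
    return {
        "mae": sum(abs_errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mape": mape,
        "p95_abs_error": abs_errors[p95_index],
    }


# Hypothetical delivery times in minutes; the last prediction is badly off.
y_true = [20, 30, 25, 40, 50]
y_pred = [22, 27, 25, 45, 30]

report = regression_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name:>14}: {value:.3f}")
```

In this made-up example one 20-minute miss pulls RMSE (about 9.4) well above MAE (6.0), which is the outlier sensitivity described in the table.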

8. Ranking Metrics: Recall@k, MRR, NDCG

Ranking metrics apply when the output is an ordered list: search results, recommendations, retrieval for RAG.

Recall@k

Recall@k answers: how many of the relevant items does the system retrieve within the top-k results?

Recall@k = number of relevant items appearing in top-k / total number of relevant items

For RAG, Recall@k is very important. If retrieval does not put the right context into the top-k, the LLM has almost no chance of answering correctly, no matter how good the prompt is.
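The formula above translates directly into code. The document IDs below are hypothetical, and `recall_at_k` is a helper name introduced for this sketch.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of all relevant items that appear in the first k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


# Hypothetical retrieval output for one query.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d8"}  # d8 is never retrieved

print(f"Recall@3 = {recall_at_k(ranked, relevant, 3):.3f}")  # only d2 is in the top 3
print(f"Recall@5 = {recall_at_k(ranked, relevant, 5):.3f}")  # d2 and d4 are in the top 5
```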

MRR

MRR is the mean reciprocal rank. It strongly rewards the correct item appearing near the top.

rank 1 -> 1.0
rank 2 -> 0.5
rank 5 -> 0.2
not found -> 0

MRR fits when each query usually needs only the first correct answer/item.
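A short sketch that reproduces the four cases listed above and averages them. The queries and document IDs are hypothetical.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


# Hypothetical queries: first hit at rank 1, rank 2, rank 5, and never.
queries = [
    (["a", "b", "c", "d", "e"], {"a"}),
    (["a", "b", "c", "d", "e"], {"b"}),
    (["a", "b", "c", "d", "e"], {"e"}),
    (["a", "b", "c", "d", "e"], {"z"}),
]

mrr = mean_reciprocal_rank(queries)
print(f"MRR = {mrr:.3f}")  # (1.0 + 0.5 + 0.2 + 0.0) / 4
```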

NDCG

NDCG is used when there are several levels of relevance, for example:

0 = not relevant
1 = slightly relevant
2 = relevant
3 = highly relevant

NDCG rewards placing highly relevant items at the top of the list. It fits search, recommendation and RAG when multiple documents can support the answer to different degrees.
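A minimal NDCG@k sketch using the graded labels above. It assumes the linear-gain variant (`rel / log2(position + 1)`); an exponential-gain variant (`2**rel - 1`) is also common. It also simplifies by building the ideal ordering from the retrieved list only, so relevant items that were never retrieved are ignored.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal


# Hypothetical relevance grades of the returned results, in ranked order.
ranked_relevances = [3, 0, 2, 1]

ndcg = ndcg_at_k(ranked_relevances, k=4)
print(f"NDCG@4 = {ndcg:.4f}")
print(f"NDCG@4 (ideal order) = {ndcg_at_k([3, 2, 1, 0], k=4):.4f}")
```

The non-relevant item at rank 2 pushes the grade-2 and grade-1 items down, so the score drops below 1.0 even though every relevant item is still present in the list.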

Rule for choosing a ranking metric:

Problem	Primary metric	Why
RAG needs to retrieve enough context	Recall@k	If retrieval is wrong, generation is unlikely to be right
FAQ/search needs a good first answer	MRR	The top rank matters most
Search/recommendation with graded relevance	NDCG@k	Rewards both ordering and relevance grade
Feed/recommendation optimizing click/conversion	NDCG@k + business metric	Good ranking does not guarantee higher revenue

9. Business Metric Vs ML Metric

An ML metric measures the model on a dataset. A business metric measures the real impact of decisions in the system.

Use case	ML metric	Business metric
Fraud detection	PR-AUC, recall, precision, FP/FN count	Fraud amount prevented, chargeback rate, legitimate approval rate, analyst workload
Churn prediction	ROC-AUC, PR-AUC, calibration, recall@top%	Retention uplift, campaign ROI, discount cost
Recommendation	NDCG@k, Recall@k, CTR prediction AUC	Conversion, revenue/session, long-term retention
RAG retrieval	Recall@k, MRR, NDCG@k	Answer correctness, support deflection, hallucination rate
Demand forecast	MAE, RMSE, p95 error	Stockout rate, inventory cost, waste

A good offline metric is only a necessary condition. The sufficient condition is that the metric can be tied to a decision and an outcome.

10. Fraud Case Study

Context

Assume a payment system handles 100,000 transactions/day:

Fraud rate: 1%.
Each fraud that slips through loses 500 USD on average.
Each manual review case costs 4 USD.
Each false positive causes 15 USD of friction from lost conversion, support tickets or a bad experience.
The analyst team can handle at most 300 alerts/day.

Why Is Accuracy Not Enough?

A model that always predicts "not fraud":

accuracy is about 99%
recall = 0
fraud prevented = 0
business loss is still very large

For fraud, the right question is not "what is the accuracy?" but:

How much real fraud is caught?
To catch that much, how many legitimate transactions must be reviewed or wrongly blocked?
Does alert volume exceed capacity?
Is expected value positive after subtracting review cost and friction?
Does any segment suffer from too many false positives?

When to Prioritize Recall?

Prioritize recall when:

Fraud amounts are large.
Regulatory/compliance risk is high.
The system has enough manual review capacity.
A false positive causes only mild friction and does not trigger an automatic block.

Example: large transactions or high-risk merchants can use a lower threshold to catch more fraud.

When to Prioritize Precision?

Prioritize precision when:

The action is an auto block.
False positives lose customers or significant revenue.
Analyst capacity is very low.
There are legal/compliance requirements to explain decisions.

Example: the auto-block threshold must be higher than the manual-review threshold.

Best Solution by Context

A better design for fraud is usually a multi-tier policy:

Tier	Condition	Action	Metric guardrail
Low risk	Low score	Approve	FN rate per segment
Medium risk	Medium score	Manual review	Alert volume, precision, analyst SLA
High risk	Very high score	Step-up auth or block	Very high precision, complaint rate

When fraud loss is large and capacity is low, sweep the threshold by expected value but add a capacity constraint:

Choose the threshold with the highest net value
subject to alerts_per_day <= analyst_capacity
and precision >= the minimum the business accepts
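The constrained selection above can be sketched as a filter followed by a maximum. The cost figures and the 300-alert capacity come from the case study; the three candidate rows and the 0.60 precision floor are hypothetical values made up to show the mechanics.

```python
FRAUD_LOSS_USD = 500.0
REVIEW_COST_USD = 4.0
FP_FRICTION_USD = 15.0
ANALYST_CAPACITY = 300
MIN_PRECISION = 0.60  # hypothetical business floor

# Hypothetical per-day counts at three candidate thresholds.
candidates = [
    {"threshold": 0.30, "tp": 600, "fp": 600},
    {"threshold": 0.60, "tp": 210, "fp": 90},
    {"threshold": 0.80, "tp": 135, "fp": 15},
]


def net_value(row):
    """Prevented loss minus review cost and false-positive friction."""
    alerts = row["tp"] + row["fp"]
    return (
        row["tp"] * FRAUD_LOSS_USD
        - alerts * REVIEW_COST_USD
        - row["fp"] * FP_FRICTION_USD
    )


def is_feasible(row):
    """Check the capacity and precision guardrails for one candidate."""
    alerts = row["tp"] + row["fp"]
    precision = row["tp"] / alerts if alerts else 0.0
    return alerts <= ANALYST_CAPACITY and precision >= MIN_PRECISION


unconstrained = max(candidates, key=net_value)
feasible = [row for row in candidates if is_feasible(row)]
chosen = max(feasible, key=net_value) if feasible else None

print("best ignoring constraints:", unconstrained["threshold"], net_value(unconstrained))
print("best within constraints:  ", chosen["threshold"], net_value(chosen))
```

With these assumed numbers the unconstrained optimum would create 1,200 alerts per day, four times what the analysts can handle, so the usable choice is the best threshold among the feasible rows rather than the global maximum.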

11. A Near-Production Evaluation Workflow

A practical process should follow this order:

Define the target, the positive class and the prediction time.
Choose the right split: a time-based split if the data has a time dimension; a stratified split for exercises/baselines.
Keep the test set fixed and do not tune repeatedly on it.
Train simple baselines first: dummy classifier, logistic regression, tree baseline.
Report the class distribution and the confusion matrix.
Report score metrics: ROC-AUC, PR-AUC/average precision.
Sweep the threshold and compute precision, recall, F1, FP, FN and alert volume.
Compute business cost/profit under explicit assumptions.
Check metrics per segment: country, merchant category, channel, device, customer type.
Check calibration if the probability is used as a real probability.
Run shadow mode or an A/B test before automating risky actions.
After deploy, monitor data drift, label drift, precision proxy, alert volume, latency and business KPIs.

12. Performance Considerations

Evaluation is not only about model quality. In production you also have to measure runtime behavior:

Feature generation latency.
Model inference latency p50/p95/p99.
Batch scoring time.
Memory footprint.
Throughput.
Cost per 1,000 predictions.
Alert volume per hour/day.

Illustrative targets for realtime fraud:

Feature fetch p95: < 50 ms
Model inference p95: < 20 ms
Decision end-to-end p95: < 100 ms
Alert volume: <= analyst capacity

The threshold sweep should be vectorized instead of calling the model repeatedly. The model only needs to score once to produce y_score; metrics for many thresholds are then computed from the same scores.
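The "score once, sweep many thresholds" idea can be sketched with one sort and a running sum. This version is plain Python on hypothetical data; with NumPy the same structure would typically use `argsort` and `cumsum`.

```python
from itertools import accumulate


def sweep_thresholds(y_true, y_score):
    """Confusion counts at every distinct score cut-off from a single sort."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    sorted_labels = [y_true[i] for i in order]
    sorted_scores = [y_score[i] for i in order]
    cum_tp = list(accumulate(sorted_labels))  # positives among the top i+1 scores
    total_pos = cum_tp[-1]

    rows = []
    for i, score in enumerate(sorted_scores):
        # For tied scores, emit one row at the last item of the tie group.
        if i + 1 < len(sorted_scores) and sorted_scores[i + 1] == score:
            continue
        alerts = i + 1  # everything with score >= this cut-off is flagged
        tp = cum_tp[i]
        rows.append(
            {
                "threshold": score,
                "tp": tp,
                "fp": alerts - tp,
                "fn": total_pos - tp,
                "precision": tp / alerts,
                "recall": tp / total_pos if total_pos else 0.0,
            }
        )
    return rows


# Hypothetical labels and scores, for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.92, 0.15, 0.55, 0.62, 0.05, 0.30, 0.48, 0.71]

rows = sweep_thresholds(y_true, y_score)
for row in rows:
    print(row)
```

The model is never called inside the loop; each row costs only a few integer operations, so sweeping thousands of thresholds stays cheap compared with a single scoring pass.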

For ranking/RAG, performance also includes:

Retrieval latency.
Number of top-k documents.
Reranking latency.
Context size passed to the LLM.
Token cost.

Increasing k can improve Recall@k but increases latency, cost and the risk of putting noise into the prompt.

13. Production Concerns

Common mistakes that make evaluation wrong:

Data leakage: a feature contains future information or a target proxy.
The test set is used too many times to tune the model.
Label delay: fraud/churn labels arrive late.
Feedback bias: only flagged cases get reviewed, so labels are incomplete.
Segment fairness: the aggregate looks good but one customer group has a high FP rate.
Poor calibration: a score of 0.8 does not correspond to an 80% probability.
Distribution drift: fraud patterns change after a few weeks.
Business process changes: the analyst team's capacity goes up or down but the threshold stays the same.

Mitigation:

Use the validation set for tuning and the test set for the final estimate.
Use time-based backtesting if the data has a time dimension.
Log the score, threshold, feature version, model version and action.
Store the evaluation dataset as a regression test.
Maintain a dashboard of offline/online metrics.
Review the threshold periodically against capacity and drift.

14. Can This Be Used in Production?

Yes, the metrics and workflow in this lesson can be used in production, but only when the following conditions hold:

The evaluation dataset represents production traffic or has been backtested over time.
The target and positive class are defined correctly for the business.
There is no leakage between train/validation/test.
The threshold is chosen by a specific cost, capacity or guardrail, not by defaulting to 0.5 without thought.
Metrics are reported for important segments, not only in aggregate.
Probabilities are calibrated if they are used as real probabilities.
There is post-deploy monitoring for drift, latency, alert volume and business KPIs.
There is a fallback/human review for high-risk actions.
Cost assumptions are confirmed by the business owner and updated when the business changes.

If these conditions are missing, the metrics are still useful for learning and offline research, but they are not enough to automate production decisions.

15. Self-Check

Why can a model that always predicts the majority class have high accuracy but low business value?
In fraud detection, how do the consequences of FP and FN differ?
When should you look at PR-AUC before ROC-AUC?
Why should a threshold of 0.5 not be the default choice?
How can F1 point in the wrong direction when FP and FN costs are strongly skewed?
For delivery time prediction, when would you choose MAE and when RMSE?
In RAG, why is Recall@k usually the first retrieval metric to look at?
What conditions are needed to take a chosen threshold into production?

Reference

Use this reference as a cheat sheet when reviewing a model. The goal is to help you choose a metric quickly while still respecting the context.

1. Classification Metrics Reference

Notation:

Symbol	Meaning
`TP`	Actual positive, predicted positive
`FP`	Actual negative, predicted positive
`FN`	Actual positive, predicted negative
`TN`	Actual negative, predicted negative

Metric	Formula	Question it answers	Watch out
Accuracy	`(TP + TN) / total`	How many predictions are correct overall?	Misleading under class imbalance
Precision	`TP / (TP + FP)`	Of the cases flagged positive, how many are correct?	Can be high while missing many positives
Recall	`TP / (TP + FN)`	Of the true positives, how many are caught?	Can be high while creating many FPs
F1	`2PR / (P + R)`	How well balanced are precision and recall?	Does not reflect skewed costs
FPR	`FP / (FP + TN)`	How many true negatives are wrongly flagged?	A small FPR still means many FPs if negatives are very numerous
Specificity	`TN / (TN + FP)`	How many true negatives are kept correct?	Less intuitive for the business than FP count
ROC-AUC	Area under ROC curve	Are positives ranked above negatives?	Can look optimistic with rare positives
PR-AUC / Average Precision	Area/summary of the precision-recall curve	Does the model handle rare positives well?	Does not choose the threshold for you

2. Regression Metrics Reference

Metric	Interpretation	When to use	When not to use
MAE	Mean absolute error, in the same units as the target	Stakeholders need a quick read; outliers should not dominate	Large errors are serious but not penalized strongly enough
MSE	Mean squared error	As a loss function; when you want to penalize large errors	Units are squared, hard to interpret
RMSE	Square root of MSE, in the same units as the target	Large errors are dangerous; units must match the target	Sensitive to outliers
MAPE	Mean percentage error	Forecasting when the target is always positive and far from 0	Target is close to 0 or contains 0
p95 absolute error	95% of samples have an error below this value	Tail risk/SLA needs to be managed	Does not replace an average metric

Regression checklist:

Always look at the distribution of residuals.
Report metrics for important segments.
For forecasting, compare against a naive baseline: predict using the previous day/week.
If the target has many values close to 0, do not use MAPE as the primary metric.
If under-prediction and over-prediction have different costs, consider a custom metric or quantile loss.

3. Ranking Metrics Reference

Metric	Used for	Meaning	Trade-off
Recall@k	Retrieval, RAG, recommendation	Does the top-k contain enough relevant items?	Increasing k usually raises recall but also latency/cost
Precision@k	Search/recommendation	How clean is the top-k?	Does not measure recall when there are many relevant items
MRR	FAQ/search with one main answer	Is the first correct item ranked high?	Ignores further relevant items after the first correct one
NDCG@k	Search/recommendation with relevance grades	Are highly relevant items ranked at the top?	Requires graded relevance labels

RAG rule:

Retrieval quality first, generation quality second.
If Recall@k is low, the LLM lacks the right context and is prone to hallucinate.

4. Decision Matrix for Choosing a Metric

Context	Primary metric	Secondary metric	Typical decision
Balanced binary classification	Accuracy, F1, ROC-AUC	Confusion matrix	Choose the model that generalizes well
Fraud/anomaly with rare positives	PR-AUC, recall, precision, FP/FN count	ROC-AUC, calibration	Choose the threshold by cost/capacity
Medical/security triage	Recall, FN count	Precision, workload	Keep recall high, add human review
Auto block/ban user	Precision, FP count	Recall, complaint rate	High threshold, careful audit
Churn campaign	PR-AUC, recall@budget, calibration	ROI, uplift	Choose the top-N users to target
Regression forecast	MAE/RMSE, p95 error	MAPE if safe	Choose the model by the cost of large errors
Search/RAG retrieval	Recall@k, MRR/NDCG	Latency, token cost	Choose k/reranker by quality versus cost

5. Threshold Selection Recipes

Recipe A: Maximize F1

Use when:

Precision and recall matter roughly equally.
There is no clear cost model yet.
You need a quick baseline.

Not enough when:

FP and FN costs are strongly skewed.
There is a capacity constraint.
The positive action causes significant harm.

Recipe B: Recall Guardrail

Choose the threshold with the lowest cost
subject to recall >= target_recall

Use when missing a positive is very dangerous. Example: security triage wants recall of at least 95%.

Recipe C: Precision Guardrail

Choose the threshold with the highest recall
subject to precision >= target_precision

Use when the positive action has a strong impact. Example: auto block needs precision of at least 98%.

Recipe D: Capacity Constraint

Choose the threshold with the highest expected value
subject to alerts_per_day <= analyst_capacity

Use when the operations team can only handle a limited number of cases.

Recipe E: Expected Value

Example for fraud manual review:

value = TP * fraud_loss_prevented
cost = FP * false_positive_friction + (TP + FP) * review_cost
net_value = value - cost

Choose the threshold with the highest net_value, then check the guardrails:

Minimum precision.
Minimum recall.
Alert volume.
Segment fairness.
Latency.
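One consequence of the Recipe E formula is a break-even precision: the share of alerts that must be real fraud before reviewing them pays off. Setting net value per alert to zero gives `p = (review_cost + false_positive_friction) / (fraud_loss_prevented + false_positive_friction)`. The sketch below uses the case-study figures (500, 15 and 4 USD) as defaults; the function names are introduced here for illustration.

```python
def net_value(tp, fp, fraud_loss_prevented=500.0,
              false_positive_friction=15.0, review_cost=4.0):
    """Recipe E: prevented loss minus friction and review cost."""
    value = tp * fraud_loss_prevented
    cost = fp * false_positive_friction + (tp + fp) * review_cost
    return value - cost


def break_even_precision(fraud_loss_prevented=500.0,
                         false_positive_friction=15.0, review_cost=4.0):
    """Precision at which an extra batch of alerts has zero net value."""
    return (review_cost + false_positive_friction) / (
        fraud_loss_prevented + false_positive_friction
    )


bep = break_even_precision()
print(f"break-even precision = {bep:.4f}")             # 19 / 515
print("net value at tp=210, fp=90:", net_value(210, 90))
print("net value at tp=19, fp=496:", net_value(19, 496))  # exactly at break-even
```

Under these assumed costs the break-even precision is under 4%, which suggests that capacity and customer-experience guardrails, rather than the raw economics, are usually what limit how low the review threshold can go.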

6. Fraud Evaluation Template

When reviewing a fraud model, report at least:

Dataset:
- Time window:
- Train/validation/test split:
- Positive class:
- Fraud rate:
- Label delay:

Score metrics:
- ROC-AUC:
- PR-AUC / average precision:
- Calibration check:

Threshold decision:
- Candidate thresholds:
- Selected threshold:
- Precision / recall / F1:
- TP / FP / FN / TN:
- Alerts per day:
- Expected net value:
- Capacity OK:

Segment checks:
- Country:
- Merchant category:
- Payment method:
- Customer age/cohort:
- Device/channel:

Production checks:
- Feature latency p95:
- Inference latency p95:
- Drift monitoring:
- Feedback loop:
- Human review process:

7. Segment Metrics

Aggregate metrics can hide errors within segments. Always check segments that carry business or fairness risk:

Region/country.
Language.
Device/channel.
Customer type.
Merchant category.
Account age.
Transaction amount bucket.
Traffic source.

For example, a fraud model has an overall precision of 70%, but in the new_user + wallet segment precision is only 20%. If you auto block this segment, the business will receive many complaints even though the aggregate looks fine.
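A minimal sketch of a per-segment precision check. The records are synthetic counts constructed to reproduce the 70% overall and 20% segment figures from the example above; the second segment name is hypothetical.

```python
from collections import defaultdict


def precision_by_segment(records):
    """Precision per segment from (segment, y_true, y_pred) records."""
    tp = defaultdict(int)
    fp = defaultdict(int)
    for segment, label, predicted in records:
        if predicted == 1:
            if label == 1:
                tp[segment] += 1
            else:
                fp[segment] += 1
    return {s: tp[s] / (tp[s] + fp[s]) for s in set(tp) | set(fp)}


# Synthetic alerts: 100 flagged transactions, 70 of them real fraud.
records = (
    [("new_user+wallet", 1, 1)] * 4
    + [("new_user+wallet", 0, 1)] * 16
    + [("returning+card", 1, 1)] * 66
    + [("returning+card", 0, 1)] * 14
)

by_segment = precision_by_segment(records)
flagged = [r for r in records if r[2] == 1]
overall = sum(1 for r in flagged if r[1] == 1) / len(flagged)

print(f"overall precision: {overall:.2f}")
for segment, value in sorted(by_segment.items()):
    print(f"{segment:>18}: {value:.3f}")
```

The same grouping works for recall or FP rate; the point is that the segment breakdown should be produced at the chosen threshold, not only for the aggregate.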

8. Offline Vs Online Metrics

Metric type	Examples	Goal	Risk
Offline ML metric	ROC-AUC, PR-AUC, F1, MAE, NDCG	Compare models before deploy	Dataset does not represent production
Online technical metric	Latency, error rate, throughput	The system runs reliably	Model is correct but serves slowly
Online business metric	Fraud loss, conversion, revenue, retention	Real impact	Affected by many factors outside the model
Ops metric	Alert volume, review SLA, analyst precision	Whether it can be operated	Model creates workload beyond capacity

Do not deploy just because an offline metric went up. Ask:

Is the metric gain large enough to be worth the risk?
Does it improve the business metric?
Does it make latency/cost/capacity worse?
Is any segment negatively affected?

9. Production Readiness Checklist

10. API Notes for the scikit-learn Exercise

The APIs used in the exercise were checked against the stable scikit-learn documentation via Context7:

sklearn.pipeline.Pipeline
sklearn.compose.ColumnTransformer
sklearn.preprocessing.OneHotEncoder(handle_unknown="ignore")
sklearn.metrics.confusion_matrix
sklearn.metrics.precision_recall_curve
sklearn.metrics.average_precision_score
sklearn.metrics.roc_auc_score
sklearn.model_selection.train_test_split

Pattern production-friendly:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numeric_pipeline, categorical_pipeline, the two feature lists and classifier
# are placeholders here; build_model() in the exercise script defines them.
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features),
    ]
)

model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("classifier", classifier),
    ]
)

Key point: every transformer that has .fit() must live inside the Pipeline and be fit only on the training data.

11. The Short Production Answer

Can this be used in production? Yes, if:

The evaluation dataset respects time ordering and represents real traffic.
The metric is chosen according to the business objective.
The threshold is chosen by cost/capacity/guardrail.
There is online monitoring and a periodic review process.
Segments, drift, latency and feedback bias are checked.

It is not production-ready if all you have is a notebook with a good-looking accuracy/ROC-AUC but no threshold, cost model, segment analysis or monitoring plan.

Exercise

This exercise helps you build a near-production evaluation pipeline for fraud detection. The dataset is synthetic so that it runs quickly, but the workflow mirrors real practice: a proper split, Pipeline, consistent preprocessing, score metrics, threshold sweep, expected value and capacity reasoning.

1. Setup

Install the libraries:

pip install numpy pandas scikit-learn

Create a local file with any name, for example fraud_metrics_day06.py, then paste the code below and run it.

2. Full Script

from __future__ import annotations

from dataclasses import dataclass

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    f1_score,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    precision_recall_curve,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


RANDOM_STATE = 42


@dataclass(frozen=True)
class BusinessAssumptions:
    fraud_loss_usd: float = 500.0
    review_cost_usd: float = 4.0
    false_positive_friction_usd: float = 15.0
    daily_transaction_volume: int = 100_000
    analyst_capacity_per_day: int = 300


def make_fraud_like_dataset(
    n_samples: int = 60_000,
    random_state: int = RANDOM_STATE,
) -> tuple[pd.DataFrame, np.ndarray]:
    """Create an imbalanced fraud-like dataset with numeric and categorical features."""
    rng = np.random.default_rng(random_state)

    X_num, y = make_classification(
        n_samples=n_samples,
        n_features=10,
        n_informative=5,
        n_redundant=2,
        n_clusters_per_class=3,
        weights=[0.985, 0.015],
        class_sep=1.15,
        flip_y=0.01,
        random_state=random_state,
    )

    df = pd.DataFrame(X_num, columns=[f"numeric_{i:02d}" for i in range(X_num.shape[1])])

    # Amount is intentionally skewed, like real transaction amount.
    amount = np.exp(np.clip(df["numeric_00"] + 3.0, 0.0, 8.0))
    amount += rng.gamma(shape=2.0, scale=20.0, size=n_samples)
    df["amount_usd"] = amount.round(2)
    df["hour_of_day"] = rng.integers(0, 24, size=n_samples)
    df["account_age_days"] = rng.integers(1, 2_000, size=n_samples)

    merchant_categories = np.array(
        ["grocery", "travel", "electronics", "gaming", "gift_card", "crypto", "fashion"]
    )
    df["merchant_category"] = rng.choice(
        merchant_categories,
        size=n_samples,
        p=[0.30, 0.15, 0.18, 0.12, 0.10, 0.05, 0.10],
    )

    payment_methods = np.array(["card", "wallet", "bank_transfer", "bnpl"])
    df["payment_method"] = rng.choice(payment_methods, size=n_samples, p=[0.68, 0.18, 0.10, 0.04])

    countries = np.array(["VN", "US", "SG", "ID", "TH", "BR", "NG"])
    df["country"] = rng.choice(countries, size=n_samples, p=[0.42, 0.18, 0.08, 0.12, 0.10, 0.06, 0.04])

    # Inject business-shaped signal into categorical fields without making the task trivial.
    fraud_idx = y == 1
    df.loc[fraud_idx, "merchant_category"] = rng.choice(
        ["gift_card", "crypto", "electronics", "gaming"],
        size=int(fraud_idx.sum()),
        p=[0.35, 0.30, 0.20, 0.15],
    )
    df.loc[fraud_idx, "payment_method"] = rng.choice(
        ["wallet", "card", "bnpl"],
        size=int(fraud_idx.sum()),
        p=[0.45, 0.35, 0.20],
    )

    # Simulate a few missing values to force the pipeline to handle real-world input.
    missing_mask = rng.random(n_samples) < 0.01
    df.loc[missing_mask, "account_age_days"] = np.nan
    return df, y


def build_model(numeric_features: list[str], categorical_features: list[str]) -> Pipeline:
    numeric_pipeline = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]
    )

    categorical_pipeline = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("one_hot", OneHotEncoder(handle_unknown="ignore")),
        ]
    )

    preprocess = ColumnTransformer(
        transformers=[
            ("num", numeric_pipeline, numeric_features),
            ("cat", categorical_pipeline, categorical_features),
        ]
    )

    return Pipeline(
        steps=[
            ("preprocess", preprocess),
            (
                "classifier",
                LogisticRegression(
                    class_weight="balanced",
                    max_iter=2_000,
                    random_state=RANDOM_STATE,
                ),
            ),
        ]
    )


def evaluate_threshold(
    y_true: np.ndarray,
    y_score: np.ndarray,
    threshold: float,
    assumptions: BusinessAssumptions,
) -> dict[str, float | bool]:
    y_pred = (y_score >= threshold).astype(int)
    # Pin the label order so the matrix stays 2x2 even if one class is absent.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    alerts = tp + fp
    scale = assumptions.daily_transaction_volume / len(y_true)

    prevented_loss = tp * assumptions.fraud_loss_usd
    missed_loss = fn * assumptions.fraud_loss_usd
    review_cost = alerts * assumptions.review_cost_usd
    false_positive_friction = fp * assumptions.false_positive_friction_usd
    net_value = prevented_loss - review_cost - false_positive_friction

    alerts_per_day = alerts * scale
    return {
        "threshold": threshold,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "tp": int(tp),
        "fp": int(fp),
        "fn": int(fn),
        "tn": int(tn),
        "alerts": int(alerts),
        "alerts_per_day": alerts_per_day,
        "missed_loss_per_day": missed_loss * scale,
        "net_value_per_day": net_value * scale,
        "capacity_ok": alerts_per_day <= assumptions.analyst_capacity_per_day,
    }


def build_threshold_report(
    y_true: np.ndarray,
    y_score: np.ndarray,
    assumptions: BusinessAssumptions,
) -> pd.DataFrame:
    thresholds = np.round(np.linspace(0.01, 0.99, 99), 2)
    rows = [evaluate_threshold(y_true, y_score, threshold, assumptions) for threshold in thresholds]
    return pd.DataFrame(rows)


def print_score_metrics(y_true: np.ndarray, y_score: np.ndarray) -> None:
    positive_rate = y_true.mean()
    print("\n=== Score metrics ===")
    print(f"Positive rate:      {positive_rate:.4f}")
    print(f"ROC-AUC:            {roc_auc_score(y_true, y_score):.4f}")
    print(f"PR-AUC / AP:        {average_precision_score(y_true, y_score):.4f}")

    precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
    f1_values = 2 * precision[:-1] * recall[:-1] / np.maximum(precision[:-1] + recall[:-1], 1e-12)
    best_idx = int(np.argmax(f1_values))
    print(f"Best F1 threshold:  {pr_thresholds[best_idx]:.4f}")
    print(f"Best F1 from curve: {f1_values[best_idx]:.4f}")


def run_classification_experiment() -> None:
    assumptions = BusinessAssumptions()
    X, y = make_fraud_like_dataset()

    numeric_features = X.select_dtypes(include="number").columns.tolist()
    categorical_features = X.select_dtypes(exclude="number").columns.tolist()

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.25,
        stratify=y,
        random_state=RANDOM_STATE,
    )

    model = build_model(numeric_features, categorical_features)
    model.fit(X_train, y_train)
    y_score = model.predict_proba(X_test)[:, 1]

    print_score_metrics(y_test, y_score)

    report = build_threshold_report(y_test, y_score, assumptions)
    columns = [
        "threshold",
        "precision",
        "recall",
        "f1",
        "tp",
        "fp",
        "fn",
        "alerts_per_day",
        "net_value_per_day",
        "capacity_ok",
    ]

    print("\n=== Selected thresholds ===")
    selected = report[report["threshold"].isin([0.10, 0.20, 0.30, 0.50, 0.70, 0.90])]
    print(selected[columns].to_string(index=False))

    best_value = report.sort_values("net_value_per_day", ascending=False).iloc[0]
    print("\n=== Best threshold by expected net value ===")
    print(best_value[columns].to_string())

    capacity_candidates = report[report["capacity_ok"]]
    if capacity_candidates.empty:
        print("\nNo threshold satisfies analyst capacity. Consider top-N review instead of threshold.")
    else:
        best_capacity = capacity_candidates.sort_values("net_value_per_day", ascending=False).iloc[0]
        print("\n=== Best threshold with capacity constraint ===")
        print(best_capacity[columns].to_string())

    high_recall_candidates = report[report["recall"] >= 0.80]
    if not high_recall_candidates.empty:
        best_high_recall = high_recall_candidates.sort_values(
            "net_value_per_day",
            ascending=False,
        ).iloc[0]
        print("\n=== Best threshold with recall >= 0.80 ===")
        print(best_high_recall[columns].to_string())


def run_regression_metric_mini_demo() -> None:
    y_true = np.array([100, 120, 130, 90, 600, 80], dtype=float)
    y_pred = np.array([105, 110, 125, 100, 420, 82], dtype=float)

    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = float(np.sqrt(mse))
    mape = mean_absolute_percentage_error(y_true, y_pred)

    print("\n=== Regression metric mini demo ===")
    print(f"MAE:  {mae:.2f}")
    print(f"MSE:  {mse:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print(f"MAPE: {mape:.2%}")


def recall_at_k(relevant: set[str], ranked: list[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked[:k])) / len(relevant)


def reciprocal_rank_at_k(relevant: set[str], ranked: list[str], k: int) -> float:
    for idx, item_id in enumerate(ranked[:k], start=1):
        if item_id in relevant:
            return 1.0 / idx
    return 0.0


def ndcg_at_k(relevance_by_item: dict[str, int], ranked: list[str], k: int) -> float:
    def dcg(items: list[str]) -> float:
        score = 0.0
        for idx, item_id in enumerate(items, start=1):
            rel = relevance_by_item.get(item_id, 0)
            score += (2**rel - 1) / np.log2(idx + 1)
        return score

    actual = dcg(ranked[:k])
    ideal_items = sorted(relevance_by_item, key=relevance_by_item.get, reverse=True)
    ideal = dcg(ideal_items[:k])
    return 0.0 if ideal == 0 else actual / ideal


def run_ranking_metric_mini_demo() -> None:
    relevant_docs = {"doc_2", "doc_5"}
    ranked_docs = ["doc_9", "doc_2", "doc_7", "doc_5", "doc_1"]
    relevance_grade = {"doc_2": 3, "doc_5": 2, "doc_7": 1}

    print("\n=== Ranking metric mini demo ===")
    print(f"Recall@3: {recall_at_k(relevant_docs, ranked_docs, k=3):.4f}")
    print(f"MRR@5:    {reciprocal_rank_at_k(relevant_docs, ranked_docs, k=5):.4f}")
    print(f"NDCG@5:   {ndcg_at_k(relevance_grade, ranked_docs, k=5):.4f}")


if __name__ == "__main__":
    run_classification_experiment()
    run_regression_metric_mini_demo()
    run_ranking_metric_mini_demo()

3. How to Read the Output

You will see three groups of output:

Score metrics: positive rate, ROC-AUC, PR-AUC/average precision, best F1 threshold.
Selected thresholds: precision/recall/F1, confusion matrix counts, alert volume and expected value at a few thresholds.
Best threshold: the best threshold by net value, by net value under the capacity constraint, and by net value under the recall guardrail.
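The expected value columns follow the arithmetic in evaluate_threshold. As a rough sketch, the same calculation with made-up numbers looks like this. The counts, costs, test-set size and daily volume below are hypothetical and are not the defaults of BusinessAssumptions.

```python
# Hypothetical confusion-matrix counts at one threshold (not real output).
tp, fp, fn = 40, 160, 10

# Hypothetical cost assumptions in USD (not the BusinessAssumptions defaults).
fraud_loss_usd = 500.0
review_cost_usd = 5.0
false_positive_friction_usd = 2.0

alerts = tp + fp                                      # every alert gets reviewed
prevented_loss = tp * fraud_loss_usd                  # fraud that was caught
missed_loss = fn * fraud_loss_usd                     # fraud that slipped through
review_cost = alerts * review_cost_usd                # analyst time
friction = fp * false_positive_friction_usd           # cost of bothering good customers
net_value = prevented_loss - review_cost - friction   # same formula as evaluate_threshold

# Scale test-set counts to a full day of traffic.
test_rows = 10_000                  # hypothetical test-set size
daily_transaction_volume = 50_000   # hypothetical daily volume
scale = daily_transaction_volume / test_rows

print(f"alerts_per_day:      {alerts * scale:,.0f}")       # 1,000
print(f"missed_loss_per_day: {missed_loss * scale:,.0f}")  # 25,000
print(f"net_value_per_day:   {net_value * scale:,.0f}")    # 93,400
```

Note that net_value does not subtract missed_loss. The script reports missed loss as a separate column, so a threshold can look profitable while still letting a lot of fraud through.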

Do not simply pick the threshold with the highest F1. Compare it against:

net_value_per_day
alerts_per_day
capacity_ok
precision
recall
fp and fn

If the threshold that maximizes net value generates 2,000 alerts/day but analysts can handle only 300, that threshold cannot be deployed yet. You have to pick a threshold that satisfies capacity, or change the policy to top-N review.
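A minimal sketch of a top-N review policy, written in plain Python so it runs on its own. The helper name top_n_review and the scores are hypothetical; in the script you would apply the same idea to one day's worth of y_score with capacity set to analyst_capacity_per_day.

```python
def top_n_review(scores: list[float], capacity: int) -> list[int]:
    """Return indices of the `capacity` highest-scoring transactions."""
    if capacity <= 0:
        return []
    # Sort indices by score, highest first; ties keep their original order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:capacity]


# Hypothetical fraud scores for six transactions.
scores = [0.91, 0.15, 0.62, 0.08, 0.77, 0.40]
flagged = top_n_review(scores, capacity=3)
print(flagged)  # [0, 4, 2]

# The lowest flagged score is the threshold this policy implies for this batch.
implied_threshold = min(scores[i] for i in flagged)
print(implied_threshold)  # 0.62
```

With top-N, alert volume is fixed and the implied threshold moves from batch to batch, so precision and recall need to be monitored rather than assumed.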

4. Required Exercises

Place the snippets in this section inside run_classification_experiment(), after X_test, y_test, y_score and report are available, unless an exercise explicitly says to modify a different function.

Exercise 1: Accuracy Trap

Add a baseline that always predicts 0:

y_pred_all_negative = np.zeros_like(y_test)
print(accuracy_score(y_test, y_pred_all_negative))
print(recall_score(y_test, y_pred_all_negative, zero_division=0))

Answer:

What is the baseline's accuracy?
What is its recall?
Why does this baseline have no business value?

Exercise 2: Change the Cost Assumption

Change the assumptions to:

BusinessAssumptions(fraud_loss_usd=5_000.0)

Compare the selected thresholds before and after changing the fraud loss.

Answer:

Does the optimal threshold go down?
Does recall go up?
Does alert volume exceed capacity?
What does this tell you about the trade-off between fraud loss and operational cost?
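One way to reason about the direction of the change is a break-even calculation. The sketch below assumes the cost model used in evaluate_threshold (each alert costs a review, each false positive adds friction, each true positive prevents one fraud loss) and assumes the score is a calibrated fraud probability p. The helper name break_even_probability and the review and friction costs are hypothetical.

```python
def break_even_probability(
    fraud_loss_usd: float,
    review_cost_usd: float,
    false_positive_friction_usd: float,
) -> float:
    """Fraud probability above which raising an alert has positive expected value.

    Expected value of one alert with fraud probability p:
        p * fraud_loss - review_cost - (1 - p) * friction
    Setting this to zero and solving for p gives the formula below.
    """
    return (review_cost_usd + false_positive_friction_usd) / (
        fraud_loss_usd + false_positive_friction_usd
    )


# Hypothetical review and friction costs; only fraud_loss_usd changes.
low_loss = break_even_probability(500.0, 5.0, 2.0)
high_loss = break_even_probability(5_000.0, 5.0, 2.0)
print(f"fraud_loss_usd=500:  alert when p > {low_loss:.4f}")
print(f"fraud_loss_usd=5000: alert when p > {high_loss:.4f}")
```

A higher fraud loss lowers the break-even probability, so more transactions are worth reviewing. Because the model in this script uses class_weight="balanced", y_score is not a calibrated probability, so treat this as intuition and rely on the threshold sweep for the actual number.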

Exercise 3: Capacity of 300 Alerts/Day

Keep analyst_capacity_per_day=300.

Answer:

Which threshold has capacity_ok = True and the highest net_value_per_day?
If no threshold both reaches the desired recall and stays within capacity, what would you propose?

Suggested solutions:

Use top-N scoring instead of an absolute threshold.
Split into tiers: auto allow, manual review, step-up auth, auto block.
Increase analyst capacity during high-risk seasons.
Add better features to raise precision.
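The tiering idea can be sketched as a simple score-to-action mapping. The cut-offs below are hypothetical placeholders, and the ordering of the tiers by risk (manual review below step-up auth below auto block) is an assumption; in practice each cut-off would come from the threshold report and its own guardrail.

```python
from collections import Counter


def assign_tier(
    score: float,
    review_at: float = 0.30,   # hypothetical cut-off
    step_up_at: float = 0.70,  # hypothetical cut-off
    block_at: float = 0.95,    # hypothetical cut-off
) -> str:
    """Map a fraud score to an action tier, checking the strictest tier first."""
    if score >= block_at:
        return "auto_block"
    if score >= step_up_at:
        return "step_up_auth"
    if score >= review_at:
        return "manual_review"
    return "auto_allow"


# Hypothetical scores for six transactions.
scores = [0.02, 0.35, 0.10, 0.72, 0.97, 0.45]
tiers = [assign_tier(s) for s in scores]
print(tiers)
print(Counter(tiers))  # only manual_review consumes analyst capacity
```

Only the manual_review tier draws on analyst capacity, which is why tiering can raise coverage without raising review volume.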

Exercise 4: Precision Guardrail for Auto Block

Suppose auto block requires precision of at least 95%. Filter the report:

auto_block_candidates = report[report["precision"] >= 0.95]

Answer:

Does any threshold meet the bar?
Is recall low at that threshold?
If recall is low but precision is high, which action fits: auto block or manual review?

Exercise 5: Segment Analysis

Build a report by merchant_category:

X_eval = X_test.copy()
X_eval["y_true"] = y_test
X_eval["y_score"] = y_score
X_eval["y_pred"] = (y_score >= 0.5).astype(int)

segment_rows = []
for segment, group in X_eval.groupby("merchant_category"):
    if group["y_true"].nunique() < 2:
        continue
    segment_rows.append(
        {
            "merchant_category": segment,
            "rows": len(group),
            "positive_rate": group["y_true"].mean(),
            "precision": precision_score(group["y_true"], group["y_pred"], zero_division=0),
            "recall": recall_score(group["y_true"], group["y_pred"], zero_division=0),
            "roc_auc": roc_auc_score(group["y_true"], group["y_score"]),
            "average_precision": average_precision_score(group["y_true"], group["y_score"]),
        }
    )

segment_report = pd.DataFrame(segment_rows).sort_values("average_precision")
print(segment_report.to_string(index=False))

Answer:

Which segment is weakest?
Is the segment weak because of too little data, a different positive rate, or poor model ranking?
Should every segment use the same threshold?

Exercise 6: Regression Metrics

In the mini demo, change y_pred so that one sample has a very large error. Observe MAE and RMSE.

Answer:

How much more sharply does RMSE rise than MAE?
In demand forecasting, why can a large error be more dangerous than a small one?
When should MAPE not be used?
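To build intuition before answering, here is a small stand-alone sketch with hand-written metric functions and hypothetical values. It is separate from the mini demo and does not use scikit-learn, so the helper names mae, rmse and mape are assumptions of this example.

```python
import math


def mae(y_true: list[float], y_pred: list[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true: list[float], y_pred: list[float]) -> float:
    # Divides by the actual value, so it is undefined when any actual is 0.
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Case 1: one large error. Errors go from [5, 5, 5] to [5, 5, 50].
y_true = [100.0, 100.0, 100.0]
even = [105.0, 105.0, 105.0]
outlier = [105.0, 105.0, 150.0]
print(f"MAE  {mae(y_true, even):.2f} -> {mae(y_true, outlier):.2f}")    # 5.00 -> 20.00
print(f"RMSE {rmse(y_true, even):.2f} -> {rmse(y_true, outlier):.2f}")  # 5.00 -> 29.15

# Case 2: the same absolute error of 5 on a tiny actual value.
small_true = [0.5, 100.0, 200.0]
small_pred = [5.5, 105.0, 195.0]
print(f"MAE  {mae(small_true, small_pred):.2f}")   # 5.00
print(f"MAPE {mape(small_true, small_pred):.2%}")  # dominated by the 0.5 sample
```

In the second case every prediction is off by the same 5 units, yet MAPE is driven almost entirely by the sample whose actual value is close to zero.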

Exercise 7: Ranking Metrics

Edit ranked_docs so that the first relevant doc moves from rank 2 down to rank 5.

Answer:

Does Recall@5 change?
Does MRR@5 change?
Does NDCG@5 change?
Why, in RAG, can high Recall@k with low MRR still produce a poor experience?
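As an illustration of the last question, the sketch below compares two hypothetical rankings that retrieve the same relevant documents at different positions. The two helper functions repeat the definitions from the script so the block runs on its own; the document IDs are made up and differ from the mini demo.

```python
def recall_at_k(relevant: set[str], ranked: list[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked[:k])) / len(relevant)


def reciprocal_rank_at_k(relevant: set[str], ranked: list[str], k: int) -> float:
    for idx, item_id in enumerate(ranked[:k], start=1):
        if item_id in relevant:
            return 1.0 / idx
    return 0.0


relevant = {"doc_a", "doc_b"}
early = ["doc_a", "doc_b", "doc_x", "doc_y", "doc_z"]  # relevant docs at ranks 1 and 2
late = ["doc_x", "doc_y", "doc_z", "doc_a", "doc_b"]   # relevant docs at ranks 4 and 5

for name, ranking in [("early", early), ("late", late)]:
    print(
        f"{name}: Recall@5={recall_at_k(relevant, ranking, k=5):.2f} "
        f"RR@5={reciprocal_rank_at_k(relevant, ranking, k=5):.2f}"
    )
# early: Recall@5=1.00 RR@5=1.00
# late: Recall@5=1.00 RR@5=0.25
```

Both rankings have perfect Recall@5, but in the second one three irrelevant documents sit ahead of the relevant ones, which is exactly the content a reader or a generator sees first.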

5. Review Questions After the Exercises

Which metrics would you put on the offline dashboard for the fraud model?
Which metrics would you put on the online dashboard?
What production threshold do you choose, and why?
If the business raises analyst capacity from 300 to 1,000 alerts/day, should the threshold change?
If the fraud pattern drifts after 2 weeks, which metric will signal it first?
If a model has better PR-AUC but p95 latency is 5 times higher, do you deploy it?

6. Expected Takeaways

After the exercises, you should take away the following:

Accuracy is nearly useless when the positive class is very rare and the business cares about the positive class.
A good ROC-AUC does not guarantee a good production threshold.
PR-AUC/average precision is usually more informative than accuracy for imbalanced classification.
The threshold is a business decision, not just a model decision.
Cost/profit and capacity can lead to a different threshold than F1 does.
Segment metrics are mandatory before production.
Regression and ranking need their own metrics; do not force a classification mindset onto them.

7. Production Readiness of the Exercise

The code in this lesson works as a skeleton for production evaluation, but it is not a complete production system.

It is usable if you add:

Real data with a time-based split and label delay handling.
A feature pipeline that is identical in training and serving.
Cost assumptions confirmed by the business owner.
A complete segment/fairness report.
Calibration if the score is used as a probability.
Model registry, versioning and reproducible training.
Post-deploy monitoring: drift, latency, alert volume, precision proxy, business KPI.
A human review workflow and rollback policy.

Do not use it directly to auto block real transactions until the conditions above are in place.
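For the calibration item in the list above, one simple check is a reliability table: bucket the scores, then compare the mean score in each bucket with the observed positive rate. The sketch below is a minimal plain-Python version with hypothetical labels and scores; the helper names reliability_table and expected_calibration_error are assumptions of this example, not part of the script.

```python
def reliability_table(
    y_true: list[int],
    y_score: list[float],
    n_bins: int = 5,
) -> list[dict]:
    """Group scores into equal-width bins and compare mean score to observed rate."""
    bins: list[list[tuple[int, float]]] = [[] for _ in range(n_bins)]
    for label, score in zip(y_true, y_score):
        idx = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((label, score))

    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        rows.append(
            {
                "bin": idx,
                "count": len(members),
                "mean_score": sum(s for _, s in members) / len(members),
                "observed_rate": sum(l for l, _ in members) / len(members),
            }
        )
    return rows


def expected_calibration_error(rows: list[dict]) -> float:
    """Count-weighted average gap between mean score and observed rate."""
    total = sum(row["count"] for row in rows)
    return sum(
        row["count"] / total * abs(row["mean_score"] - row["observed_rate"])
        for row in rows
    )


# Hypothetical labels and scores, only to show the shape of the output.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.05, 0.12, 0.31, 0.38, 0.62, 0.68, 0.91, 0.97]
table = reliability_table(y_true, y_score)
for row in table:
    print(row)
print(f"ECE: {expected_calibration_error(table):.4f}")
```

A large gap between mean_score and observed_rate in a bin suggests the score should not be read as a probability there without recalibration. On real data each bin needs far more samples than this toy example has.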