Day 13: Attention Mechanism

Goals

After this lesson, you should be able to:

Explain Query, Key, and Value using a mental model close to search/index/cache in a backend.
Compute scaled dot-product attention at the shape level: QK^T -> softmax -> weighted sum of V.
Distinguish self-attention, the padding mask, and the causal mask.
Understand multi-head attention: why the input is split into several heads, concatenated, and then projected back to embed_dim.
Explain why attention parallelizes better than an RNN during training but costs O(n^2) in sequence length.
Write and review a small PyTorch self-attention module with shape validation, mask handling, dropout, dtype/device awareness, and basic tests.
Answer the question: can attention be used in production, and under what conditions?

Position in Phase 2

Day 12 covered how text becomes a sequence of token ids after the tokenizer, with padding, truncation, and an attention mask. Day 13 explains how the model uses that mask so that tokens can "look at" each other. Day 14 will place attention inside a complete Transformer block together with positional encoding, LayerNorm, a feed-forward network, and residual connections.

Day 12: text -> token ids -> embedding-ready sequence
Day 13: token embeddings -> attention -> contextual embeddings
Day 14: attention + FFN + norm + residual -> Transformer architecture

How to Study in 2 Hours

Time	Task	Output
10 min	Read the TL;DR and the Q/K/V diagram	Understand attention as content-based routing between tokens
30 min	Read `document.md` sections 1-5	Understand the formula, shapes, masks, and self-attention
25 min	Read the PyTorch implementation section	Know how to check shape, dtype, device, and mask
25 min	Read the production trade-offs	Know the risks: `O(n^2)`, memory, long context, FlashAttention, KV cache
25 min	Do `exercise.md` and run the demo script	Have test output and simple attention weights
5 min	Answer the checklist yourself	Know which parts to review before Day 14

TL;DR

Attention is the mechanism by which each token selects the important information from other tokens.

current token produces a Query -> "what do I need?"
other tokens have a Key        -> "what signal lets me be found?"
other tokens have a Value      -> "what content will I pass along?"

score  = Query dot Key
weight = softmax(score / sqrt(head_dim))
output = weighted sum(Value)

With self-attention, Query/Key/Value all come from the same sequence. Attention for all tokens can be computed with matrix multiplication, so training on a GPU parallelizes better than an RNN. The price is an attention matrix of size seq_len x seq_len, which makes compute and memory grow as O(n^2).

Quick Diagram

Input embeddings X: [batch, seq_len, embed_dim]
        |
        | learned linear projections
        v
 Q = XWq     K = XWk     V = XWv
        \       |       /
         \      |      /
          Q @ K^T / sqrt(d_k)
                 |
              mask (if any)
                 |
              softmax
                 |
              @ V
                 |
 Contextual embeddings: [batch, seq_len, embed_dim]

Mental Model for Senior SEs

Attention concept	Backend analogy	Key point
Query	A request looking for context	Each token produces its own request vector
Key	Search index/signature	Tokens whose Key matches the Query well get more attention
Value	Payload/record content	What actually gets aggregated into the output
Score	Ranking score	The dot product measures compatibility
Softmax	Priority normalization	Turns scores into non-negative weights that sum to 1
Mask	Filter/permission/window	Blocks PAD or future tokens
Head	An independent view	Multiple heads learn multiple kinds of relationships

Best Solution by Context

Context	Recommended choice	Reason
Learning the fundamentals	Implement scaled dot-product attention yourself in PyTorch	Makes Q/K/V, shapes, and masks visible
Production training/inference	Use an optimized implementation from PyTorch/Hugging Face/the serving runtime	Lowers bug risk and uses optimized kernels
Text classification	Bidirectional encoder attention	Each token can see the whole input
Autoregressive generation	Causal self-attention	No access to future tokens, so no leakage
Long-context RAG	Retrieve/chunk first; do not stuff everything into the prompt	Attention is `O(n^2)` and the KV cache consumes VRAM
Modern GPU, long sequences	SDPA/FlashAttention-style kernel if the runtime supports it	Avoids materializing the full attention matrix in memory

Can It Be Used in Production?

Yes. Attention is the core primitive of the Transformer and is widely used in production. There are two levels to distinguish, though:

Using an attention-based model in production: yes, provided you have evals, a serving runtime, monitoring, a token budget, and rollback.
Writing your own attention kernel/module for production: only with a clear reason such as custom research, a custom mask/window, or a small internal model; otherwise use an implementation that is already optimized and tested.

Minimum requirements:

A clear shape contract: batch, seq_len, embed_dim, num_heads, head_dim.
Tokenizer, padding policy, truncation policy, and attention mask are versioned together with the model.
Separate tests for the padding mask and the causal mask to prevent data leakage.
A sequence length limit derived from the SLA, VRAM/RAM, and p95/p99 latency.
Monitoring of token length distribution, OOM, timeouts, truncation rate, latency, and error types.
For decoder-only inference, use a KV cache/suitable runtime instead of recomputing the whole prefix for every token.
A fallback when the context is too long: reject, summarize, re-retrieve, or degrade the model.

End-of-Day Deliverables

You should have:

Notes describing Q/K/V and the attention formula in your own words.
One run of attention_demo.py in the lesson folder.
At least 3 tests or assertions covering shape, mask, and dropout train/eval.
A short note answering: if the sequence length grows 4x, how many times does attention memory grow, and how does that affect production?

Completion Checklist

Document

1. What Is Attention?

Attention is a content-based routing mechanism between tokens. Instead of compressing the whole sentence into a single sequentially updated hidden state like a traditional RNN, attention lets each token ask: "which tokens in this sequence matter to me?"

Example sentence:

"The customer wants a refund because the order is broken"

When processing the token "refund", the model may attend strongly to "customer", "order", and "broken". When processing the token "broken", it may attend to "order". This is how a token picks up context from other tokens.

Basic flow:

token embedding
  -> produce Query, Key, Value
  -> match the Query against the Key of every token
  -> softmax produces attention weights
  -> weighted sum of the Values
  -> contextual embedding

2. Query, Key, Value Step By Step

Assume the input embedding is X:

X: [batch, seq_len, embed_dim]

The model learns 3 projections:

Q = X Wq
K = X Wk
V = X Wv

Meaning:

Query: what kind of information the current token needs.
Key: what features this token exposes so that other tokens can find it.
Value: the content that is passed along if this token is attended to.

Key point: Q/K/V are not a key-value dictionary hand-written by an engineer. They are tensors produced by weights learned during training.

Intuitive example:

Token "refund"
Query: needs the cause and the related entity

Token "broken"
Key: signal of a product defect
Value: the information "is broken"

If Query("refund") matches Key("broken") strongly, more of Value("broken") flows into the output of "refund".

3. Scaled Dot-Product Attention

Formula:

Attention(Q, K, V) = softmax((Q K^T) / sqrt(d_k)) V

Common shapes in multi-head attention:

Q: [batch, heads, target_len, head_dim]
K: [batch, heads, source_len, head_dim]
V: [batch, heads, source_len, head_dim]

Q K^T:           [batch, heads, target_len, source_len]
attention weight:[batch, heads, target_len, source_len]
output:          [batch, heads, target_len, head_dim]

For self-attention, target_len == source_len == seq_len.
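
The formula and shapes above can be traced on a toy example. The sketch below is a minimal pure-Python version for a single head and a single sequence (no batch dimension, no mask, no dropout). The function names and the toy vectors are illustrative assumptions and are not part of the lesson's PyTorch module.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def scaled_dot_product_attention(q, k, v):
    """Single-head attention for one sequence.

    q: [target_len][head_dim]; k and v: [source_len][head_dim].
    Returns (output, weights) shaped
    [target_len][head_dim] and [target_len][source_len].
    """
    head_dim = len(q[0])
    scale = math.sqrt(head_dim)
    weights = []
    output = []
    for query in q:
        # One score per key: dot(query, key) / sqrt(head_dim).
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in k]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of the value vectors, one output dimension at a time.
        output.append([
            sum(w * value[d] for w, value in zip(row, v))
            for d in range(len(v[0]))
        ])
    return output, weights


# Toy data (made up): 2 queries, 3 keys/values, head_dim = 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = scaled_dot_product_attention(q, k, v)
print(len(weights), len(weights[0]))  # 2 3  (target_len, source_len)
print(len(output), len(output[0]))    # 2 2  (target_len, head_dim)
```

Each row of `weights` sums to 1, and the first query puts equal weight on the first and third keys because both have the same dot product with it.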

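Why Divide by `sqrt(d_k)`?

The magnitude of a dot product grows with the vector dimension: if the components of q and k are independent with mean 0 and variance 1, q · k has variance d_k, so its typical size is about sqrt(d_k). If scores get too large, softmax saturates:

very large scores -> softmax close to one-hot -> weak gradients -> unstable training

Dividing by sqrt(d_k) (here d_k = head_dim) keeps the scores at a stable scale.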
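
The effect of the scaling can be checked empirically. The sketch below is an illustration under an assumption that real models only approximate: query and key components are drawn independently from a standard normal distribution. The helper name `dot_product_std` is hypothetical.

```python
import math
import random
import statistics

random.seed(0)


def dot_product_std(dim, samples=1000, scaled=False):
    """Empirical standard deviation of q . k for random q, k with N(0, 1) components."""
    values = []
    for _ in range(samples):
        q = [random.gauss(0.0, 1.0) for _ in range(dim)]
        k = [random.gauss(0.0, 1.0) for _ in range(dim)]
        score = sum(a * b for a, b in zip(q, k))
        values.append(score / math.sqrt(dim) if scaled else score)
    return statistics.pstdev(values)


for dim in (16, 64, 256):
    raw = dot_product_std(dim)
    scaled = dot_product_std(dim, scaled=True)
    print(f"dim={dim:3d} raw_std={raw:6.2f} scaled_std={scaled:4.2f}")
```

The raw standard deviation should come out near sqrt(dim) (about 4, 8, and 16), while the scaled one stays near 1 for every dimension.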

4. Softmax Turns Scores into Weights

Suppose token i has these scores against the other tokens:

[2.0, 1.0, -1.0]

After softmax, the weights are approximately:

[0.705, 0.259, 0.035]

The output for token i is:

0.705 * Value(token_1) + 0.259 * Value(token_2) + 0.035 * Value(token_3)
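
The weights for the scores [2.0, 1.0, -1.0] can be recomputed with a few lines of standard-library Python. This is a small sketch; the 2-dimensional Value vectors are made-up numbers used only to show the weighted sum.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


weights = softmax([2.0, 1.0, -1.0])
print([round(w, 3) for w in weights])  # [0.705, 0.259, 0.035]

# Hypothetical 2-dimensional Value vectors for token_1..token_3.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(2)]
print(output)
```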

Attention weights are useful for debugging but should not be treated as complete explainability for an audit. The model still has many layers, residual connections, feed-forward networks, and nonlinear transformations after them.

5. Masks: Padding Mask and Causal Mask

Padding Mask

Day 12 explained that batches usually need padding:

["I", "like", "AI", "<pad>", "<pad>"]

The model should not attend to <pad> because it is not real content.

The mask typically uses True for valid tokens:

[True, True, True, False, False]

A manual implementation with masked_fill typically turns this into:

scores = scores.masked_fill(~allowed_mask, very_negative_value)

After softmax, the masked positions have a weight of effectively 0.
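
The effect of a padding mask on one row of scores can be shown without PyTorch. The sketch below fills masked positions with -inf as a simplification, rather than a very negative finite value; the scores are made-up numbers and `masked_softmax` is a hypothetical helper.

```python
import math


def masked_softmax(scores, allowed):
    """Softmax over `scores` where positions with allowed=False get weight 0.

    Uses -inf as the fill value; math.exp(-inf) is exactly 0.0 in Python.
    """
    if not any(allowed):
        raise ValueError("at least one position must be allowed")
    filled = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(filled)
    exps = [math.exp(s - peak) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]


# Scores of one query against 3 real tokens followed by 2 <pad> tokens.
scores = [1.2, 0.3, 2.0, 5.0, 5.0]
allowed = [True, True, True, False, False]
weights = masked_softmax(scores, allowed)
print([round(w, 3) for w in weights])
```

Even though the PAD positions carry the highest raw scores, their weights are 0 and the remaining weights still sum to 1.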

Causal Mask

A causal mask is used for decoder-only language models:

The token at position i may only see tokens at positions <= i

Causal matrix for sequence length 5:

      key position
        0 1 2 3 4
q=0     1 0 0 0 0
q=1     1 1 0 0 0
q=2     1 1 1 0 0
q=3     1 1 1 1 0
q=4     1 1 1 1 1

If you train a language model and forget the causal mask, the current token can see future tokens. This is serious data leakage: the model learns a shortcut that does not exist at inference.
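
The causal matrix for sequence length 5 can be generated, and combined with a padding mask, in a few lines. This is a pure-Python sketch; the padding pattern is an assumed example.

```python
def causal_mask(seq_len):
    """Row q, column k is True when query position q may attend to key position k."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


mask = causal_mask(5)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 0 0 0 0
# 1 1 0 0 0
# 1 1 1 0 0
# 1 1 1 1 0
# 1 1 1 1 1

# Combining with a padding mask: a key must be both non-future and non-PAD.
padding = [True, True, True, False, False]
combined = [[c and p for c, p in zip(row, padding)] for row in mask]
print(combined[4])  # [True, True, True, False, False]
```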

6. Self-Attention

Self-attention means Q, K, and V all come from the same sequence:

Input: [t1, t2, t3, t4]

t1 attends to [t1, t2, t3, t4]
t2 attends to [t1, t2, t3, t4]
t3 attends to [t1, t2, t3, t4]
t4 attends to [t1, t2, t3, t4]

For encoders such as BERT/PhoBERT, attention is usually bidirectional: a token can see both left and right, except PAD.

For decoders such as GPT/LLaMA/Qwen, attention is causal: a token only sees the past and itself.

7. Why Does Attention Parallelize Better Than an RNN?

An RNN processes the sequence step by step:

x1 -> h1 -> h2 -> h3 -> h4

To compute h4 you need h3; to compute h3 you need h2. Training is hard to parallelize along the sequence dimension.

Self-attention uses matrix multiplication:

Q @ K^T

GPUs handle matrix multiplication very well, so all tokens in a sequence can be computed at the same time during training.

Trade-off:

An RNN has sequential dependencies, but its memory use usually grows more slowly with sequence length.
Attention parallelizes better, but the attention matrix is seq_len x seq_len.

If seq_len grows from 1,024 to 4,096, the attention matrix grows by:

(4096 / 1024)^2 = 16x

8. Multi-Head Attention

A single attention head is only one view. Multi-head attention splits embed_dim into several parts:

embed_dim = num_heads * head_dim

Example:

embed_dim = 64
num_heads = 4
head_dim = 16

Procedure:

X
 -> Linear produces Q/K/V
 -> reshape to [batch, heads, seq_len, head_dim]
 -> each head computes attention independently
 -> concat heads
 -> output projection back to embed_dim
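
The reshape and concat steps are pure bookkeeping, which the sketch below shows for one sequence using nested lists. It leaves out the learned projections and the per-head attention. The helper names `split_heads` and `concat_heads` are hypothetical, and the numbers are placeholders.

```python
def split_heads(x, num_heads):
    """[seq_len][embed_dim] -> [num_heads][seq_len][head_dim]."""
    embed_dim = len(x[0])
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads
    return [
        [token[h * head_dim:(h + 1) * head_dim] for token in x]
        for h in range(num_heads)
    ]


def concat_heads(heads):
    """[num_heads][seq_len][head_dim] -> [seq_len][embed_dim]."""
    seq_len = len(heads[0])
    return [
        [value for head in heads for value in head[t]]
        for t in range(seq_len)
    ]


# One sequence of 3 tokens with embed_dim = 64, filled with placeholder numbers.
x = [[float(t * 64 + d) for d in range(64)] for t in range(3)]
heads = split_heads(x, num_heads=4)
print(len(heads), len(heads[0]), len(heads[0][0]))  # 4 3 16
assert concat_heads(heads) == x  # splitting then concatenating is lossless
```

Because the split is lossless, any extra capacity comes from each head running its own attention on its slice, followed by the output projection that mixes the heads again.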

Intuition:

head 1: subject-verb relationships
head 2: entity/reference
head 3: sentiment keywords
head 4: local phrase patterns

Do not over-interpret individual heads in production. This is only a mental model for why multiple heads add capacity.

9. Near-Production PyTorch Implementation

The code below uses PyTorch APIs that were checked via Context7: torch.matmul supports batched matrix multiplication over the trailing dimensions, a boolean mask must be broadcastable to the attention weights, and nn.Dropout is disabled automatically in eval() mode. For real production PyTorch code, consider torch.nn.functional.scaled_dot_product_attention or an already optimized module/runtime before writing manual attention.

from __future__ import annotations

import math
import torch
from torch import nn


def causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device))


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, dropout_p: float = 0.1) -> None:
        super().__init__()
        if embed_dim <= 0 or num_heads <= 0:
            raise ValueError("embed_dim and num_heads must be positive")
        if embed_dim % num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout_p)

    def _normalize_mask(self, mask: torch.Tensor, batch: int, seq_len: int, device: torch.device) -> torch.Tensor:
        if mask.dtype != torch.bool:
            raise TypeError("attention mask must be boolean, where True means allowed")
        mask = mask.to(device=device)
        if mask.shape == (batch, seq_len):
            return mask[:, None, None, :]
        if mask.shape == (batch, 1, 1, seq_len):
            return mask
        if mask.shape == (batch, self.num_heads, seq_len, seq_len):
            return mask
        raise ValueError(f"unsupported mask shape: {tuple(mask.shape)}")

    def forward(
        self,
        x: torch.Tensor,
        attention_mask: torch.Tensor | None = None,
        causal: bool = False,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        if x.ndim != 3:
            raise ValueError("x must have shape [batch, seq_len, embed_dim]")
        batch, seq_len, embed_dim = x.shape
        if embed_dim != self.embed_dim:
            raise ValueError(f"expected embed_dim={self.embed_dim}, got {embed_dim}")

        qkv = self.qkv(x).view(batch, seq_len, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)

        allowed = None
        if attention_mask is not None:
            allowed = self._normalize_mask(attention_mask, batch, seq_len, x.device)
        if causal:
            c_mask = causal_mask(seq_len, x.device)[None, None, :, :]
            allowed = c_mask if allowed is None else allowed & c_mask
        if allowed is not None:
            if not allowed.any(dim=-1).all():
                raise ValueError("each query must be allowed to attend to at least one key")
            scores = scores.masked_fill(~allowed, torch.finfo(scores.dtype).min)

        weights = torch.softmax(scores, dim=-1)
        weights = self.dropout(weights)
        context = torch.matmul(weights, v)
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, embed_dim)
        return self.out_proj(context), weights

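Points to review in the code:

embed_dim % num_heads == 0, so the heads split evenly.
The mask uses True = allowed; only ~allowed positions are filled with a very negative value.
Every mask is moved to or created on the same device as the input.
torch.finfo(scores.dtype).min avoids a hard-coded -1e9, which does not suit every dtype (for example, it is outside the FP16 range).
nn.Dropout is applied to the attention weights and behaves differently under train() and eval().
If a query is not allowed to attend to any key, the code raises an error instead of letting softmax return NaN (with a -inf fill) or uniform weights over masked keys (with the finite fill used here).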
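
The train/eval difference of dropout on attention weights can be illustrated with a standard-library sketch of inverted dropout, the scheme in which kept elements are rescaled by 1 / (1 - p). The function below is a hypothetical stand-in written for this illustration. It is not the nn.Dropout implementation, and the weights are made-up numbers.

```python
import random


def dropout(weights, p, training, rng):
    """Inverted dropout: zero each element with probability p, rescale the rest.

    Kept elements are divided by (1 - p) so the expected value is unchanged.
    With training=False the input passes through untouched.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(weights)
    keep = 1.0 - p
    return [w / keep if rng.random() < keep else 0.0 for w in weights]


weights = [0.5, 0.3, 0.2]
rng = random.Random(0)
print(dropout(weights, p=0.1, training=True, rng=rng))   # entries are 0.0 or w / 0.9
print(dropout(weights, p=0.1, training=False, rng=rng))  # [0.5, 0.3, 0.2]
```

After dropout in training mode the row no longer has to sum to 1, which is one reason train() and eval() outputs differ for the same input.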

10. Small Tests You Need

At minimum, test that:

The output shape is [batch, seq_len, embed_dim].
The attention weight shape is [batch, heads, seq_len, seq_len].
The padding mask makes the weights on PAD keys approximately 0.
The causal mask makes the weights on future positions approximately 0.
eval() disables dropout, so repeated forward passes on the same input give the same output, unlike train().
A mask with the wrong dtype or shape fails fast.

The lesson folder contains the script attention_demo.py, which runs these tests.

11. Trade-Offs and Performance

Compute

Attention compute is approximately:

O(batch * heads * seq_len^2 * head_dim)

Memory

If the attention weights are materialized:

batch * heads * seq_len * seq_len * bytes_per_element

Approximate examples:

Config	Attention weight memory for 1 layer
`batch=8`, `heads=12`, `seq_len=512`, FP16	about 50 MB
`batch=1`, `heads=32`, `seq_len=4096`, FP16	about 1 GB
`batch=1`, `heads=32`, `seq_len=8192`, FP16	about 4 GB

These numbers only cover fully materialized attention weights; they exclude Q/K/V, other activations, the FFN, optimizer state, and the KV cache.
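
The table values follow directly from the memory formula. The small estimator below reproduces them, assuming 2 bytes per element for FP16; the function name is hypothetical.

```python
def attention_weight_bytes(batch, heads, seq_len, bytes_per_element=2):
    """Bytes needed to materialize attention weights for one layer (FP16 = 2 bytes)."""
    return batch * heads * seq_len * seq_len * bytes_per_element


configs = [(8, 12, 512), (1, 32, 4096), (1, 32, 8192)]
for batch, heads, seq_len in configs:
    size = attention_weight_bytes(batch, heads, seq_len)
    print(f"batch={batch} heads={heads} seq_len={seq_len}: {size / 2**20:.0f} MiB")
# batch=8 heads=12 seq_len=512: 48 MiB
# batch=1 heads=32 seq_len=4096: 1024 MiB
# batch=1 heads=32 seq_len=8192: 4096 MiB

# Quadratic growth: 4x the sequence length costs 16x the memory.
ratio = attention_weight_bytes(1, 32, 4096) / attention_weight_bytes(1, 32, 1024)
print(ratio)  # 16.0
```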

FlashAttention Concept

A FlashAttention-style kernel does not change the attention formula. The idea is to compute attention block by block and manage memory so that the full attention matrix never has to be materialized in HBM the way a naive implementation does. Expected results:

Lower memory footprint.
Higher throughput on suitable GPUs.
The same output logic at the attention level.

In modern PyTorch, scaled_dot_product_attention can dispatch to an optimized backend depending on device, dtype, shape, and configuration. For production, prefer a runtime that already uses optimized kernels over writing your own Python attention loop.

12. Batch Vs Streaming

Batch Training/Batch Inference

Batching makes better use of the GPU:

[batch, seq_len, embed_dim]

Trade-off:

Heavy padding wastes compute.
With naive padding, the longest request in a batch stretches the whole batch to its length.
Dynamic batching needs a timeout and a queue policy.
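
The cost of naive padding can be quantified with a short calculation. The sketch below uses hypothetical request lengths and compares one mixed batch with a simple length-bucketing split. Bucketing is shown as one possible mitigation, not as the lesson's prescribed policy.

```python
def padding_waste(lengths):
    """Fraction of token slots that are PAD when padding to the longest request."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    padded_slots = max(lengths) * len(lengths)
    real_tokens = sum(lengths)
    return 1.0 - real_tokens / padded_slots


# Hypothetical token lengths of requests that arrive together.
mixed = [12, 15, 20, 480]
print(f"one mixed batch: {padding_waste(mixed):.0%} padding")  # 73% padding

# Length bucketing: group similar lengths so the long request is isolated.
short, long_ = [12, 15, 20], [480]
print(f"short bucket: {padding_waste(short):.0%}, long bucket: {padding_waste(long_):.0%}")
# short bucket: 22%, long bucket: 0%
```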

Streaming Decoder Inference

A decoder-only model generates one token at a time:

prefix -> next token -> append -> next token

Recomputing K/V for the whole prefix at every step is very slow. A KV cache stores the Keys/Values of tokens already processed so that each step only processes the new token. Trade-off:

Latency per token drops.
VRAM grows with 2 * batch * layers * heads * seq_len * head_dim * bytes_per_element (the factor 2 covers K and V).
Multi-user streaming needs cache eviction and request cancellation handling.
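
A rough KV cache estimate helps when setting a token budget. The sketch below counts both Keys and Values, which is where the factor 2 comes from, and assumes FP16 storage. The config values are hypothetical and not taken from any specific model, and `kv_heads` is a parameter name introduced here.

```python
def kv_cache_bytes(batch, layers, kv_heads, seq_len, head_dim, bytes_per_element=2):
    """Bytes for cached Keys and Values; the leading 2 covers K plus V."""
    return 2 * batch * layers * kv_heads * seq_len * head_dim * bytes_per_element


# Hypothetical decoder config, not taken from any specific model.
size = kv_cache_bytes(batch=1, layers=32, kv_heads=32, seq_len=4096, head_dim=128)
print(f"{size / 2**30:.1f} GiB per sequence")  # 2.0 GiB per sequence

# The cache grows linearly with both batch size and sequence length.
print(kv_cache_bytes(8, 32, 32, 4096, 128) / size)  # 8.0
```

Unlike the attention matrix, the cache grows linearly with seq_len, but it must stay resident for every active request for the whole generation.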

13. Production Concerns

Mask bugs: a wrong padding mask makes the model learn from PAD; a wrong causal mask causes data leakage.
Long context: long prompts can cause OOM or push p99 latency past the SLA.
Silent truncation: it can cut off important instructions or evidence.
Dtype mismatch: FP16/BF16/FP32 differ in memory, speed, and numerical stability.
Device mismatch: input on CPU but model on GPU, or mask on CPU but tensor on GPU.
Attention weight logging: it can leak PII if token text is logged; redaction is required.
Model upgrades: changing the tokenizer/context length/config is a breaking change unless covered by a golden eval.
Interpretability: attention charts only support debugging; they are not sufficient evidence to explain high-risk decisions.

14. Practical Guidance

Use case	Attention mode	Production tips
Vietnamese ticket classification	Bidirectional encoder	PhoBERT/BERT-style, dynamic padding, evaluation that accounts for class imbalance
Reranking in RAG	Cross-attention or encoder pair input	Limit the rerank top-k, benchmark latency
Chatbot/assistant	Causal decoder	KV cache, token budget, guardrails, streaming
Long document QA	Retrieval first, attention second	Chunking, hybrid search, reranking, citation
Code assistant	Long-context causal decoder	Context packing, file selection, truncation audit

15. References

Attention Is All You Need: https://arxiv.org/abs/1706.03762
The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
The Annotated Transformer: https://nlp.seas.harvard.edu/annotated-transformer/
PyTorch docs via Context7: /pytorch/pytorch, /websites/pytorch_2_11

Exercises

Practice Goals

You will run a small multi-head self-attention module in PyTorch to check:

The shapes of the output and the attention weights.
The padding mask blocks PAD keys.
The causal mask blocks future tokens.
Dropout behaves differently under train() and eval().
Dtype/device are handled consistently.

Setup

From the repo root:

python3 lessions/day-13-attention-mechanism/attention_demo.py

If PyTorch is not installed yet:

python3 -m pip install torch

Exercise 1: Read the Output Shapes

Run the script and record:

output.shape
weights.shape
device
dtype

Explain why:

input:   [batch, seq_len, embed_dim]
weights: [batch, heads, seq_len, seq_len]
output:  [batch, seq_len, embed_dim]

Exercise 2: Padding Mask

In the script, find the variable padding_mask.

padding_mask = torch.tensor(
    [
        [True, True, True, False, False],
        [True, True, True, True, True],
    ],
    dtype=torch.bool,
    device=device,
)

Tasks:

Explain why batch 0 has two PAD tokens.
Check that the attention weights pointing at key positions 3 and 4 of batch 0 are approximately 0.
Change the mask to all True and observe which test becomes meaningless.

Exercise 3: Causal Mask

The causal mask blocks future tokens:

query position 1 must not attend to key positions 2, 3, 4

Tasks:

Print the attention weights of head 0.
Point out the positions above the main diagonal.
Explain why those positions must be 0 in a decoder-only model.

Exercise 4: Dropout Train/Eval

The script includes this test:

model.train() -> dropout active
model.eval()  -> dropout disabled

Tasks:

Run it several times and record the output.
Explain why eval() is required at inference.
If you forget eval() when serving a model that has dropout, what is the production risk?

Exercise 5: Modify the Code to Use PyTorch SDPA

Not required for Day 13, but worth trying once you understand the manual implementation.

Tasks:

Create an experimental branch in your personal notes; you do not need to modify the lesson files.
Replace the scores -> softmax -> dropout -> matmul part with torch.nn.functional.scaled_dot_product_attention.
Keep the shape [batch, heads, seq_len, head_dim].
Compare the output shape and mask behavior.

Important hint: a boolean mask in PyTorch SDPA uses True to let an element take part in attention, which matches the True = allowed convention of the manual code in this lesson; the manual code applies it as masked_fill(~allowed, very_negative_value).

Quiz

How do Query, Key, and Value differ?
Why must the attention score be divided by sqrt(head_dim)?
How do the padding mask and the causal mask solve two different problems?
Why is forgetting the causal mask in a language model a form of data leakage?
How does multi-head attention increase capacity?
Why does attention parallelize better than an RNN during training?
If the sequence length grows from 2,048 to 8,192, by what factor does the attention matrix grow?
What bottleneck does the FlashAttention concept address?
Why are attention weights not sufficient evidence for explainability?
What conditions are required to use a manual self-attention implementation in production?

Suggested Answers

The Query represents the need for context, the Key represents the signal to be matched against, and the Value is the content that gets aggregated.
To keep the score scale stable and avoid a saturated softmax and weak gradients.
The padding mask blocks PAD tokens; the causal mask blocks future tokens.
Because the current token can see the future answer during training, but that information is not available at inference.
Multiple heads learn different projections/views of relationships, which are then concatenated and projected back to embed_dim.
Attention uses matrix multiplication over the whole sequence, which a GPU parallelizes better than the sequential dependency of an RNN.
(8192 / 2048)^2 = 16x.
It reduces memory traffic/materialization of the large attention matrix and raises throughput on suitable GPUs.
Because the final output still passes through many layers, FFNs, residuals, and nonlinear transformations, and attention may not correspond directly to a causal explanation.
Mask tests, a sequence length limit, correct use of eval(), consistent dtype/device, an optimized runtime, latency/OOM monitoring, and rollback.

Submission Checklist

attention_demo.py runs.
The shapes of the output and the weights are recorded.
Show that the padding mask gives PAD keys a weight of 0.
Show that the causal mask makes future attention 0.
Explain how dropout differs between train() and eval().
Answer the quiz in your own words.
Write a short paragraph: "Can attention be used in production? If so, under what conditions?"
