(Roadmap — content to be written)
This chapter covers:
The attention mechanism: queries, keys, values
Scaled dot-product attention (a preview sketch appears at the end of this roadmap)
Multi-head attention
Positional encoding (also sketched at the end of this roadmap)
Why attention handles long-range dependencies better than convolution
Depends on: Chapter 8 (dot products), Chapter 11 (CNNs as comparison)
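Since the chapter text is still to be written, the following is only a minimal NumPy preview of scaled dot-product attention, assuming the chapter follows the standard Transformer formulation softmax(QK^T / sqrt(d_k)) V; the function name scaled_dot_product_attention and the toy shapes are illustrative, not the chapter's final code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns the attended values (n_queries, d_v) and the attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights

# Tiny worked example: 2 queries attending over 3 key/value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (2, 4): one output vector per query
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

Multi-head attention repeats this computation several times on learned linear projections of Q, K, and V and concatenates the per-head outputs; that wrapper is left to the chapter itself.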
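A similar preview for positional encoding, assuming the chapter uses the sinusoidal scheme from the original Transformer paper; the helper name sinusoidal_positional_encoding and the requirement that d_model be even are assumptions of this sketch.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Sinusoidal positional encoding (d_model assumed even):

    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    positions = np.arange(n_positions)[:, None]              # (n_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)   # (n_positions, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                             # sines on even dimensions
    pe[:, 1::2] = np.cos(angles)                             # cosines on odd dimensions
    return pe

# Each row is added to the token embedding at that position, giving the
# otherwise order-agnostic attention layers information about word order.
print(sinusoidal_positional_encoding(50, 16).shape)  # (50, 16)
```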