Chapter 14 — Vision Transformers - Signals to Transformers

(Roadmap — content to be written)

This chapter covers:

Patch embeddings: the image as a sequence
ViT architecture
Comparison with CNNs: inductive bias vs. data efficiency
Positional encoding for 2D patches

Depends on: Chapters 11, 13