References for Learning Transformers
A personal note on capturing the basic ideas and the scope of the Transformers family
Overview
The Transformer proposed in “Attention Is All You Need” is an encoder/decoder model for sequence-to-sequence problems. The core idea of the Transformer is a set of attention mechanisms designed to capture the relationships among the components within one sequence or between two sequences. Taking machine translation as an example, the Transformer receives a sentence in English as input and produces a sentence in French as output. When training the Transformer on an English/French pair,
- the multi-head self-attention layer in the encoder captures the words and their contexts within the English sentence,
- the multi-head self-attention layer in the decoder captures the words and their contexts within the French sentence,
- and the encoder-decoder attention layer captures the relationships between English and French.
After training, the Transformer is able to handle both languages and perform the translation. Because of the generality of these attention mechanisms, the Transformer can be applied to other natural language processing tasks such as question answering and document summarization. Since the debut of the Transformer, many variants have appeared, covering enhancements to the encoder/decoder/attention architecture and extensions to other domains such as image, audio, and video.
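The two flavors of attention above differ only in where the queries, keys, and values come from. A minimal NumPy sketch of the standard scaled dot-product attention (an illustration, not the paper's full multi-head implementation; the shapes and random inputs are my own assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarities
    weights = softmax(scores, axis=-1)   # each query's attention distribution
    return weights @ V, weights

rng = np.random.default_rng(0)

# Self-attention: Q, K, V all come from the same sequence
# (e.g. the English sentence in the encoder).
X = rng.normal(size=(4, 8))              # 4 tokens, model dimension 8
self_out, self_w = scaled_dot_product_attention(X, X, X)

# Encoder-decoder attention: queries come from the decoder (French side),
# keys and values from the encoder output (English side).
dec = rng.normal(size=(3, 8))            # 3 decoder tokens
cross_out, cross_w = scaled_dot_product_attention(dec, X, X)

print(self_out.shape, cross_w.shape)     # (4, 8) (3, 4)
```

Each row of the attention-weight matrix sums to 1, so every output token is a weighted average of the value vectors; the real model runs several such attentions in parallel (multi-head) and adds learned projections.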
References
There are many articles on Transformers. My personal favorites are listed below.
1. Visual Guide to Transformer Neural Networks
Part 1. Encoder/Decoder, Word Embedding, Position Embedding
Part 2. Self-Attention, Query/Key/Value, Multi-Head Attention
Part 3. Residual Connection, Layer Normalization, Masked Attention
These videos explain the encoder/decoder/attention architecture of the Transformer step by step, from input to output. (See links below.)
2. Tutorial and Survey of Transformers
The tutorial and the survey cover the basic ideas behind Transformers and some of their variants.
3. Hugging Face Transformers Library
- What Is It and How to Use It (BERT and GPT-2)
- Jupyter Notebook (BERT)
- Tutorials (BERT fine-tuning, ViT/DETR inference and fine-tuning)