References for Learning Transformers
A personal note on capturing the basic ideas and the scope of the Transformers family
Overview
The Transformer proposed in “Attention Is All You Need” is an encoder/decoder model for sequence-to-sequence problems. The core idea of the Transformer is a set of attention mechanisms designed to capture the relationships among the components within one sequence or between two sequences. Taking machine translation as an example, the Transformer receives a sentence in English as input and produces a sentence in French as output. When training the Transformer on an English/French pair,
- the multi-head self-attention layer in the encoder captures the words and their contexts within the English sentence,
- the multi-head self-attention layer in the decoder captures the words and their contexts within the French sentence,
- and the encoder-decoder attention layer captures the relationships between English and French.
After training, the Transformer is able to handle both languages and perform the translation. Because of the generality of these attention mechanisms, the Transformer can be applied to other natural language processing tasks such as question answering and document summarization. Since the debut of the Transformer, many variants have appeared, covering enhancements to the encoder/decoder/attention architecture and extensions to other domains such as image, audio, and video.
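The two flavors of attention above differ only in where the queries, keys, and values come from. A minimal NumPy sketch of the standard scaled dot-product attention (an illustration, not the paper's full multi-head implementation; the shapes and random inputs are my own assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarities
    weights = softmax(scores, axis=-1)   # each query's attention distribution
    return weights @ V, weights

rng = np.random.default_rng(0)

# Self-attention: Q, K, V all come from the same sequence
# (e.g. the English sentence in the encoder).
X = rng.normal(size=(4, 8))              # 4 tokens, model dimension 8
self_out, self_w = scaled_dot_product_attention(X, X, X)

# Encoder-decoder attention: queries come from the decoder (French side),
# keys and values from the encoder output (English side).
dec = rng.normal(size=(3, 8))            # 3 decoder tokens
cross_out, cross_w = scaled_dot_product_attention(dec, X, X)

print(self_out.shape, cross_w.shape)     # (4, 8) (3, 4)
```

Each row of the attention-weight matrix sums to 1, so every output token is a weighted average of the value vectors; the real model runs several such attentions in parallel (multi-head) and adds learned projections.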
References
There are many articles on Transformers. My personal favorites are listed below.
1. Visual Guide to Transformer Neural Networks
Part 1. Encoder/Decoder, Word Embedding, Position Embedding
Part 2. Self-Attention, Query/Key/Value, Multi-Head Attention
Part 3. Residual Connection, Layer Normalization, Masked Attention
These videos explain the encoder/decoder/attention architecture of the Transformer step by step, from input to output. (See links below.)
2. Tutorial and Survey of Transformers
The tutorial and the survey cover the basic ideas behind Transformers and some of their variants.
3. Hugging Face Transformers Library
- What Is It and How to Use It (BERT and GPT-2)
- Jupyter Notebook (BERT)
- Tutorials (BERT fine-tuning, ViT/DETR inference and fine-tuning)