Introduction

I aim to summarize the Attention Is All You Need paper after reading various blog posts and watching paper-summary videos.

Machine Language Translation (MLT) Task

Let's start off by taking the example of the Machine Language Translation (MLT) task. Traditionally, the approach has been to use RNNs (LSTMs) for this task.

Source-Sentence -> Encoder -> Embeddings -> Decoder -> Dest-Sentence

The source sentence is passed through an RNN encoder, whose last hidden state is taken as the embedding and then passed to a decoder, which finally outputs the translated sentence.
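A minimal sketch of this pipeline, assuming made-up vocabulary sizes and dimensions and using PyTorch's GRU as the RNN (this is an illustration of the idea, not the setup from any specific system):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence and returns its last hidden state as the 'embedding'."""
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):                  # src_tokens: (batch, src_len)
        _, last_hidden = self.rnn(self.embed(src_tokens))
        return last_hidden                          # (1, batch, hidden_dim)

class Decoder(nn.Module):
    """Generates the target sentence, seeded with the encoder's last hidden state."""
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_tokens, encoder_state):   # tgt_tokens: (batch, tgt_len)
        outputs, _ = self.rnn(self.embed(tgt_tokens), encoder_state)
        return self.out(outputs)                    # per-step scores over the vocabulary

src = torch.randint(0, 1000, (2, 7))                # 2 source sentences of length 7
tgt = torch.randint(0, 1000, (2, 5))                # their target prefixes
logits = Decoder()(tgt, Encoder()(src))             # (2, 5, 1000)
```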

Attention Mechanism and Issues with RNNs

The attention mechanism was introduced to improve the performance of RNNs.

Issues with RNNs were:

  1. Computation is sequential (one token at a time), so it cannot be parallelized across the sequence.
  2. The whole source sentence has to be squeezed into a single fixed-size hidden state, which becomes a bottleneck for long sentences.
  3. Long-range dependencies are hard to capture, since information has to survive many recurrent steps.

How Does Attention Help?

The basic idea is that, at each step, the decoder emits a set of vectors k1, k2, ..., kn that index into the encoder's hidden states via a softmax, so the decoder can look back at any part of the source sentence instead of relying on a single fixed embedding.
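A numerical sketch of that soft "indexing" (the sizes and variable names here are made up for illustration): a vector emitted by the decoder scores every encoder hidden state, and the softmax turns those scores into weights for a weighted lookup.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden_states = np.random.randn(6, 128)  # 6 encoder hidden states (one per source word)
k = np.random.randn(128)                 # one vector emitted by the decoder at this step

scores = hidden_states @ k               # one score per hidden state
weights = softmax(scores)                # sums to 1: a soft "index" over the states
context = weights @ hidden_states        # weighted mix of the hidden states
```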

Transformer Architecture

The paper proposes the Transformer architecture, which has two components:

  1. Encoder
  2. Decoder

The source sentence goes into the encoder's Inputs, and the part of the sentence translated so far goes into the decoder's Outputs.
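To make the two streams concrete, here is a minimal sketch using PyTorch's built-in nn.Transformer module (the shapes are illustrative, and the token embeddings and positional encodings that would normally come first are omitted): the source sequence feeds the encoder, and the target prefix produced so far feeds the decoder.

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 2, 512)  # "Inputs": 10 source tokens, batch of 2, already embedded
tgt = torch.rand(4, 2, 512)   # "Outputs": the 4 target tokens produced so far

out = model(src, tgt)         # (4, 2, 512): one vector per target position
```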

Components of the Transformer:

  1. Multi-Head Attention blocks
  2. Position-wise Feed-Forward networks
  3. Positional Encodings added to the input embeddings
  4. Residual connections and layer normalization around each sub-layer
  5. A final Linear + Softmax layer that turns the decoder output into word probabilities

Attention Blocks

There are a total of 3 attention blocks in the model:

  1. Self-attention in the encoder, over the source sentence.
  2. Masked self-attention in the decoder, over the target tokens produced so far.
  3. Encoder-decoder attention in the decoder, which connects the two stacks.

In the encoder-decoder attention block, the 3 connections are:

  1. Keys (K) - come from the encoder's output over the source sentence
  2. Values (V) - also come from the encoder's output over the source sentence
  3. Queries (Q) - come from the decoder's representation of the target sentence so far

Q and K are compared with a dot product. In high dimensions, two random vectors are almost always close to orthogonal (near 90 degrees), so their dot product is close to zero. But if the vectors are aligned (pointing in the same direction), their dot product is large. The dot product is essentially a measure of how well two directions agree: it is proportional to the cosine of the angle between the vectors, scaled by their lengths.
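A quick numerical check of that claim (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal(512)
k_random = rng.standard_normal(512)   # an unrelated direction
k_aligned = 2.0 * q                   # same direction, different length

print(q @ k_random)    # small relative to the norms: nearly orthogonal
print(q @ k_aligned)   # large and positive: the directions agree
```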

We have a bunch of keys and each key has an associated value.

softmax(<K|Q>) acts as a kind of indexing scheme to pick out the appropriate values. In the paper this becomes scaled dot-product attention: Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V.

Q - what I would like to know.
K - the indexes.
V - the attributes (content) stored under each index.
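A minimal numpy sketch of scaled dot-product attention (the numbers of queries, keys, and dimensions are made up):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # how well each query matches each key
    weights = softmax(scores, axis=-1) # each row sums to 1: the "indexing scheme"
    return weights @ V                 # weighted sum of the values

Q = np.random.randn(4, 64)             # 4 queries
K = np.random.randn(6, 64)             # 6 keys ...
V = np.random.randn(6, 64)             # ... each with an associated value
out = scaled_dot_product_attention(Q, K, V)   # (4, 64)
```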

Intuition

High Level Summary

Understanding Self-Attention

Step 1

Encoder's input vectors - a vector for each word.

For each word, we create a:

  1. Query vector
  2. Key vector
  3. Value vector

These vectors are created by multiplying the embedding by three matrices that are learned during training: W(q), W(k), and W(v).
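A sketch of that projection for a single word (the paper uses 512-dimensional embeddings projected down to 64 dimensions; the random matrices here stand in for the learned ones):

```python
import numpy as np

d_model, d_k = 512, 64
x = np.random.randn(d_model)          # embedding of one word

W_q = np.random.randn(d_model, d_k)   # learned in practice; random here
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

q, k, v = x @ W_q, x @ W_k, x @ W_v   # the word's query, key, and value vectors
```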

Step 2

If there are n words in a sentence with keys (k1, k2, k3, ..., kn), then the score for word 1 against every word is calculated as follows:

-> q1 * k1, q1 * k2, q1 * k3, ..., q1 * kn
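Continuing the sketch from Step 1 (dimensions again illustrative): the scores are softmaxed and used to weight the value vectors, giving word 1's self-attention output.

```python
import numpy as np

n, d_k = 5, 64
q1 = np.random.randn(d_k)            # query for word 1 (from Step 1)
keys = np.random.randn(n, d_k)       # k1 ... kn, one key per word
values = np.random.randn(n, d_k)     # v1 ... vn, one value per word

scores = keys @ q1                   # q1·k1, q1·k2, ..., q1·kn
scores = scores / np.sqrt(d_k)       # scale by sqrt(d_k), as in the paper
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()    # softmax over the n words

z1 = weights @ values                # self-attention output for word 1
```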

Matrix Calculation of Self-Attention
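In matrix form, all words are handled at once: stack the word embeddings into a matrix X, project to Q, K, and V, and apply the same formula. A sketch with illustrative sizes:

```python
import numpy as np

n, d_model, d_k = 5, 512, 64
X = np.random.randn(n, d_model)                  # one embedding per word

W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v              # (n, d_k) each

scores = Q @ K.T / np.sqrt(d_k)                  # (n, n): every word scores every word
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
Z = weights @ V                                  # (n, d_k): output for all words at once
```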

Understanding Multi-Headed-Attention

Multi-headed means we have multiple sets of Q/K/V weight matrices; the Transformer uses 8 sets (heads) in each encoder/decoder layer.

The diagram below summarizes the complete multi-headed attention process.
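As a code companion to the diagram, here is a numpy sketch of 8-headed attention (following the paper's 512-dimensional model split into 8 heads of 64 dimensions; the weights are random stand-ins for learned ones): each head runs its own attention, the head outputs are concatenated, and a final matrix W_o projects back to the model dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d_model, h = 5, 512, 8
d_k = d_model // h                                # 64 dimensions per head
X = np.random.randn(n, d_model)

heads = []
for _ in range(h):                                # one Q/K/V weight set per head
    W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    heads.append(weights @ V)                     # (n, d_k)

W_o = np.random.randn(h * d_k, d_model)
Z = np.concatenate(heads, axis=-1) @ W_o          # (n, d_model)
```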

Representing The Order of the Sequence Using Positional Encoding
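Since attention itself is order-agnostic, the paper adds fixed sinusoidal positional encodings to the input embeddings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A short sketch (sentence length chosen arbitrarily):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

embeddings = np.random.randn(10, 512)              # 10 words, d_model = 512
inputs = embeddings + positional_encoding(10, 512) # order information added
```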

Other Aspects

References

The first half of this blog post consists of notes from Yannic's video explanation, and the remaining parts are taken from The Illustrated Transformer blog.

  1. Video explanation by Yannic Kilcher: YouTube Link
  2. The Illustrated Transformer by Jay Alammar: Blog
  3. Attention Is All You Need paper.
  4. The Annotated Transformer by harvardnlp is also a great resource: an "annotated" version of the paper in the form of a line-by-line code implementation.