- Meant to mimic cognitive attention
- Picks out relevant bits of information
- Which parts are relevant is learned via gradient descent
 
- Used in the 90s, e.g.:
- Multiplicative modules
- Sigma pi units
- Hyper-networks
 
- Draw from relevant state at any preceding point along sequence
- Attention layer accesses all previous states and weighs them according to a learned measure of relevance
- Allows referring arbitrarily far back to relevant tokens
 
- Can be added to RNNs
- In 2016, a new type of highly parallelisable decomposable attention was successfully combined with a feedforward network
- Attention is useful in and of itself, not just with RNNs
 
- Transformers use attention without recurrent connections
- Process all tokens simultaneously
- Calculate attention weights in successive layers
 
Scaled Dot-Product Attention
- Calculate attention weights between all tokens at once
- Learn 3 weight matrices:
- Query weights $W_Q$
- Key weights $W_K$
- Value weights $W_V$
 
- Word vectors:
- For each token $i$, take the input word embedding $x_i$
- Multiply it with each of the three matrices above to produce three vectors (sketch below):
- Query vector $q_i = x_i W_Q$
- Key vector $k_i = x_i W_K$
- Value vector $v_i = x_i W_V$
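
A minimal numpy sketch of the projection step above. The sizes, token count, and random matrices are illustrative assumptions; in practice $W_Q$, $W_K$, $W_V$ are learned by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, d_k = 3, 8, 4          # toy sizes, assumed for illustration
X = rng.normal(size=(n_tokens, d_model))  # rows are input word embeddings x_i

# The three weight matrices (random stand-ins here; learned in practice)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Multiply each embedding with each matrix to get its query, key, and value vectors
Q = X @ W_Q   # row i is q_i = x_i W_Q
K = X @ W_K   # row i is k_i = x_i W_K
V = X @ W_V   # row i is v_i = x_i W_V

print(Q.shape, K.shape, V.shape)  # (3, 4) each
```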
 
- Attention weight $a_{ij}$ from token $i$ to token $j$: the dot product of the query and key vectors, $q_i \cdot k_j$
- Divided by the square root of the dimensionality of the key vectors, $\sqrt{d_k}$
- Passed through a softmax to normalise (sketch below)
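
A sketch of this weight calculation, assuming the rows of `Q` and `K` are the $q_i$ and $k_j$ vectors from the previous step.

```python
import numpy as np

def attention_weights(Q, K):
    """a_ij = softmax_j(q_i . k_j / sqrt(d_k)); rows index the querying token i."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # scores[i, j] = q_i . k_j, scaled
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum(axis=-1, keepdims=True)  # softmax over j: each row sums to 1

# Toy usage with random query/key matrices (rows q_i, k_i)
rng = np.random.default_rng(0)
A = attention_weights(rng.normal(size=(3, 4)), rng.normal(size=(3, 4)))
print(A.sum(axis=-1))  # each row sums to 1
```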
 
- $W_Q$ and $W_K$ are different matrices, so attention can be non-symmetric
- Token $i$ attending to token $j$ ($q_i \cdot k_j$ is large) doesn't imply that $j$ attends to $i$ ($q_j \cdot k_i$ can be small); see the check below
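
A quick check of the non-symmetry claim with random (assumed) embeddings and weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                     # toy embeddings
W_Q, W_K = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
Q, K = X @ W_Q, X @ W_K

scores = Q @ K.T / np.sqrt(4)                   # scores[i, j]: how strongly token i attends to j
print(np.allclose(scores, scores.T))            # False: q_i . k_j != q_j . k_i in general
```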
 
 
- Output for token $i$ is the weighted sum of the value vectors of all tokens, weighted by $a_{ij}$, the attention from token $i$ to each other token $j$ (sketch below)
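
A sketch of the per-token output, assuming `A` holds the normalised attention weights $a_{ij}$ and the rows of `V` are the value vectors $v_j$:

```python
import numpy as np

def output_for_token(i, A, V):
    """Output for token i: sum_j a_ij * v_j, the a_ij-weighted sum of all value vectors."""
    return sum(A[i, j] * V[j] for j in range(V.shape[0]))

# Toy usage: uniform attention over 3 tokens and simple value vectors
A = np.full((3, 3), 1 / 3)
V = np.arange(12.0).reshape(3, 4)
print(output_for_token(0, A, V))  # equals the mean of the three value vectors here
```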
 
- In matrix form: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathsf T}}{\sqrt{d_k}}\right)V$
- $Q$, $K$, $V$ are the matrices whose $i$-th rows are the vectors $q_i$, $k_i$, $v_i$ respectively
- Softmax is taken over the horizontal axis (row-wise, over $j$), as in the sketch below
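
A compact sketch of the full matrix form; the random `Q`, `K`, `V` here are stand-ins for the matrices built from the learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, softmax over each row."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax (the "horizontal axis")
    return A @ V                                   # row i is sum_j a_ij v_j

# Toy usage: Q, K, V with rows q_i, k_i, v_i (random stand-ins)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4)
```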