Deep Learning for Natural Language Processing
The course slides:
1. Background: deep learning and language
GPT-2, BERT, WaveNet
Why is Deep Learning such an effective tool for language processing?
1. Words are not discrete symbols
- Multi-head processing
- Distributed representations
2. Disambiguation depends on context
- Self-attention
3. Important interactions can be non-local
- Self-attention
- Multiple layers
4. How meanings combine depends on those meanings
- Skip connections
- Distributed representations
2. The Transformer
Distributed representation of words
What is the vocabulary over which our model will operate?
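To make "distributed representation" concrete, here is a minimal sketch that maps a whitespace-tokenized sentence to dense vectors through an embedding table; the tokenizer, vocabulary, and dimensions are illustrative, not the course's.

# Toy sketch: tokens -> IDs -> dense embedding vectors.
# The vocabulary and embedding size below are placeholders.
import torch
import torch.nn as nn

vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

tokens = "the cat sat on the mat".split()
ids = torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokens])
vectors = embed(ids)      # shape (6, 8): one dense vector per word
print(vectors.shape)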
The Transformer builds on solid foundations
Emergent semantic and syntactic structure in distributed representations
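This emergent structure is commonly demonstrated with the analogy test: vec(king) - vec(man) + vec(woman) should land near vec(queen). The sketch below uses hand-made placeholder vectors, not trained embeddings, purely to show the arithmetic.

# Analogy test on distributed representations.
# The 3-d vectors are illustrative placeholders, not learned embeddings.
import numpy as np

emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.9, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)   # "queen" with these toy vectors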
The Transformer: Self-attention over word input embeddings
- Compute self-attention for all words in the input (in parallel)
- Multi-head self-attention
- Feedforward layer
- Skip-connections, for “top-down” influences
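A compact sketch of how these pieces fit together in one encoder block, assuming the standard formulation from Vaswani et al. (2017); all hyperparameters are illustrative.

# One Transformer encoder block: multi-head self-attention,
# position-wise feedforward layer, skip connections + layer norm.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention over all positions at once (computed in parallel);
        # the skip connection lets lower-level features pass through.
        attn_out, _ = self.attn(x, x, x)   # queries = keys = values = x
        x = self.norm1(x + attn_out)       # skip connection + norm
        x = self.norm2(x + self.ff(x))     # feedforward + skip connection
        return x

x = torch.randn(2, 10, 64)                # (batch, sequence, embedding)
print(EncoderBlock()(x).shape)            # torch.Size([2, 10, 64])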
The Transformer: Position encoding of words
- Add a fixed quantity to the embedding activations
- The quantity added to each input embedding unit depends on:
  - The dimension of the unit within the embedding
  - The absolute position of the word in the input
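A sketch of one such scheme, the fixed sinusoidal encoding from the original Transformer paper, where the value added depends on exactly these two factors: the unit's dimension index and the word's absolute position.

# Sinusoidal position encoding (Vaswani et al., 2017):
# even dimensions get sin(pos / 10000^(2i/d)), odd dimensions get cos(...).
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]     # absolute word positions
    i = np.arange(d_model)[None, :]       # dimension index within embedding
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

pe = positional_encoding(max_len=50, d_model=16)
# embeddings = embeddings + pe[:seq_len]  # added to the input embeddings
print(pe.shape)                           # (50, 16)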
3. Unsupervised and transfer learning with BERT
BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding
Masked language model pretraining
- Transformer encoder
- Classification layer: fully connected layer + GELU + Norm
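As a sketch of the masked-LM objective, the helper below follows the masking recipe described in the BERT paper: 15% of positions are selected, and of those, 80% become [MASK], 10% become a random token, and 10% are left unchanged. The token IDs are illustrative.

# Masked-LM data preparation per the BERT paper (recipe: 15% / 80-10-10).
import random

MASK_ID = 103           # [MASK] ID in a WordPiece vocab (illustrative)
VOCAB_SIZE = 30522      # vocabulary size (illustrative)

def mask_tokens(token_ids, mask_prob=0.15):
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)       # -100 = "ignore" convention
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                # model must predict the original
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID        # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

print(mask_tokens([7592, 1010, 2088, 999]))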
Next sentence prediction pretraining
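A sketch of how NSP training pairs are built, per the BERT paper: half the time sentence B actually follows sentence A (IsNext), half the time B is a random sentence from the corpus (NotNext). The corpus and sampling here are simplified.

# Next-sentence-prediction pair construction (simplified sketch).
import random

def make_nsp_pairs(sentences):
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # A full pipeline would ensure the random sentence is not
            # the true next sentence; omitted here for brevity.
            pairs.append((sentences[i], random.choice(sentences), "NotNext"))
    return pairs

corpus = ["The man went to the store.",
          "He bought a gallon of milk.",
          "Penguins are flightless birds."]
for a, b, label in make_nsp_pairs(corpus):
    print(label, "|", a, "->", b)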