The course slides:

1. Background: deep learning and language

GPT-2, BERT, WaveNet

Why is Deep Learning such an effective tool for language processing?

1. Words are not discrete symbols.

  • Multi-head processing

  • Distributed representations

2. Disambiguation depends on context

  • Self-attention

3. Important interactions can be non-local

  • Self-attention

  • Multiple layers

4. How meanings combine depends on those meanings

  • Skip connections

  • Distributed representations

2. The Transformer

Distributed representation of words

What is the vocabulary on which our model is going to operate?
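As a concrete illustration of this question, here is a minimal sketch that builds a word-level vocabulary from a toy corpus and maps sentences to integer ids. The corpus, special tokens, and `encode` helper are illustrative assumptions; real systems such as BERT and GPT-2 operate on subword vocabularies (WordPiece, byte-pair encoding) rather than whole words.

```python
# A word-level vocabulary sketch (illustrative; BERT/GPT-2 use subword units).
corpus = ["the cat sat on the mat", "the dog sat on the log"]

vocab = {"<pad>": 0, "<unk>": 1}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)

def encode(sentence):
    """Map each word to its integer id; unseen words map to <unk>."""
    return [vocab.get(w, vocab["<unk>"]) for w in sentence.split()]

print(vocab)
print(encode("the cat chased the dog"))
```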

The Transformer builds on solid foundations

Emergent semantic and syntactic structure in distributed representations
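The emergent structure referred to here is often probed with vector arithmetic and cosine similarity. The sketch below shows only the mechanics, with randomly initialised vectors standing in for trained embeddings; with real trained embeddings, nearest neighbours are semantically related words and offsets such as king - man + woman land near queen.

```python
import numpy as np

# Illustrative only: random vectors stand in for trained word embeddings.
vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3, "cat": 4}
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 64))     # one 64-dimensional vector per word

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# With trained embeddings, offsets like E[king] - E[man] + E[woman]
# land close to E[queen]; with random vectors this is just the mechanics.
analogy = E[vocab["king"]] - E[vocab["man"]] + E[vocab["woman"]]
print(cosine(analogy, E[vocab["queen"]]))
```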

The Transformer: Self-attention over word input embeddings

Compute self-attention for all words in the input (in parallel)
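A minimal NumPy sketch of scaled dot-product self-attention, computed for every position at once via matrix products. The projection matrices here are random placeholders for learned parameters, and details such as masking and the output projection are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention computed for all positions at once.

    X: (seq_len, d_model) word input embeddings.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)        # each word attends over every word
    return weights @ V                        # context-mixed representations

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (6, 16)
```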

Multi-head self-attention

Feedforward layer

Skip connections: for “top-down” influences
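A sketch of how these three pieces combine into one encoder block: multi-head self-attention, a position-wise feedforward layer, and skip (residual) connections around each. Layer normalisation and the attention output projection are omitted for brevity, and all weights are random placeholders, so this is illustrative rather than the lecture's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, heads):
    """Run several independent attention heads and concatenate their outputs."""
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outputs.append(A @ V)
    return np.concatenate(outputs, axis=-1)        # back to (seq_len, d_model)

def encoder_block(X, heads, W1, b1, W2, b2):
    # Skip connection around attention: the original signal passes straight
    # through while attention adds contextual information on top of it.
    X = X + multi_head_self_attention(X, heads)
    # Position-wise feedforward layer, again wrapped in a skip connection.
    X = X + np.maximum(X @ W1 + b1, 0.0) @ W2 + b2
    return X

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
d_head, d_ff = d_model // n_heads, 4 * d_model
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(encoder_block(X, heads, W1, b1, W2, b2).shape)   # (6, 16)
```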

The Transformer: Position encoding of words

  • Add a fixed quantity to the embedding activations (see the sketch after this list)

  • The quantity added to each input embedding unit depends on:

    • The dimension of the unit within the embedding

    • The absolute position of the word in the input
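A sketch of the fixed sinusoidal position encoding from the original Transformer paper (Vaswani et al., 2017), which matches the description above: the value added to each embedding unit is a function of both the unit's dimension and the word's absolute position.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encoding: the value added to unit i of the embedding
    at position pos depends on both pos and i."""
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

# The encoding is simply added to the word input embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 16))         # 10 words, 16-dimensional embeddings
X = X + positional_encoding(10, 16)
```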

3. Unsupervised and transfer learning with BERT

BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding

Masked language model pretraining

  • Transformer encoder

  • Classification layer: Fully connected layer + GELU + Norm
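A sketch of how a masked-LM training example can be constructed. In BERT roughly 15% of tokens are selected for prediction (most of those are replaced with [MASK]); the token ids below are illustrative, the -100 label follows the common "ignore this position" convention, and BERT's 80/10/10 mask/random/keep refinement is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 103                                  # id of the [MASK] token (illustrative)

tokens = np.array([101, 7592, 2088, 2003, 2307, 102])   # [CLS] w1 w2 w3 w4 [SEP]
labels = np.full_like(tokens, -100)            # -100 marks positions not predicted

maskable = np.arange(1, len(tokens) - 1)       # never mask [CLS] or [SEP]
n_mask = max(1, int(0.15 * len(maskable)))
chosen = rng.choice(maskable, size=n_mask, replace=False)

labels[chosen] = tokens[chosen]                # the model must recover these ids
tokens[chosen] = MASK_ID                       # hide them behind [MASK]

print(tokens, labels)
```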

Next sentence prediction pretraining
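A sketch of next-sentence-prediction pair construction: half of the time the true following sentence is used (label IsNext), otherwise a random sentence from the corpus (label NotNext). The sentences and helper function below are illustrative assumptions, not the lecture's code.

```python
import random

rng = random.Random(0)

def make_nsp_example(doc_sentences, all_sentences):
    """One next-sentence-prediction pair: 50% of the time the true following
    sentence (label 1 = IsNext), otherwise a random sentence (label 0 = NotNext)."""
    i = rng.randrange(len(doc_sentences) - 1)
    first = doc_sentences[i]
    if rng.random() < 0.5:
        return first, doc_sentences[i + 1], 1
    return first, rng.choice(all_sentences), 0

doc = ["the cat sat on the mat", "it then fell asleep", "the dog barked outside"]
corpus = doc + ["quantum tunnelling is a wave phenomenon"]
print(make_nsp_example(doc, corpus))
```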

4. Grounded language learning at DeepMind