Attention and Memory in Deep Learning
The course slides:
Attention is about ignoring things. Counterintuitively, rather than putting more into the neural network, it is about removing information so it is possible to focus on specific parts of the data.
Networks respond more strongly to some parts of the input than others, and ignore irrelevant detail in the background.
Jacobian: matrix of partial derivatives, giving the sensitivity of the network outputs with respect to the inputs.
The backprop calculation used for gradient descent can be repurposed to analyse the sensitivity of the network.
Instead of passing back the errors with respect to some loss function, set the error equal to the output activations themselves, then perform backprop.
This shows what pieces of information the network is really focusing on.
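The repurposed backprop above can be sketched for a tiny network. A minimal sketch (the network and weights are ours, not from the lecture): the Jacobian of a one-hidden-layer tanh net with respect to its inputs, computed analytically the same way backprop would.

```python
import numpy as np

# Tiny illustrative network: y = W2 @ tanh(W1 @ x)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hidden-layer weights
W2 = rng.normal(size=(2, 4))   # output-layer weights

def forward(x):
    h = np.tanh(W1 @ x)
    return W2 @ h, h

def jacobian(x):
    # Backprop with the output activations as the "error":
    # dy/dx = W2 @ diag(1 - h^2) @ W1
    _, h = forward(x)
    return W2 @ (np.diag(1.0 - h**2) @ W1)

x = rng.normal(size=3)
J = jacobian(x)
print(J.shape)  # (2, 3): sensitivity of each output to each input
```

Large entries in a column of `J` mean the outputs are highly sensitive to that input, i.e. the network is "attending" to it.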
Duelling Network
A network applied to playing Atari games; the input is a video sequence.
The output is two-headed: one head attempts to predict the value of the state, as is normal for RL;
the other head attempts to predict the action advantage: the degree to which a particular action would raise the value above the expectation over the other actions. Example: a racing game, where the goal is to overtake as many cars as possible.
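The two heads are combined into Q-values. A hedged sketch of the standard dueling aggregation (the helper name is ours): subtract the mean advantage so the value/advantage split is identifiable.

```python
import numpy as np

# Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').
# Subtracting the mean keeps the decomposition into value and
# advantage unique (V and A could otherwise trade off freely).
def dueling_q(value, advantages):
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

q = dueling_q(1.0, [0.5, -0.5, 0.0])
print(q)  # Q-values: [1.5, 0.5, 1.0]
```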
Even for the same data, we get a very different sensitivity pattern depending on which tasks we are trying to perform.
The attention mechanism is allowing it to process the same data in two very different ways.
Attention allows you to ignore some parts of the data and focus on others.
Recurrent Networks
Take sequences as input, output sequences.
Feedback connections give them memory of previous inputs.
Memory can be thought as attention through time.
How are they using the memory to solve the task?
Sequential Jacobian: instead of a 2D matrix of partial derivatives, a 3D matrix where the third dimension is time.
What we mostly care about is how sensitive the network is at one point in time to inputs at other points in time.
If we look at the sequential Jacobian (in the handwriting example), we see a peak in sensitivity roughly where the letter "i" is written.
The sensitivity doesn't extend very far back in time.
The suffix "-ing" is common, which helps to disambiguate.
When the pen is lifted off the page to put the dot on the "i", the network is sensitive to that point.
This gives a quantifiable sense of the degree to which the network is using each piece of information.
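The sequential Jacobian can be illustrated numerically. A sketch under assumed conditions (a tiny vanilla RNN with random weights, finite differences standing in for backprop through time): the sensitivity of the final output to the input at each earlier timestep.

```python
import numpy as np

# Tiny vanilla RNN: h_t = tanh(Wx @ x_t + Wh @ h_{t-1}), y = Wy @ h_T
rng = np.random.default_rng(1)
Wx = rng.normal(scale=0.5, size=(3, 2))
Wh = rng.normal(scale=0.5, size=(3, 3))
Wy = rng.normal(scale=0.5, size=(1, 3))

def run(xs):
    h = np.zeros(3)
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
    return (Wy @ h)[0]

xs = rng.normal(size=(5, 2))
eps = 1e-5
# seq_jac[t, i] = d(final output) / d(xs[t, i]):
# one slice of the 3D sequential Jacobian, measured by central differences.
seq_jac = np.zeros_like(xs)
for t in range(5):
    for i in range(2):
        d = np.zeros_like(xs); d[t, i] = eps
        seq_jac[t, i] = (run(xs + d) - run(xs - d)) / (2 * eps)
print(seq_jac.shape)  # (5, 2): sensitivity per timestep, per input dim
```

Rows for early timesteps typically shrink in magnitude, which is exactly the "how far back does the memory reach" question the lecture asks.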
Machine Translation
Words may appear in a completely different order in a different language.
Explicit Attention
- Computational efficiency
- no need to process all of the data; saves computation
- Scalability
- take a fixed-size part of an image, so the model can scale to any sized image; the resolution of the input does not alter the architecture of the network
- sequential
- foveal gaze moving around a static image
- a sequence of sensory inputs
- doing this can improve robustness
- networks with sequences of glimpses / foveal attention were more robust to adversarial examples than ordinary CNNs that looked at the entire image
- Interpretability
- requires making a hard decision and choosing which part of the data to look at
- can analyze a bit more clearly what the network is actually using
- implicit attention: uses the Jacobian as a guide, which is not necessarily a reliable signal
A loop is going on: the network influences the data it receives.
Even if the network itself is feedforward, the system is recurrent.
It contains a loop!
Define a probability distribution over glimpses, parameterised by an attention vector a.
Simplest case: split the image into tiles; the softmax outputs are the probabilities of picking each tile.
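A minimal sketch of that simplest case (the image, tiling, and scores are stand-ins of ours): a softmax over four tiles of a 4x4 image, from which one tile is picked as the glimpse.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over attention scores
    e = np.exp(z - z.max())
    return e / e.sum()

image = np.arange(16.0).reshape(4, 4)
# Split into four 2x2 tiles
tiles = [image[r:r+2, c:c+2] for r in (0, 2) for c in (0, 2)]
scores = np.array([0.1, 2.0, -1.0, 0.5])  # assumed attention logits
p = softmax(scores)                        # distribution over tiles
glimpse = tiles[np.argmax(p)]              # hard choice (could also sample)
print(glimpse.shape)  # (2, 2): only one tile is actually processed
```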
HARD DECISION: we no longer have a complete gradient through what the network has done; we have a stochastic policy and, in RL terms, we are sampling from it.
REINFORCE shows how to get a gradient with respect to discrete samples.
RL methods are designed to get a training signal through a discrete policy,
and can be used any time there is a non-differentiable module in the system.
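The REINFORCE idea above can be sketched in a few lines. This is a hedged toy example (the reward setup is ours, not from the lecture): the gradient of the expected reward with respect to the logits is estimated with the score function, reward * grad log pi(a), using only sampled discrete actions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.zeros(3)                  # policy parameters over 3 "tiles"
reward = np.array([0.0, 1.0, 0.0])    # assumed: tile 1 is the useful glimpse

for _ in range(500):
    p = softmax(logits)
    a = rng.choice(3, p=p)            # sample a hard attention decision
    grad_logp = -p.copy()
    grad_logp[a] += 1.0               # d log pi(a) / d logits for a softmax
    logits += 0.1 * reward[a] * grad_logp   # REINFORCE update
print(np.argmax(logits))  # 1: the policy learns to pick the rewarded tile
```

No gradient ever flows through the discrete sample itself; the score function routes the training signal around the non-differentiable choice.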
Using hard attention vs the implicit attention already present in neural networks.
Generally we want something more complex than a softmax over tiles.
Complex Glimpses
Glimpses are squashed down, mimicking the effect of the human eye: high resolution at the centre of gaze, lower resolution at the periphery. The peripheral information is sufficient to alert you to something.
Example task: find a digit in clutter.
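A sketch of such a foveal glimpse under assumed dimensions (the 4x4 fovea / 8x8 surround layout is ours): a sharp crop at the gaze centre plus a wider crop average-pooled down to the same size, giving coarse peripheral vision at no extra cost.

```python
import numpy as np

def glimpse(image, cy, cx):
    # High-resolution 4x4 patch at the centre of gaze
    fovea = image[cy-2:cy+2, cx-2:cx+2]
    # Wider 8x8 surround, average-pooled down to 4x4 (coarse periphery)
    wide = image[cy-4:cy+4, cx-4:cx+4]
    periphery = wide.reshape(4, 2, 4, 2).mean(axis=(1, 3))
    return fovea, periphery

img = np.arange(144.0).reshape(12, 12)
fovea, periphery = glimpse(img, 6, 6)
print(fovea.shape, periphery.shape)  # (4, 4) (4, 4): same size, different scale
```

Both patches have the same shape, so the network's input size stays fixed regardless of the full image resolution.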
Scalability: a sequential glimpse is more scalable because you can use it to represent multiple objects,
e.g. reading multiple digits from a street-address image.
The network moves its attention around; where it looks is a good indication of which parts of the image really matter.
Soft Attention
Makes end-to-end training possible:
the network has all the data and just weights what to focus on and what to ignore; no hard decision is needed.
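The contrast with hard attention can be sketched directly (the feature vectors and scores are illustrative stand-ins): instead of sampling one location, soft attention takes a softmax-weighted average over all of them, so every step is differentiable and ordinary backprop suffices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

values = np.array([[1.0, 0.0],   # one feature vector per location
                   [0.0, 1.0],
                   [1.0, 1.0]])
scores = np.array([2.0, 0.0, -2.0])  # assumed attention scores
w = softmax(scores)                   # differentiable weights, sum to 1
context = w @ values                  # soft read: weighted sum, not a choice
print(context.shape)  # (2,): a blend of all locations, dominated by the first
```

Because `context` is a smooth function of `scores`, gradients flow back through the attention weights without REINFORCE.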