The course slides:

1. Overview

What are the areas of application?

  • Computer vision
  • Text and Speech
  • Control

    What’s made all this possible?

  • Compute
    • Specialized hardware (GPUs) that is much faster at the very specific operations neural networks rely on, such as matrix multiplication
  • Data
    • There is much more data available
    • Models that are data hungry and improve as the amount of data increases
  • Modularity
    • Modular blocks that can be arranged in various ways

The deep learning puzzle

All that DL researchers are doing is building small blocks that can be interconnected in various ways so that they jointly process data.

Deep Learning is constructing networks of parameterized functional modules & training them from examples using gradient-based optimization. - Yann LeCun

Each of the computation units (nodes) needs to know:

  • How to adjust its input if its output needs to change?
  • What to output?

2. Neural Networks

Basics

Real neuron

The human brain is estimated to contain around 86,000,000,000 such neurons, each connected to up to thousands of others.

  • Connected to others
  • Represents simple computation
  • Has inhibition and excitation connections
  • Has a state
  • Outputs spikes

    Artificial neurons

    The goal of simple artificial neuron models is to reflect some neurophysiological observations, not to reproduce the dynamics of real neurons.

  • Easy to compute
  • Represents simple computation
  • Has inhibition and excitation connections
  • Stateless w.r.t time
  • Outputs real values rather than spikes through time

    Linear layer

    In Machine Learning linear really means affine. Neurons in a layer are often called units. Parameters are often called weights.

  • Easy to compute
  • Collection of artificial neurons
  • Can be efficiently vectorized (see the sketch below)
  • Fits highly optimized hardware (GPU/TPU). Naive matrix multiplication of two matrices is cubic in time and can be made more efficient with divide-and-conquer algorithms that fit the GPU paradigm extremely well.
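
A minimal sketch of such a vectorized affine layer, assuming NumPy and an illustrative batch of inputs (the names here are not from the slides):

```python
import numpy as np

def linear(x, W, b):
    """Affine layer: y = xW + b, applied to a whole batch at once."""
    # x: (batch, in_features), W: (in_features, out_features), b: (out_features,)
    return x @ W + b

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 128))          # batch of 32 inputs
W = rng.normal(size=(128, 64)) * 0.01   # weights (parameters)
b = np.zeros(64)                        # the bias makes the map affine, not purely linear
y = linear(x, W, b)                     # shape (32, 64)
```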

    Single layer neural networks

    Sigmoid activation function

    Activation functions are often called non-linearities and are applied point-wise.

  • Introduces non-linear behavior
  • Produces probability estimate
  • Has simple derivatives
  • Saturates (once it saturates, it no longer tells us how to adjust the weights)
  • Derivatives vanish (gradients far to the left and right approach 0; see the sketch below)
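
A minimal sketch of the sigmoid and its derivative, showing the vanishing gradient in the saturated regime (illustrative values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)        # the derivative is expressed via the output itself

print(sigmoid_grad(0.0))        # 0.25 -- the largest possible slope
print(sigmoid_grad(10.0))       # ~4.5e-05 -- saturated: the gradient has effectively vanished
```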

    Cross entropy

    Cross entropy loss is also called negative log likelihood or logistic loss

  • Encodes negation of logarithm of probability of correct classification
  • Composable with sigmoid
  • Numerically unstable

Being additive over samples allows for efficient learning. Losses of this additive form can be trained with stochastic gradient descent and scale very well to big datasets.
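
A sketch of the additive form in question, written in standard notation (not copied verbatim from the slides):

```latex
% Loss that decomposes as a sum (average) over samples:
L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i; \theta),\, y_i\big)

% Stochastic gradient descent follows an unbiased minibatch estimate of the gradient:
\nabla_\theta L(\theta) \;\approx\; \frac{1}{|B|} \sum_{i \in B} \nabla_\theta\, \ell\big(f(x_i; \theta),\, y_i\big)
```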

Softmax

A smooth version of the maximum operation

Softmax is the most commonly used final activation in classification.

It can also be used to have a smooth version of maximum.
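
In symbols (standard definitions, with beta as an illustrative temperature parameter):

```latex
\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}},
\qquad
\max_i x_i \;\approx\; \sum_{i=1}^{k} x_i \,\mathrm{softmax}(\beta x)_i
\quad \text{(approaches the true maximum as } \beta \to \infty\text{)}
```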

  • Multi-dimensional generalization of sigmoid
  • Produces probability estimate
  • Has simple derivatives
  • Saturates
  • Derivatives vanish

Softmax + Cross entropy

Widely used not only in classification but also in RL. Cannot represent sparse outputs (unlike sparsemax). Does not scale well with the number of classes k.

  • Encodes negation of logarithm of probability of entirely correct classification
  • Equivalent of multinomial logistic regression model
  • Numerically stable combination (sketched below)
  • Does not scale well with the number of classes k
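
A minimal sketch of the numerically stable combination, using the log-sum-exp trick (function and variable names are illustrative):

```python
import numpy as np

def softmax_cross_entropy(logits, target):
    """Negative log-probability of the correct class, computed stably.

    logits: (k,) unnormalized scores; target: integer class index.
    """
    # Log-softmax via the log-sum-exp trick: shift by the max so exp() cannot overflow.
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target]

print(softmax_cross_entropy(np.array([1000.0, 0.0, -1000.0]), 0))  # 0.0, no overflow
```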

High-dimensional spaces are surprisingly easy to shatter with hyperplanes.

Limitation: Linear models cannot do XOR

Two layer neural networks

1-hidden layer network VS XOR

The hidden layer provides a non-linear transformation of the input space so that the final linear layer can classify. It is used to rotate and then slightly bend the input space with the sigmoid, preprocessing the data so that it becomes linearly separable.
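
A minimal sketch of this idea with hand-picked weights (chosen for illustration, not learned): two sigmoid hidden units, one acting like OR and one like AND, followed by a linear read-out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hidden layer: unit 1 fires when at least one input is on (OR-like),
# unit 2 fires only when both inputs are on (AND-like). The weights are
# scaled to make the sigmoids sharp; they are illustrative, not trained.
W1 = np.array([[10.0, 10.0],
               [10.0, 10.0]])
b1 = np.array([-5.0, -15.0])
w2 = np.array([1.0, -1.0])      # linear layer on top: OR minus AND = XOR

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = sigmoid(np.array(x) @ W1 + b1)   # non-linear bend of the input space
    y = h @ w2                           # now linearly separable
    print(x, int(y > 0.5))               # prints 0, 1, 1, 0
```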

  • With just 2 hidden neurons we can solve XOR
  • The hidden layer allows us to bend and twist the input space
  • We use a linear model on top to do the classification

What makes it possible for neural networks to learn arbitrary shapes?

Universal Approximation Theorem

For any continuous function from the hypercube [0,1]^d to the real numbers, and every positive epsilon, there exists a sigmoid-based, 1-hidden-layer neural network that obtains at most epsilon error in function space.
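
A sketch of one common formal statement (for the sup norm on the hypercube; the exact formulation in the slides may differ):

```latex
\forall f \in C([0,1]^d),\ \forall \varepsilon > 0\ \ \exists N,\ \{v_i, w_i, b_i\}_{i=1}^{N}:
\quad
\sup_{x \in [0,1]^d} \Big|\, f(x) - \sum_{i=1}^{N} v_i\, \sigma(w_i^\top x + b_i) \Big| < \varepsilon
```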

A big enough network can approximate, but not necessarily represent, any smooth function. The mathematical trick is to show that such networks are dense in the space of target functions.

  • One of the most important theoretical results for Neural Networks
  • Shows that they are extremely expressive
  • Tells us nothing about learning (how to learn them)
  • The required size of the network can grow exponentially

Deep neural networks

Rectified Linear Unit (ReLU)

One of the most commonly used activation functions. Made the mathematical analysis of networks much simpler.

  • Introduces non-linear behavior
  • Creates piece-wise linear functions
  • Derivatives do not vanish
  • Dead neurons can occur
  • Technically not differentiable at 0

Depth

Number of linear regions that are created by ReLU grows exponentially with depth and polynomially with width.

  • Expressing symmetries and regularities is much easier with a deep model than with a wide one.
  • A deep model means many non-linear compositions, and thus harder learning.
  • Why depth really matters: ReLU networks can be seen as repeatedly folding the input space on top of itself. This gives a way to represent symmetries that only depth can provide; a wide, shallow network would need exponentially many neurons to represent the same invariance.

    Neural networks as computational graphs

    Extremely flexible: there is no difference between a weight and an input to a node in a computational graph. A weight can be substituted with the output of another network.

    3. Learning

    Linear algebra recap

    Gradient/Jacobian

    Gradient descent recap

    The choice of learning rate is critical. Gradient descent is the main learning algorithm behind deep learning. Many modifications exist: Adam, RMSProp, …

The gradient of a sum is the sum of the gradients.

  • Works for any “smooth enough” function
  • Can be used on non-smooth targets but with fewer guarantees
  • Converges to local optimum (guaranteed when smooth)
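
A minimal sketch of the plain gradient descent update (the quadratic example and the `grad_fn` callback are illustrative):

```python
import numpy as np

def gradient_descent(grad_fn, theta, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient; the learning rate choice is critical."""
    for _ in range(steps):
        theta = theta - learning_rate * grad_fn(theta)
    return theta

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta0 = np.array([3.0, -2.0])
print(gradient_descent(lambda t: 2 * t, theta0))  # close to [0, 0]
```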

Neural networks as computational graphs - API

  • Forward pass: Given input, what is the output?
  • Backward pass: the Jacobian with respect to the input, multiplied by incoming gradients to pass errors backwards (see the sketch below)
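
A minimal sketch of this forward/backward API (the class and method names are illustrative, loosely in the spirit of common frameworks, not taken from the course):

```python
import numpy as np

class Module:
    def forward(self, x):
        """Given the input, what is the output?"""
        raise NotImplementedError

    def backward(self, grad_output):
        """Given gradients w.r.t. the output, return gradients w.r.t. the input."""
        raise NotImplementedError

class Sigmoid(Module):
    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))
        return self.out

    def backward(self, grad_output):
        # Chain rule: multiply the incoming gradient by the local derivative.
        return grad_output * self.out * (1.0 - self.out)
```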

Gradient descent and computational graph

Chain rule, backprop and automatic differentiation

Linear layer as a computational graph

Backward pass is a computational graph itself.
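
A minimal sketch of the linear layer's forward and backward passes (standard derivation; the shapes are illustrative):

```python
import numpy as np

def linear_forward(x, W, b):
    return x @ W + b                      # x: (batch, n_in), W: (n_in, n_out)

def linear_backward(x, W, grad_out):
    """grad_out is dL/dy with shape (batch, n_out)."""
    grad_x = grad_out @ W.T               # dL/dx, passed to the previous module
    grad_W = x.T @ grad_out               # dL/dW, used by the optimizer
    grad_b = grad_out.sum(axis=0)         # dL/db
    return grad_x, grad_W, grad_b
```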

ReLU as a computational graph

We usually set the gradient at 0 to be equal to 0.

  • Can be seen as gating the incoming gradients: the ones going through neurons that were active are passed through, and the rest are zeroed (sketched below).
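
A minimal sketch of this gating behaviour:

```python
import numpy as np

def relu_forward(x):
    return np.maximum(x, 0.0)

def relu_backward(x, grad_out):
    # Gate the incoming gradients: pass them through where the neuron was
    # active (x > 0) and zero them elsewhere; the gradient at exactly 0 is set to 0.
    return grad_out * (x > 0)
```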

Softmax as a computational graph

Since exponentials of big numbers cause overflow, softmax is rarely implemented explicitly like this.

  • The backward pass is essentially a difference between the incoming gradient and the output.
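
A minimal sketch of the stable forward pass and the corresponding backward pass (the general vector-Jacobian product; when composed with cross entropy this reduces to output minus target):

```python
import numpy as np

def softmax_forward(x):
    shifted = x - np.max(x)              # subtract the max so exp() cannot overflow
    e = np.exp(shifted)
    return e / e.sum()

def softmax_backward(y, grad_out):
    """y is the softmax output; the Jacobian is diag(y) - y y^T."""
    return y * (grad_out - np.dot(grad_out, y))
```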

    Cross entropy as a computational graph

    Though it is a loss, we can still multiply its backward pass by other incoming errors.

Each of these operations on its own is numerically unstable, and numerically stable solutions are desired. The combinations, composing cross entropy with either sigmoid or softmax, are implemented in a numerically stable way, and all we need to do is pick the combination we want from the lookup table.
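
For the sigmoid case, a minimal sketch of the stable composition (names are illustrative; `np.logaddexp` does the stable log(1 + exp(.)) computation):

```python
import numpy as np

def sigmoid_cross_entropy(logit, target):
    """Binary cross entropy on a raw logit, with target in {0, 1}.

    Computing -log(sigmoid(logit)) naively overflows for very negative logits;
    -log(sigmoid(z)) = log(1 + exp(-z)) = logaddexp(0, -z) stays finite.
    """
    return np.logaddexp(0.0, -logit) if target == 1 else np.logaddexp(0.0, logit)

print(sigmoid_cross_entropy(-1000.0, 1))  # 1000.0, instead of inf from -log(0)
```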

Working with any kind of neural network relies on composing modules around the same optimization algorithm, which will converge to some local minimum: not necessarily a perfect model, but it will learn something.

4. Pieces of the Puzzle

Max as a computational graph

Used in max pooling.

  • Gradients only flow through the selected element. Consequently we are not learning how to select, as the sketch below shows.
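
A minimal sketch of the gating (as used in max pooling):

```python
import numpy as np

def max_forward(x):
    idx = np.argmax(x)
    return x[idx], idx

def max_backward(x, idx, grad_out):
    # The gradient flows only through the selected element;
    # the selection itself receives no learning signal.
    grad_x = np.zeros_like(x)
    grad_x[idx] = grad_out
    return grad_x
```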

Conditional execution as a computational graph

  • The backward pass is gated in the same way the forward one is
  • We can learn the conditionals themselves too; just use softmax (a sketch follows below)
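
A minimal sketch of such a learnable (soft) conditional, using a softmax gate over two branches (entirely illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def soft_conditional(gate_logits, branch_a, branch_b, x):
    # A hard if/else only passes gradients through the branch that actually ran.
    # Softmax-weighting both branches keeps everything differentiable, so the
    # gate itself (gate_logits) can be learned.
    p = softmax(gate_logits)
    return p[0] * branch_a(x) + p[1] * branch_b(x)

print(soft_conditional(np.array([2.0, -2.0]), np.sin, np.cos, 1.0))
```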

5. Practical Issues

Overfitting and regularization

Classical results from statistics and statistical learning theory analyse the worst-case scenario.

Techniques

  • Lp regularization
  • Dropout (sketched below)
  • Noising data
  • Early stopping
  • Batch/Layer Norm
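
A minimal sketch of one of these techniques, inverted dropout (the keep probability is illustrative):

```python
import numpy as np

def dropout(x, keep_prob=0.8, training=True):
    """Inverted dropout: randomly zero units and rescale the survivors."""
    if not training:
        return x                               # at test time dropout is the identity
    mask = np.random.rand(*x.shape) < keep_prob
    return x * mask / keep_prob                # keeps the expected activation unchanged
```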

  • As your model gets more powerful, it can create extremely complex hypotheses, even if they are not needed.
  • Keeping things simple guarantees that if the training error is small, so will the test be.
  • As models grow, their learning dynamics changes, and they become less prone to overfitting.
  • There are new, exciting theoretical results, for example mapping these huge networks onto Gaussian Processes.

Model complexity is not as simple as the number of parameters.

  • Even big models can still benefit from regularization techniques.
  • We need new notions of effective complexity of our hypotheses classes

    Diagnosing and debugging

  • Initialization
    • The network won't learn well if initialization is bad
  • Overfit small sample
  • Monitor training loss
  • Monitor weights norms and NaNs
  • Add shape asserts (example below)
  • Start with Adam (3e-5 is a magical learning rate :) )
  • Change one thing at a time
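
For example, a shape assert can catch silent broadcasting bugs early (a minimal sketch):

```python
import numpy as np

def forward(x, W, b):
    assert x.ndim == 2 and W.shape[0] == x.shape[1], (x.shape, W.shape)
    y = x @ W + b
    assert y.shape == (x.shape[0], W.shape[1]), y.shape
    return y
```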

    6. Multiplicative Interactions

    What can't MLPs do?

    Multiplicative interactions

    Being able to approximate something is not the same as being able to represent it.

  • Multiplicative units unify attention, metric learning and many others.
  • They enrich the hypothesis space of regular neural networks in a meaningful way.
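
A minimal sketch of the kind of interaction an ordinary (additive) layer lacks, versus a bilinear, multiplicative unit (notation is illustrative):

```python
import numpy as np

def additive_unit(x, z, Wx, Wz, b):
    # A standard MLP layer only adds weighted inputs; no term multiplies x with z.
    return x @ Wx + z @ Wz + b

def multiplicative_unit(x, z, W):
    # A bilinear (3-tensor) interaction: every output mixes products x_i * z_j,
    # letting one input gate or modulate the other (the core of attention,
    # gating and metric learning layers).
    return np.einsum('i,ijk,j->k', x, W, z)
```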

If you want to do research on the fundamental building blocks of Neural Networks, do not seek to marginally improve the way they behave by finding a new activation function.

Ask yourself what current modules cannot represent or guarantee right now, and propose a module that can.