Convolutional Neural Networks for Image Recognition
The course slides:
1.Background
How can we feed images to a neural network?
A digital image is a 2D grid of pixels. A neural network expects a vector of numbers as input.
Locality: nearby pixels are more strongly correlated.
Translation invariance: meaningful patterns can occur anywhere in the image.
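As a minimal sketch (using a NumPy array as a stand-in for an image, with an illustrative size), flattening the 2D grid into the vector a fully connected network expects throws away exactly the structure the following slides exploit:

```python
import numpy as np

# A toy 28x28 grayscale "image": a 2D grid of pixel intensities.
image = np.random.rand(28, 28)

# A fully connected network expects a flat vector, so the grid is unrolled...
x = image.reshape(-1)   # shape: (784,)

# ...but the unrolling discards locality: pixels (i, j) and (i, j+1) are no
# longer represented as neighbours in any meaningful way.
print(x.shape)
```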
Taking advantage of topological structure
Weight sharing: use the same network parameters to detect local patterns at many locations in the image.
Hierarchy: local low-level features are composed into larger, more abstract features.
The ImageNet Challenge
- Major computer vision benchmark
- 1.4M images, 1000 classes
- Image classification
2.Building blocks
From fully connected to locally connected
Implementation: the convolution operation
The kernel slides across the image and produces an output value at each position.
We convolve multiple kernels and obtain multiple feature maps or channels.
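A naive NumPy sketch of the operation just described (valid convolution, stride 1; the function name `conv2d_valid` is illustrative, not from the slides):

```python
import numpy as np

def conv2d_valid(image, kernels):
    """Slide each kernel across the image; one output feature map per kernel."""
    h, w = image.shape
    n_k, kh, kw = kernels.shape
    out_h, out_w = h - kh + 1, w - kw + 1            # valid convolution
    out = np.zeros((n_k, out_h, out_w))
    for k in range(n_k):
        for i in range(out_h):
            for j in range(out_w):
                # The same kernel weights are reused at every position (weight
                # sharing); as usual in deep learning, the kernel is not flipped.
                out[k, i, j] = np.sum(image[i:i + kh, j:j + kw] * kernels[k])
    return out

image = np.random.rand(28, 28)                        # a single-channel image
kernels = np.random.rand(8, 3, 3)                     # 8 kernels of size 3x3
print(conv2d_valid(image, kernels).shape)             # (8, 26, 26): 8 feature maps
```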
Inputs and outputs are tensors
3D objects that have width, height, and channels. Each output channel of the convolution is connected to all the input channels as well.
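In PyTorch notation (an illustration, not from the slides), this is visible in the shape of a convolution's weight tensor: every output channel has a kernel for every input channel.

```python
import torch
from torch import nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# 16 output channels, each with its own 3x3 kernel for every one of the 3 input channels.
print(conv.weight.shape)        # torch.Size([16, 3, 3, 3])

x = torch.randn(1, 3, 28, 28)   # input tensor: batch, channels, height, width
print(conv(x).shape)            # torch.Size([1, 16, 26, 26])  (valid convolution)
```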
Variants of the convolution operation
- Valid convolution: output size = input size - kernel size + 1. The output is slightly smaller than the input (see the size sketch after this list).
- Full convolution (with added padding): output size = input size + kernel size - 1. The output is slightly larger than the input.
- Same convolution: output size = input size, so feature maps keep the same size as the image. Only makes sense if the kernel size is odd.
- Strided convolution: the kernel slides along the image with a step > 1. Pro: a lot cheaper to compute!
- Dilated convolution: the kernel is spread out, with a step > 1 between kernel elements. Pros: can be computed more efficiently, and there is no need to pad zeros.
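All of these sizes follow from one general formula; here is a small helper (hypothetical, written for this note rather than taken from the slides) that reproduces each variant as a special case:

```python
def conv_output_size(in_size, kernel_size, stride=1, padding=0, dilation=1):
    """General output-size formula; each variant above is a special case."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (in_size + 2 * padding - effective_kernel) // stride + 1

# Valid convolution: no padding -> input - kernel + 1
print(conv_output_size(32, 5))                       # 28
# Full convolution: pad kernel - 1 on each side -> input + kernel - 1
print(conv_output_size(32, 5, padding=4))            # 36
# Same convolution: odd kernel, pad (kernel - 1) // 2 -> input
print(conv_output_size(32, 5, padding=2))            # 32
# Strided convolution: step > 1 -> roughly input / stride
print(conv_output_size(32, 5, stride=2, padding=2))  # 16
# Dilated convolution: step > 1 between kernel elements -> larger receptive field
print(conv_output_size(32, 3, dilation=2))           # 28
```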
Pooling
Def: compute the mean or max over small windows to reduce resolution.
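A minimal NumPy sketch of 2x2 max pooling (mean pooling just replaces `max` with `mean`); the helper name is illustrative:

```python
import numpy as np

def max_pool2d(x, window=2):
    """Reduce resolution by taking the max over non-overlapping windows."""
    h, w = x.shape
    x = x[:h - h % window, :w - w % window]   # crop so the windows tile exactly
    return x.reshape(h // window, window, w // window, window).max(axis=(1, 3))

feature_map = np.random.rand(26, 26)
print(max_pool2d(feature_map).shape)          # (13, 13)
```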
3.Convolutional neural networks
Stacking the building blocks
- CNNs or convnets
- Up to 100s of layers
- Alternate convolutions and pooling to create a hierarchy: higher layers in the model extract more abstract features from the image (a minimal sketch follows this list).
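A minimal PyTorch sketch of this alternating pattern; the channel counts, image size, and number of classes are illustrative assumptions:

```python
import torch
from torch import nn

# Alternate convolution + nonlinearity with pooling: each stage halves the
# resolution while later layers compute larger-scale, more abstract features.
convnet = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                    # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                    # 16x16 -> 8x8
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                    # 8x8 -> 4x4
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 10),          # final linear classifier over 10 classes
)

x = torch.randn(1, 3, 32, 32)           # a batch with one 32x32 RGB image
print(convnet(x).shape)                 # torch.Size([1, 10])
```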
4.Going deeper: case studies
LeNet-5
A convnet for handwritten digit recognition. The design does not scale well to much deeper networks.
AlexNet
Architecture: 8 layers, ReLU, dropout, weight decay (regularizer)
Infrastructure: large dataset, trained for 6 days on 2 GPUs
Key innovations:
- ReLU and dropout
- No need to pair every convolution layer with a pooling layer
Deeper is better
- A single layer by itself is just a linear classifier
- More layers mean more nonlinearities
- What limits the number of layers in convnets?
VGGNet
Stack many convolutional layers before each pooling step. Use same convolutions so that resolution is reduced only by pooling.
VGGNet: stacking 3×3 kernels
Architecture: up to 19 layers, 3×3 kernels only, same convolutions
Infrastructure: trained for 2-3 weeks on 4 GPUs (data parallelism, not model parallelism)
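A quick back-of-the-envelope check of why stacking small kernels pays off (the channel count is an illustrative assumption): two 3×3 layers cover the same 5×5 receptive field as a single 5×5 layer, with fewer weights and an extra nonlinearity in between.

```python
channels = 64                                      # illustrative channel count

# Parameters of a conv layer ~ in_channels * out_channels * k * k (biases ignored)
params_one_5x5 = channels * channels * 5 * 5       # one 5x5 layer
params_two_3x3 = 2 * channels * channels * 3 * 3   # two stacked 3x3 layers

print(params_one_5x5)    # 102400
print(params_two_3x3)    # 73728 -> same receptive field, ~28% fewer weights
```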
VGGNet: error plateaus after 16 layers
Challenges of depth
- Computational complexity
- Optimization difficulties
Improving optimization
- Careful initialization
- Random initialization will not work
- Exploding gradient
- Have to do some symmetry breaking
- Vanishing gradient
- Sophisticated optimizers
- Normalization layers
- Scale the activation to the right range for optimization to be easy
- Essential!
- Network design
- ResNets
- Adding residual
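As one concrete example of careful initialization (a standard recipe, stated here as an assumption rather than the exact scheme from the lecture), He initialization scales weights by the fan-in so that activation magnitudes stay roughly constant with depth:

```python
import torch
from torch import nn

layer = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# He (Kaiming) initialization: weight variance is matched to the fan-in and the
# ReLU nonlinearity, counteracting exploding/vanishing activations.
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
nn.init.zeros_(layer.bias)

x = torch.randn(8, 64, 32, 32)
print(layer(x).std())    # stays on the same order of magnitude as the input
```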
GoogLeNet
Inception
Batch normalization
- Reduces sensitivity to initialization.
- More robust to large learning rates
- Introduces stochasticity and acts as a regularizer (introducing noise into the model can make it more robust)
- Caveat: a common source of bugs (see the sketch below)
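A short PyTorch sketch of batch normalization in use; the train/eval switch is exactly where such bugs tend to creep in:

```python
import torch
from torch import nn

bn = nn.BatchNorm2d(16)
x = torch.randn(8, 16, 32, 32)

bn.train()
y_train = bn(x)   # normalizes with this batch's statistics; updates running estimates

bn.eval()
y_eval = bn(x)    # normalizes with the running statistics accumulated during training

# The two modes give different outputs, so forgetting to switch the model to
# eval mode at test time silently changes predictions.
print(torch.allclose(y_train, y_eval))   # False
```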
ResNet: residual connections
Residual connections facilitate training deeper networks.
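A minimal residual block in PyTorch (the exact layer ordering is an assumption; real ResNet blocks also use batch normalization and projection shortcuts):

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the block only needs to learn a residual on top of the identity."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))
        # The skip connection lets gradients flow through the identity path,
        # which is what makes very deep stacks trainable.
        return self.relu(x + residual)

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```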
DenseNet: connect layers to all previous layers
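A minimal sketch of the dense connectivity pattern (growth rate and layer count are illustrative; DenseNet itself also uses batch norm and bottleneck layers): instead of adding, each layer concatenates all previous feature maps.

```python
import torch
from torch import nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all feature maps produced so far."""
    def __init__(self, channels, growth, n_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(channels + i * growth, growth, kernel_size=3, padding=1)
            for i in range(n_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(torch.relu(layer(torch.cat(features, dim=1))))
        return torch.cat(features, dim=1)

x = torch.randn(1, 16, 32, 32)
print(DenseBlock(16, growth=8, n_layers=3)(x).shape)   # torch.Size([1, 40, 32, 32])
```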
Squeeze-and-excitation networks
Features can incorporate global context.
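A sketch of a squeeze-and-excitation block (the reduction ratio and layer choices are common defaults, assumed here): global average pooling summarizes the whole feature map, and the resulting vector gates each channel.

```python
import torch
from torch import nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # one number per channel: global context
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)                    # "squeeze": (B, C)
        w = self.excite(w).view(b, c, 1, 1)               # "excite": per-channel gates in [0, 1]
        return x * w                                      # rescale local features with global context

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)                               # torch.Size([2, 64, 32, 32])
```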
AmoebaNet: neural architecture search
- Architecture found by evolution
- Searches over acyclic graphs composed of predefined layers
Reducing complexity
- Depthwise convolutions
- Separable convolutions (a depthwise separable convolution is sketched after this list)
- Inverted bottlenecks
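A PyTorch sketch (with illustrative channel counts) of how a depthwise separable convolution reduces parameters compared with a standard convolution:

```python
from torch import nn

channels_in, channels_out = 64, 128

# Standard convolution: every output channel mixes all input channels spatially.
standard = nn.Conv2d(channels_in, channels_out, kernel_size=3, padding=1)

# Depthwise separable convolution: a per-channel spatial convolution
# (groups = number of channels) followed by a 1x1 pointwise convolution
# that mixes channels.
depthwise = nn.Conv2d(channels_in, channels_in, kernel_size=3, padding=1, groups=channels_in)
pointwise = nn.Conv2d(channels_in, channels_out, kernel_size=1)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(n_params(standard))                         # 73856
print(n_params(depthwise) + n_params(pointwise))  # 8960 -> roughly 8x fewer parameters
```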
5.Advanced topics
Data augmentation
- By design, convnets are robust against translation
- Data augmentation makes them robust against other transformations: rotation, scaling, shearing, warping (a minimal example follows this list).
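A minimal sketch using torchvision's transform pipeline; the specific transforms and their parameters are illustrative assumptions, not the augmentations used in the lecture:

```python
from torchvision import transforms

# Each time a training image is loaded, it is randomly perturbed, so the network
# sees many transformed variants of the same example.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                  # rotation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),    # scaling / cropping
    transforms.RandomAffine(degrees=0, shear=10),           # shearing
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```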
Other topics to explore
- Pre-training and fine-tuning
- Group equivariant convnets: invariance to e.g. rotation
- Recurrence and attention: other building blocks to exploit topological structure. (Convolution is not the only thing we can do to exploit grid structure!)
Beyond image recognition
What else can we do with Convnets?
Generative models of images
- Generative adversarial nets
- Variational autoencoders
- Autoregressive models
More convnets
- Representation learning and self-supervised learning
- Convnets for video, audio, text, graphs
Convolutional neural networks replaced handcrafted features with handcrafted architectures.
Prior knowledge is not obsolete: it is merely incorporated at a higher level of abstraction.