Advanced Models for Computer Vision
Goal of the lecture: Know how to redefine the building blocks to perform different visual tasks using different inputs and different forms of supervision.
1. Supervised image tasks (beyond classification)
Task definitions | Train and eval | Tricks of the trade |
---|---|---|
Object detection | Models and losses | Hard negative mining |
Semantic segmentation | Metrics and benchmarks | Transfer learning |
Tasks - increasing granularity
classification -> object detection -> semantic segmentation -> instance segmentation
Object detection
A Multitask problem: Classification & Localization
Inputs | Targets |
---|---|
RGB image H × W × 3 | Class label & object bounding box (for every object present in the scene) |
Bounding box prediction
How to learn to predict real-valued bounding box coordinates?
Recap: Softmax + cross entropy
Assign data points to categories; the output is discrete.
In classification, the magnitude of a mistake is not quantifiable: class labels are unordered, so every error counts equally.
In classification, the output is discrete; in regression, the output is continuous.
Quadratic loss for regression
Minimize the MSE over samples.
\[\ell_2(x, t) = \|x - t\|^2\]
The question is: how to deal with multiple targets (e.g., the four bounding-box coordinates)?
Classification then regression
Convert regression into classification by discretizing the output values, then refine through regression:
Bin the output space and use a one-hot label over the bins; once the coarse bin is chosen, regress to the exact value within it (a minimal sketch follows below).
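A minimal PyTorch sketch of this discretize-then-refine idea, assuming a scalar target in [0, 1) and a feature vector from some backbone; `ClassifyThenRegress`, `NUM_BINS`, and the loss weighting are hypothetical choices for illustration only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_BINS = 10            # discretize the output range [0, 1) into 10 bins
BIN_WIDTH = 1.0 / NUM_BINS

class ClassifyThenRegress(nn.Module):
    def __init__(self, in_dim=128):
        super().__init__()
        self.cls_head = nn.Linear(in_dim, NUM_BINS)  # which bin?
        self.reg_head = nn.Linear(in_dim, NUM_BINS)  # offset inside each bin

    def forward(self, feats):
        return self.cls_head(feats), self.reg_head(feats)  # (B, NUM_BINS) each

def loss_fn(bin_logits, offsets, target):
    # split the continuous target into a bin index plus a residual offset
    bin_idx = (target / BIN_WIDTH).long().clamp(max=NUM_BINS - 1)
    residual = target - bin_idx.float() * BIN_WIDTH
    cls_loss = F.cross_entropy(bin_logits, bin_idx)
    # only penalize the offset predicted for the ground-truth bin
    picked = offsets.gather(1, bin_idx.unsqueeze(1)).squeeze(1)
    return cls_loss + F.mse_loss(picked, residual)

feats = torch.randn(4, 128)
target = torch.rand(4)               # ground truth in [0, 1)
model = ClassifyThenRegress()
print(loss_fn(*model(feats), target))
```

The classification head picks a coarse bin, and the regression loss only applies to the offset of the ground-truth bin, so each head gets a clean learning signal.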
Summary
Property | Classification | Regression |
---|---|---|
Basic | map inputs to predefined classes | map inputs to continuous values |
Output | discrete values | continuous values |
Nature of the data | unordered data | ordered data |
Algorithms | logistic regression, decision trees, neural networks | linear regression, neural networks |
Case study 1: Faster R-CNN
Two-stage detector
- Identify good candidate bounding boxes
- Discretize bbox space
- anchor points for (x,y)
- scales and ratios for (h,w)
- n candidates per anchor
- predict objectness score for each box
- sort and keep the top K

Deep learning is very flexible: we can replace pieces of the puzzle, but every piece must be differentiable so that we can backprop through it. The two stages, with the non-differentiable building block first:
- Discretize bbox space (non-differentiable; a minimal anchor-generation sketch follows below)
- Classify and refine the surviving candidates
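A minimal sketch of the anchor-generation step, assuming a stride-16 feature map; the scales and ratios mirror typical Faster R-CNN settings (3 scales × 3 aspect ratios, so 9 candidates per anchor point), but `make_anchors` is a hypothetical helper, not the paper's code:

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y in range(feat_h):          # anchor points (x, y) on the feature grid
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:         # scales and ratios give (h, w)
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)         # (feat_h * feat_w * 9, 4) boxes

anchors = make_anchors(feat_h=38, feat_w=50)  # e.g. a 600x800 image, stride 16
print(anchors.shape)                          # (17100, 4)
```

Note that this enumeration (and the subsequent top-K selection) is exactly the non-differentiable part: gradients flow through the predicted scores and offsets, not through the box grid itself.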
Case study 2: RetinaNet (one-stage detector)
Most of the candidate boxes are easy negatives: poor learning signal
Issue with one-stage detectors
- Most of the candidate bboxes are background and easy to classify
- The accumulated loss of the many easy examples overwhelms the loss of rare useful examples
- Faster R-CNN's first stage prunes away the easy negatives
- One-stage detectors employ hard negative mining heuristics
- RetinaNet instead uses the focal loss (a minimal sketch follows below)
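A minimal sketch of the binary focal loss used by RetinaNet; `alpha=0.25` and `gamma=2.0` are the paper's defaults, the rest is illustrative:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits, targets: (N,) tensors; targets are 0 (background) or 1 (object)
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)        # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([-4.0, 0.1])   # an easy negative, a hard positive
targets = torch.tensor([0.0, 1.0])
print(focal_loss(logits, targets))
```

The `(1 - p_t) ** gamma` factor shrinks the loss of confidently classified easy examples toward zero, so the rare hard examples dominate the gradient.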
Semantic segmentation
Bounding boxes are not good representations for certain types of objects; we need a more refined, per-pixel representation.
Inputs | Targets |
---|---|
RGB Image | Class label for every pixel |
Case study: U-Net
- Encoder-decoder model
- Skip connections to preserve spatial detail (a toy sketch follows below)
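A toy, one-level U-Net-style model just to make the skip connection concrete; the real U-Net stacks several encoder/decoder levels, and `TinyUNet` is a hypothetical illustration:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=21):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)
        # the decoder sees upsampled features concatenated with encoder features
        self.dec = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(64, num_classes, 1)   # per-pixel class logits

    def forward(self, x):
        e = self.enc(x)                       # (B, 64, H, W)
        m = self.mid(self.down(e))            # (B, 128, H/2, W/2)
        u = self.up(m)                        # (B, 64, H, W)
        d = self.dec(torch.cat([u, e], 1))    # skip connection preserves detail
        return self.head(d)                   # (B, num_classes, H, W)

out = TinyUNet()(torch.randn(1, 3, 64, 64))
print(out.shape)   # torch.Size([1, 21, 64, 64])
```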
Evaluation metrics
Classification
- Accuracy: percentage of correct predictions (a sketch follows below)
  - Top-1: the top prediction is the correct class
  - Top-5: the correct class is among the top-5 predictions
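A minimal sketch of top-k accuracy covering both Top-1 (`k=1`) and Top-5 (`k=5`); `topk_accuracy` is a hypothetical helper:

```python
import torch

def topk_accuracy(logits, labels, k=5):
    # logits: (B, C) class scores, labels: (B,) ground-truth class indices
    topk = logits.topk(k, dim=1).indices              # (B, k) predicted classes
    correct = (topk == labels.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()

logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```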
Object detection and segmentation
- Intersection-over-union (IoU): non-differentiable, so it is used only for evaluation, not as a training loss (a sketch follows below)
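A minimal IoU sketch for two boxes in `(x1, y1, x2, y2)` format; the helper name is hypothetical:

```python
def iou(box_a, box_b):
    # intersection rectangle (empty if the boxes do not overlap)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)   # intersection over union

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))     # 25 / 175 ≈ 0.143
```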
Tricks of the trade
Transfer learning
Let \(\mathcal{D} = \{\mathcal{X}, p(x)\}\), with samples \(X = \{x_1, \dots, x_n\}\), be a domain, and let \(\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}\), with \(f(x) = y\), be a task defined on this domain. Given a source domain and task \((\mathcal{D}_s, \mathcal{T}_s)\) and a target domain and task \((\mathcal{D}_t, \mathcal{T}_t)\), reuse the knowledge learnt by \(f_s\) in \(f_t\).
The intuition is to reuse knowledge: features are shared across tasks and datasets (a minimal fine-tuning sketch follows below).
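A minimal fine-tuning sketch with recent torchvision, assuming an ImageNet-pretrained ResNet-18 as the source model and a hypothetical 10-class target task:

```python
import torch.nn as nn
from torchvision import models

# Source knowledge: backbone pretrained on ImageNet (the source domain/task)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                       # freeze the shared features

# Target task: replace the head with a fresh 10-class classifier
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the new head's parameters get updated by the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
```

Unfreezing deeper layers (or the whole backbone, at a lower learning rate) is the usual next step when the target dataset is large enough.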
Transfer learning across different domains
Sim2Real
- Train in simulation using RL (the source domain \(\mathcal{D}_s\))
- Use Automatic Domain Randomization: data augmentation + hard negative mining
- Test in the real world (the target domain \(\mathcal{D}_t\))
2. Supervised classification (beyond single-image input)
Experiment
Video
- Motion - cues for object recognition during learning
- Natural data augmentation: translation, scale, 3D rotation, camera motion, light changes
Optical flow estimation
Inputs | Targets |
---|---|
Pairs of RGB images | Dense flow map: a 2D translation (displacement) per pixel |
Case study: FlowNet
- Encoder-decoder architecture similar to U-Net
- Supervised training
- Loss: Euclidean distance between predicted and ground-truth flow vectors (endpoint error; a minimal sketch follows below)
- Flying Chairs dataset (synthetic)
- Sim2Real transfer
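A minimal sketch of the endpoint-error loss, assuming flow maps of shape `(B, 2, H, W)`; `epe_loss` is a hypothetical name:

```python
import torch

def epe_loss(pred_flow, gt_flow):
    # Euclidean distance between predicted and ground-truth 2D displacement
    # vectors, averaged over all pixels (the endpoint error, EPE)
    return torch.norm(pred_flow - gt_flow, p=2, dim=1).mean()

pred = torch.randn(2, 2, 64, 64)
gt = torch.randn(2, 2, 64, 64)
print(epe_loss(pred, gt))
```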
Video models using 3D convolutions
Video as a volume
- stack frames into a volume of shape T × H × W × 3
- apply 3D convolutions
Properties of 3D convolutions
- 3D convolutions are non-causal: the kernel also sees future frames
- masked 3D convolutions are causal
Strided, dilated, padded, … convolutions apply in 3D as well.
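A minimal sketch of a 3D convolution over a video volume; PyTorch's `nn.Conv3d` expects channels-first `(B, C, T, H, W)`, so the T × H × W × 3 stack is permuted first (sizes are illustrative):

```python
import torch
import torch.nn as nn

video = torch.randn(1, 16, 64, 64, 3)     # (B, T, H, W, 3) stacked frames
video = video.permute(0, 4, 1, 2, 3)      # (B, 3, T, H, W) channels-first

# kernel spans 3 frames in time and 3x3 in space
conv3d = nn.Conv3d(in_channels=3, out_channels=32,
                   kernel_size=(3, 3, 3), stride=1, padding=1)
out = conv3d(video)
print(out.shape)                           # torch.Size([1, 32, 16, 64, 64])
```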
Inputs | Targets |
---|---|
RGB video T × H × W × 3, plus an optional flow map | Action label |
Transfer learning returns
Intuition: an image tiled along the time axis is a video of a static scene filmed with a fixed camera, so image-level knowledge can transfer to video models.
Challenges in video processing
- Difficult to obtain labels
- Large memory requirements
- High latency
- High energy consumption
Improve efficiency of video models
- Inspiration from biological systems
- Maximize parallelism to increase throughput and reduce latency
- Exploit redundancies in the visual data to obtain frugal models
3. Image classification (beyond strong supervision)
Labeling is tedious; reducing the need for labels is a research topic in its own right.
Self-supervision - Metric learning
Standard losses (e.g., cross-entropy, MSE)
- learn mapping between inputs and output distribution/values
Metric learning
- learn to predict distances between inputs given some similarity measure (e.g., the same person across different photos; image retrieval)
- Contrastive loss
- Triplet loss
- State-of-the-art on representation learning
Applications
- Multimodal self-supervised representation (image+sound)
- Information retrieval
- Low-shot face recognition
Metric learning
Contrastive loss (margin loss)
- Dataset: pairs <r0, r1, y>, where y indicates whether the pair matches
Triplet loss
- Dataset: triplets <r0, r1, rn> (anchor, positive, negative)
- better than contrastive loss: relative distances are more meaningful than a fixed margin
- hard negative mining to select informative triplets (sketches of both losses follow below)
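Minimal sketches of both metric-learning losses, assuming embedding batches of shape `(B, D)`; the margin values and function names are illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(r0, r1, y, margin=1.0):
    # y = 1 for matching pairs: pull together; y = 0: push apart up to `margin`
    d = F.pairwise_distance(r0, r1)
    return (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()

def triplet_loss(r0, r1, rn, margin=0.2):
    # make the positive closer to the anchor than the negative, by >= `margin`
    d_pos = F.pairwise_distance(r0, r1)
    d_neg = F.pairwise_distance(r0, rn)
    return F.relu(d_pos - d_neg + margin).mean()

r0, r1, rn = (torch.randn(8, 128) for _ in range(3))
y = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(r0, r1, y), triplet_loss(r0, r1, rn))
```

Note how the triplet loss only constrains relative distances, which is exactly why it tends to work better than a fixed-margin contrastive loss.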
New state-of-the-art in representation learning
- Composition of data augmentations
- Learnable non-linear transformation
- Larger mini-batches and longer training
4.Open questions
- Is vision solved? What does it mean to solve vision?
- human level scene understanding - what benchmarks?
- How to scale systems up?
- model parallelism, better hardware, less supervision - more common sense
- What are good visual representations for action?