4C16: Deep Learning and its Applications
Module Information
Module Descriptor
Prerequisites
I Introduction to Machine Learning
Introduction
Deep Learning, Machine Learning, A.I.
Early Deep Learning Successes
Image Classification
Scene Understanding
Image Captioning
Machine Translation
Multimedia Content
Game Playing
Reasons for Success
Global Reach
Genericity and Systematicity
Simplicity and Democratisation
Impact
In Summary
1 Linear Regression/Least Squares
1.1 Model and Notations
1.2 Optimisation
1.3 Least Squares in Practice
1.3.1 A Simple Affine Example
1.3.2 Transforming the Input Features
1.3.3 Polynomial Fitting
1.4 Underfitting
1.5 Overfitting
1.6 Regularisation
1.7 Maximum Likelihood
1.8 Loss, Feature Transforms, Noise
1.8.1 Example 1: Regression Towards the Mean
1.8.2 Example 2
1.9 Take Away
2 Logistic Regression
2.1 Introductory Example
2.2 Linear Approximation
2.3 General Linear Model
2.4 Logistic Model
2.5 Maximum Likelihood
2.6 Optimisation: Gradient Descent
2.7 Example
2.8 Multiclass Classification
2.9 Multinomial Logistic Regression
2.10 Softmax Optimisation
2.11 Take Away
3 Know your Classics
3.1 k-nearest neighbours
3.2 Decision Trees
3.2.1 See Also
3.3 Linear SVM
3.4 No Free-Lunch Theorem
3.5 Kernel Trick
3.5.1 The Problem with Feature Expansions
3.5.2 Step 1: re-parameterisation
3.5.3 Step 2: the Kernel Functions
3.5.4 Understanding the RBF
3.5.5 Support Vectors
3.5.6 What does it look like?
3.5.7 Remarks
3.6 Take Away
3.6.1 See Also
4 Evaluating Classifier Performance
4.1 Metrics for Binary Classifiers
4.1.1 Confusion Matrix
4.1.2 Recall/Sensitivity/True Positive Rate (TPR)
4.1.3 Precision
4.1.4 False Positive Rate (FPR)
4.1.5 Accuracy
4.1.6 F1 Score
4.1.7 You Need Two Metrics
4.1.8 ROC curve
4.1.9 ROC-AUC
4.1.10 Average Precision
4.2 Multiclass Classifiers
4.3 Training/Validation/Testing Sets
4.4 Take Away
II Deep Neural Networks
5 Feedforward Neural Networks
5.1 What is a (Feed Forward) Neural Network?
5.1.1 A Graph of Differentiable Operations
5.1.2 Units and Artificial Neurons
5.2 Biological Neurons
5.3 Deep Neural Networks
5.4 Universal Approximation Theorem
5.5 Example
5.6 Training
5.7 Back-Propagation
5.7.1 Computing the Gradient
5.7.2 The Chain Rule
5.7.3 Back-Propagating with the Chain Rule
5.7.4 Vanishing Gradients
5.8 Optimisations for Training Deep Neural Networks
5.8.1 Mini-Batch and Stochastic Gradient Descent
5.8.2 More Advanced Gradient Descent Optimisers
5.9 Constraints and Regularisers
5.9.1 L2 regularisation
5.9.2 L1 regularisation
5.10 Dropout & Noise
5.11 Monitoring and Training Diagnostics
5.12 Take Away
5.13 Useful Resources
6 Convolutional Neural Networks
6.1 Convolution Filters
6.2 Padding
Example
6.3 Reducing the Tensor Size
6.3.1 Stride
6.3.2 Max Pooling
Example
6.4 Increasing the Tensor Size
6.5 Architecture Design
6.6 Example: VGG16
6.7 Visualisation
6.7.1 Retrieving images that maximise a neuron activation
6.7.2 Engineering Exemplars
6.8 Take Away
6.9 Useful Resources
III Advanced Architectures
7 Advances in Network Architectures
7.1 Transfer Learning
7.1.1 Re-Using Pre-Trained Networks
7.1.2 Domain Adaptation and Vanishing Gradients
7.1.3 Normalisation Layers
7.1.4 Batch Normalisation
7.2 Going Deeper
7.2.1 GoogLeNet: Inception
7.2.2 ResNet: Residual Network
7.3 A Modern Training Pipeline
7.3.1 Data Augmentation
7.3.2 Initialisation
7.3.3 Optimisation
7.3.4 Take Away
8 Recurrent Neural Networks
8.1 A Feed Forward Network Rolled Out Over Time
8.2 Application Example: Character-Level Language Modelling
8.3 Training: Back-Propagation Through Time
8.4 Dealing with Long Sequences
8.4.1 LSTM
8.4.2 GRU
8.4.3 Gated Units
8.5 Application: Image Caption Generator
8.6 Take Away
8.7 Limitations of RNNs and the Rise of Transformers
9 Generative Models
9.1 Generative Adversarial Networks (GAN)
9.2 AutoEncoders
9.2.1 Definition
9.2.2 Examples
9.2.3 Dimension Compression
9.2.4 Variational Auto Encoders (VAE)
9.2.5 Multi-Task Design
9.3 Deep Auto-Regressive Models
9.4 Take Away
10 Attention Mechanism and Transformers
10.1 Motivation
10.1.1 The Problem with CNNs and RNNs
10.1.2 The Problem with Positional Dependencies
10.2 Attention Mechanism
10.2.1 Core Mechanism of a Dot-Product Attention Layer
10.2.2 No Trainable Parameters
10.2.3 Self-Attention
10.2.4 Computational Complexity
10.2.5 A Perfect Tool for Multi-Modal Processing
10.2.6 The Multi-Head Attention Layer
10.2.7 Take Away (Attention Mechanism)
10.3 Transformers
10.3.1 An Encoder-Decoder Architecture
10.3.2 Positional Encoder
10.3.3 Take Away (Transformers)
11 Large Language Models
11.1 Basic Principle
11.2 Building Your Own LLM (in 3 easy steps)
11.2.1 Scrape the Internet
11.2.2 Tokenisation
11.2.3 Architecture: All You Need is Attention
11.2.4 Training: All You Need is 6,000 GPUs and $2M
11.2.5 Fine-Tuning: Training the Assistant Model
11.2.6 Summary: How to Make a Multi-Billion Dollar Company
11.3 Safety, Prompt Engineering
11.3.1 Measuring Bias and Toxicity
11.3.2 Prompt Hacking
11.3.3 Prompt Engineering
11.4 Emergent Features
11.4.1 Emergent Features: An Illusion of Scale?
11.5 The Future of LLMs
11.5.1 Scaling Laws
11.5.2 Artificial General Intelligence
11.5.3 The Future of LLMs: Climate Change
11.6 Take Away
Conclusion
Appendix
A Notes
A.1 Universal Approximation Theorem
A.2 Why Does \(L_1\) Regularisation Induce Sparsity?
A.3 Kernel Trick
References