
schneider128k / machine_learning_course

Licence: other
Artificial intelligence/machine learning course at UCF in Spring 2020 (Fall 2019 and Spring 2019)

Programming Languages

Jupyter Notebook
TeX
Python

Projects that are alternatives of or similar to machine learning course

GDLibrary
Matlab library for gradient descent algorithms: Version 1.0.1
Stars: ✭ 50 (+6.38%)
Mutual labels:  linear-regression, logistic-regression, gradient-descent
Ds and ml projects
Data Science & Machine Learning projects and tutorials in python from beginner to advanced level.
Stars: ✭ 56 (+19.15%)
Mutual labels:  linear-regression, logistic-regression, keras-tensorflow
Deeplearning.ai
This repository contains personal notes and implementation code for the related courses offered by deeplearning.ai.
Stars: ✭ 181 (+285.11%)
Mutual labels:  linear-regression, logistic-regression, keras-tensorflow
Face.evolve.pytorch
🔥🔥High-Performance Face Recognition Library on PaddlePaddle & PyTorch🔥🔥
Stars: ✭ 2,719 (+5685.11%)
Mutual labels:  feature-extraction, data-augmentation, fine-tuning
models-by-example
By-hand code for models and algorithms. An update to the 'Miscellaneous-R-Code' repo.
Stars: ✭ 43 (-8.51%)
Mutual labels:  linear-regression, logistic-regression, gradient-descent
Brihaspati
Collection of various implementations and Codes in Machine Learning, Deep Learning and Computer Vision ✨💥
Stars: ✭ 53 (+12.77%)
Mutual labels:  linear-regression, logistic-regression
25daysinmachinelearning
I will update this repository to learn Machine learning with python with statistics content and materials
Stars: ✭ 53 (+12.77%)
Mutual labels:  linear-regression, logistic-regression
Deeplearning
Deep Learning From Scratch
Stars: ✭ 66 (+40.43%)
Mutual labels:  linear-regression, logistic-regression
Mylearn
machine learning algorithm
Stars: ✭ 125 (+165.96%)
Mutual labels:  linear-regression, logistic-regression
Univariate Linear Regression Gradient Descent Javascript
⭐️ Univariate Linear Regression with Gradient Descent in JavaScript (Vectorized)
Stars: ✭ 9 (-80.85%)
Mutual labels:  linear-regression, gradient-descent
Isl Python
Solutions to labs and exercises from An Introduction to Statistical Learning, as Jupyter Notebooks.
Stars: ✭ 108 (+129.79%)
Mutual labels:  linear-regression, logistic-regression
flatiron-school-data-science-curriculum-resources
Lesson material on data science and machine learning topics/concepts
Stars: ✭ 118 (+151.06%)
Mutual labels:  linear-regression, gradient-descent
Uc Davis Cs Exams Analysis
📈 Regression and Classification with UC Davis student quiz data and exam data
Stars: ✭ 33 (-29.79%)
Mutual labels:  linear-regression, logistic-regression
Ml
A set of machine learning experiments in Clojure
Stars: ✭ 30 (-36.17%)
Mutual labels:  linear-regression, logistic-regression
100 Days Of Ml Code
100 Days of ML Coding
Stars: ✭ 33,641 (+71476.6%)
Mutual labels:  linear-regression, logistic-regression
Dat8
General Assembly's 2015 Data Science course in Washington, DC
Stars: ✭ 1,516 (+3125.53%)
Mutual labels:  linear-regression, logistic-regression
Machine Learning Models
Decision Trees, Random Forest, Dynamic Time Warping, Naive Bayes, KNN, Linear Regression, Logistic Regression, Mixture Of Gaussian, Neural Network, PCA, SVD, Gaussian Naive Bayes, Fitting Data to Gaussian, K-Means
Stars: ✭ 160 (+240.43%)
Mutual labels:  linear-regression, logistic-regression
Ml Course
Starter code of Prof. Andrew Ng's machine learning MOOC in R statistical language
Stars: ✭ 154 (+227.66%)
Mutual labels:  linear-regression, gradient-descent
Machine learning
Study and implementation of the main Machine Learning algorithms in Jupyter Notebooks.
Stars: ✭ 161 (+242.55%)
Mutual labels:  linear-regression, logistic-regression
Tensorflow Book
Accompanying source code for Machine Learning with TensorFlow. Refer to the book for step-by-step explanations.
Stars: ✭ 4,448 (+9363.83%)
Mutual labels:  linear-regression, logistic-regression

CAP 4630 Artificial Intelligence

Undergraduate course on ML/AI at the University of Central Florida.

Overview


Fundamental machine learning concepts


Python, NumPy, and matplotlib


Effect of learning rate on gradient descent for finding minima of univariate functions

Let's examine what can go wrong when applying gradient descent with a poorly chosen learning rate: we could fail to find any solution due to divergence, or we could get stuck in a bad local minimum. The following notebook allows us to apply gradient descent for finding minima of univariate functions. (Univariate means that the functions depend on only one variable.)
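Here is a minimal numpy sketch of the idea (the quartic function and the learning rates below are my own illustrative choices, not the ones used in the notebook):

```python
import numpy as np

def gradient_descent(df, x0, learning_rate, num_steps):
    """Minimize a univariate function, given its derivative df."""
    x = x0
    for _ in range(num_steps):
        x = x - learning_rate * df(x)
    return x

# Example: f(x) = x^4 - 3x^2 + x has two local minima (near -1.3 and 1.17).
df = lambda x: 4 * x**3 - 6 * x + 1

print(gradient_descent(df, x0=2.0, learning_rate=0.01, num_steps=1000))   # converges to the minimum near 1.17
print(gradient_descent(df, x0=-2.0, learning_rate=0.01, num_steps=1000))  # a different start finds the other minimum
print(gradient_descent(df, x0=2.0, learning_rate=0.5, num_steps=5))       # overshoots and diverges to a huge value
```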


Visualization of bivariate functions

The loss function for a deep neural network depends on millions of parameters. Such functions are called multivariate because they depend on multiple variables. It is no longer possible to easily visualize multivariate functions.

The following notebooks present two methods for visualizing bivariate functions, that is, functions that depend on exactly two variables. Such functions define surfaces in 3D. Think of the surface of a mountain range.
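The two standard methods are surface plots and contour plots. A minimal matplotlib sketch (the bowl-shaped function x² + y² is just an example):

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid over the domain of f(x, y) = x^2 + y^2.
x, y = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
z = x**2 + y**2

fig = plt.figure(figsize=(10, 4))

# Method 1: a surface plot in 3D.
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot_surface(x, y, z, cmap='viridis')

# Method 2: a contour plot (the level sets, seen from above).
ax2 = fig.add_subplot(1, 2, 2)
ax2.contour(x, y, z, levels=20)

plt.show()
```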


Linear regression using gradient descent - numpy implementation

In the first implementation, we consider the weight and bias separately and implement stochastic gradient descent. It is easy to see the correspondence between the code and the mathematical expression for the gradient (see section 1 of the above notes).

In the second implementation, we combine the weight and bias into one vector. We also consider three versions of gradient descent: batch, mini-batch, and stochastic gradient descent. We use a vectorized implementation, that is, all data in a batch is processed in parallel. It is more difficult to see the correspondence between the code and the mathematical expression for the gradient (see subsection 2.2 of the above notes).

This vectorized implementation of gradient descent for linear regression with a single feature can be generalized to linear regression with multiple features (you have to do this for n=2 for one of the homework problems).
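A minimal sketch of the vectorized batch version, with the bias absorbed into the weight vector (toy data; the notebook's variable names and details will differ):

```python
import numpy as np

# Toy data: y = 3x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 3 * x + 2 + 0.1 * rng.normal(size=(100, 1))

# Absorb the bias into the weight vector by appending a column of ones.
X = np.hstack([x, np.ones((100, 1))])
w = np.zeros((2, 1))

learning_rate = 0.1
for epoch in range(500):
    y_hat = X @ w                          # predictions for the whole batch at once
    grad = 2 / len(X) * X.T @ (y_hat - y)  # gradient of the mean squared error
    w -= learning_rate * grad

print(w.ravel())  # should be close to [3, 2]
```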


Linear regression using the normal equation - numpy implementation

There is a closed-form solution for choosing the best weights and bias for linear regression. The optimal solution achieves the smallest squared error loss. I will not cover this in class. If you are interested, you can find more details in the notes Linear regression using the normal equation.
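For the curious, a minimal numpy sketch of the closed-form solution w = (XᵀX)⁻¹Xᵀy, on the same kind of toy data as above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 3 * x + 2 + 0.1 * rng.normal(size=(100, 1))
X = np.hstack([x, np.ones((100, 1))])  # bias column of ones

# Normal equation: minimizes the squared error loss exactly.
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w.ravel())  # close to [3, 2]

# Numerically, a least squares solver is preferable to an explicit inverse.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```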


TensorFlow and Keras

Keras is a high-level deep learning API that allows you to easily build, train, evaluate, and execute all sorts of neural networks. Its documentation (or specification) is available at https://keras.io. The reference implementation, https://github.com/keras-team/keras, also called Keras, was developed by Francois Chollet as part of a research project and released as an open source project in March 2015. To perform the heavy computations required by neural networks, this reference implementation relies on a computation backend. At present, you can choose from three popular open source deep learning libraries: TensorFlow, Microsoft Cognitive Toolkit (CNTK), and Theano. Therefore, to avoid any confusion, we will refer to this reference implementation as multibackend Keras.

Since late 2016, other implementations have been released. You can now run Keras on Apache MXNet, Apple's Core ML, JavaScript or TypeScript (to run Keras code in a web browser), and PlaidML (which can run on all sorts of GPU devices, not just Nvidia).

TensorFlow 2 itself now comes bundled with its own Keras implementation, tf.keras. It only supports TensorFlow as the backend, but it has the advantage of offering some very useful extra features: for example, it supports TensorFlow's Data API, which makes it easy to load and preprocess data efficiently.


tf.keras

In this course, we will use TensorFlow 2.x and tf.keras. Always make sure that you use the correct versions of TensorFlow and Keras.
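A quick way to check the versions from inside a notebook:

```python
import tensorflow as tf

print(tf.__version__)        # should start with 2.
print(tf.keras.__version__)  # version of the bundled tf.keras
```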


Linear regression - Keras implementation

Let's see how we can solve the simplest case of linear regression in Keras.
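The key observation is that a single dense unit with no activation computes exactly the linear regression model ŷ = wx + b. A minimal sketch (toy data; hyperparameters are illustrative):

```python
import numpy as np
import tensorflow as tf

# Toy data: y = 3x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1)).astype('float32')
y = (3 * x + 2 + 0.1 * rng.normal(size=(100, 1))).astype('float32')

# One dense unit, no activation: this is the linear regression model.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss='mse')
model.fit(x, y, epochs=100, verbose=0)

w, b = model.layers[0].get_weights()
print(w, b)  # should be close to 3 and 2
```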


Keras datasets

We are going to work with some simple datasets to start learning about neural networks. The collection tf.keras.datasets contains only a few simple datasets and provides an elementary way of loading them. (Later, we will learn about TensorFlow datasets, which contains nearly 100 datasets and provides high-performance input data pipelines for loading them.)
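For example, MNIST loads as plain numpy arrays:

```python
import tensorflow as tf

# Each dataset in tf.keras.datasets comes pre-split into train and test arrays.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
print(x_test.shape, y_test.shape)    # (10000, 28, 28) (10000,)
```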


Keras basics

Let's briefly describe Keras concepts such as dense / convolutional / recurrent layers, sequential models, functional API, activation functions, loss functions, optimizers, and metrics.
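To preview two of these concepts, here is the same small model built both ways (layer sizes are illustrative):

```python
import tensorflow as tf

# Sequential model: a linear stack of layers.
sequential = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Functional API: the same model, built by wiring tensors explicitly.
inputs = tf.keras.Input(shape=(784,))
hidden = tf.keras.layers.Dense(64, activation='relu')(inputs)
outputs = tf.keras.layers.Dense(10, activation='softmax')(hidden)
functional = tf.keras.Model(inputs, outputs)

# Compiling attaches a loss function, an optimizer, and metrics to the model.
functional.compile(optimizer='rmsprop',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
```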


Keras models for classification of MNIST digits and fashion items

Before formally defining sequential neural networks with dense layers, let's look at some simple Keras models showing how to use such networks for classification. We consider the problems of classifying images from the MNIST digits dataset and the fashion items dataset. These problems are so-called multi-class / single-label classification problems.

Multi-class means that there are several classes. For instance, T-shirt, pullover or bag in the fashion items dataset.

Single-label means that classes are mutually exclusive. For instance, an image is either the digit 0, or the digit 1, etc. in the MNIST digits dataset.

The example neural networks in the notebooks below consist of three layers: an input, a hidden, and an output layer. They use the softmax activation function in the last (output) layer and the categorical cross entropy loss function because the problems are multi-class, single-label classification problems. They also use the relu activation function for the hidden layer.

These notebooks also show how to split datasets into training and test sets, and they discuss overfitting.
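A sketch of such a model for MNIST digits (layer sizes and epoch counts are illustrative, not necessarily those of the notebooks):

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32') / 255  # flatten and scale to [0, 1]
x_test = x_test.reshape(10000, 784).astype('float32') / 255

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),  # hidden layer
    tf.keras.layers.Dense(10, activation='softmax'),                    # one output neuron per class
])
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',  # for integer labels; use categorical_crossentropy with one-hot labels
              metrics=['accuracy'])

history = model.fit(x_train, y_train, epochs=5, batch_size=128,
                    validation_split=0.2)
print(model.evaluate(x_test, y_test))
```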

The notebook below uses pandas.DataFrame to display learning curves and to visually analyze predictions.
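Continuing the sketch above, the per-epoch metrics collected by fit can be plotted directly from a DataFrame:

```python
import pandas as pd
import matplotlib.pyplot as plt

# history.history maps metric names to lists of per-epoch values.
pd.DataFrame(history.history).plot()
plt.xlabel('epoch')
plt.show()
```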


Generalization, overfitting, and splitting dataset in train set and test set

The goal of machine learning is to obtain models that perform well on new unseen data. It can happen that a model performs perfectly on the training data, but fails on new data. This is called overfitting. The following notes explain briefly how to deal with this important issue.


Simple hold-out validation and K-fold validation
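A bare-bones sketch of the K-fold idea (the model-building step is only indicated by a comment):

```python
import numpy as np

def k_fold_indices(num_samples, k):
    """Split shuffled indices into k folds; each fold serves once as validation."""
    indices = np.random.permutation(num_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, val_idx

# The final validation score is the average over the k runs.
for train_idx, val_idx in k_fold_indices(num_samples=1000, k=5):
    pass  # build a fresh model, train on train_idx, evaluate on val_idx
```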


Binary classification, logistic regression, sigmoid activation, binary cross entropy loss

Logistic regression is used for binary classification problems. Binary means that there are only two classes. For instance, a movie review has to be classified as either positive (class 1) or negative (class 0). There is only one output neuron, whose activation indicates the probability of class 1. This output neuron uses the sigmoid activation function, which forces its activation to lie inside the interval [0, 1], that is, to be a valid probability.

The squared error loss could be used, but it is much better to use the binary cross entropy loss because it speeds up training. The notes below derive the gradient for the two combinations: sigmoid activation with squared error loss and sigmoid activation with binary cross entropy loss.

The notebook below presents an elementary method for preprocessing text data so that it can be input into a neural network. We will discuss more advanced methods for preprocessing text later.

This notebook also shows how we can use a validation set to monitor the performance of the model and subsequently choose a good number of epochs to prevent overfitting.
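A sketch in the spirit of this notebook, using IMDB movie reviews and multi-hot encoding (the details of the actual notebook may differ):

```python
import numpy as np
import tensorflow as tf

# Elementary preprocessing: encode each review as a multi-hot vector
# indicating which of the 10,000 most frequent words occur in it.
(train_data, train_labels), _ = tf.keras.datasets.imdb.load_data(num_words=10000)

def multi_hot(sequences, dim=10000):
    x = np.zeros((len(sequences), dim), dtype='float32')
    for i, seq in enumerate(sequences):
        x[i, seq] = 1.0
    return x

x_train = multi_hot(train_data)
y_train = np.asarray(train_labels, dtype='float32')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(10000,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),   # activation = P(class 1)
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

# Monitor a validation split to pick a good number of epochs.
model.fit(x_train, y_train, epochs=4, batch_size=512, validation_split=0.4)
```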


Multi-class / single-label classification, softmax activation, categorical cross entropy loss

We already talked briefly about multi-class / single-label classification, softmax activation, and categorical cross entropy loss when presenting Keras examples for classifying MNIST digits and fashion items.

The notes below explain the mathematics behind softmax activation and categorical cross entropy loss and derive the gradient for this combination of activation and loss.
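The punchline of that derivation is that, for softmax activation combined with categorical cross entropy, the gradient with respect to the logits z is simply softmax(z) − y. A quick numpy sanity check of this fact (my own illustration, not from the notes):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.0, 0.0])   # one-hot target

# Analytic gradient of the loss with respect to the logits.
grad = softmax(z) - y

# Numerical check by central finite differences.
eps = 1e-6
num = np.array([(cross_entropy(z + eps * np.eye(3)[i], y)
                 - cross_entropy(z - eps * np.eye(3)[i], y)) / (2 * eps)
                for i in range(3)])
print(np.allclose(grad, num))  # True
```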


Multi-class / multi-label classification

Imagine that you receive an image of a face and that you have to decide (a) if the person is smiling or not and (b) if the person is wearing glasses. Smiling and wearing glasses are independent of each other. This is an example of multi-class / multi-label classification.

Sigmoid activation functions are used in the output layer in multi-class / multi-label classification problems. The number of output neurons is equal to the number of classes, and each neuron uses the sigmoid activation function. The binary cross entropy loss is used for each output neuron.

We will look at some examples of multi-class / multi-label classification after introducing convolutional neural networks.
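In the meantime, here is what the output end of such a model looks like in Keras (the feature count and the two hypothetical labels are made up for illustration):

```python
import tensorflow as tf

num_features, num_classes = 128, 2   # hypothetical labels: [smiling, wearing_glasses]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(num_features,)),
    # One sigmoid per class: the activations are independent probabilities
    # and do NOT have to sum to 1, unlike with softmax.
    tf.keras.layers.Dense(num_classes, activation='sigmoid'),
])
# Binary cross entropy is applied to each output neuron separately.
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
```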


Underfitting, overfitting, and two simple methods for fighting overfitting: dropout and L1 / L2 regularization
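Both methods are one-liners in Keras. A minimal sketch (the layer sizes and regularization strengths are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    # L2 regularization adds 0.001 * sum(w^2) for this layer's weights to the loss.
    layers.Dense(128, activation='relu', input_shape=(784,),
                 kernel_regularizer=regularizers.l2(0.001)),
    # Dropout randomly zeroes 50% of the activations, during training only.
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax'),
])
```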


Notes on backpropagation algorithm for computing gradients in sequential neural networks with dense layers

These notes explain how to compute the gradients for neural networks consisting of multiple dense layers. I will not go over the mathematical derivation of the backpropagation algorithm. Fortunately, the gradients are computed automatically in Keras.

My notes are mostly based on chapter 2 "How the backpropagation algorithm works" of the book "Neural Networks and Deep Learning".

Numpy implementation of backpropagation algorithm

My code is based on the code described in chapter 5 "Getting started with neural networks" of the book "Deep Learning and the Game of Go".

You can also find implementations of neural networks from scratch in the book "Neural Networks and Deep Learning" and also in the book "Grokking Deep Learning".
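A compressed sketch of the core computation, in the notation of the "Neural Networks and Deep Learning" book (quadratic loss, sigmoid activations everywhere); the books organize this into proper layer and network classes:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A 784-30-10 network, stored as per-layer weight matrices and bias vectors.
rng = np.random.default_rng(0)
sizes = [784, 30, 10]
W = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((m, 1)) for m in sizes[1:]]

def backprop(x, y):
    """Return the gradients of the quadratic loss for one example (x, y)."""
    # Forward pass, storing all activations and weighted inputs z.
    activations, zs, a = [x], [], x
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Backward pass: delta holds dLoss/dz for the current layer.
    delta = (activations[-1] - y) * a * (1 - a)   # output layer; sigma'(z) = a(1-a)
    grads_W, grads_b = [], []
    for l in range(len(W) - 1, -1, -1):
        grads_W.insert(0, delta @ activations[l].T)
        grads_b.insert(0, delta)
        if l > 0:
            s = sigmoid(zs[l - 1])
            delta = (W[l].T @ delta) * s * (1 - s)  # propagate the error backwards
    return grads_W, grads_b
```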


Deep learning for computer vision (convolutional neural networks)

The notebooks below implement simple convolutional neural networks for classifying MNIST digits and fashion items.
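A sketch of such a network for MNIST digits (filter counts and epochs are illustrative):

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255  # add a channel axis
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.1)
```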


Transfer learning: classification of cats and dogs

TO DO: change to TensorFlow 2.x and tf.keras

based on Chapter 5 Deep learning for computer vision of the book Deep learning with Python by F. Chollet
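In tf.keras, the approach from that chapter looks roughly like this (a sketch of feature extraction with a frozen VGG16 base; the 150x150 input size follows the book, other details are assumptions):

```python
import tensorflow as tf

# Pretrained convolutional base (ImageNet weights), without the dense top.
base = tf.keras.applications.VGG16(weights='imagenet', include_top=False,
                                   input_shape=(150, 150, 3))
base.trainable = False   # freeze the base: only the new head is trained

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),   # cat vs dog is binary
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=2e-5),
              loss='binary_crossentropy', metrics=['accuracy'])
```

Fine-tuning then amounts to unfreezing the top few layers of the base and continuing training with a very small learning rate.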


!!! Remove the notebooks below? Redundant? !!!

based on Google ML Practicum: Image Classification


TO DO: clean up everything below


Visualizing what convnets learn

based on chapter 5 Deep learning for computer vision of the book Deep learning with Python by F. Chollet


Some cool looking stuff

Based on Section 8.2 DeepDream and Section 8.3 Neural style transfer of the book Deep learning with Python by F. Chollet. I am not going to explain in detail how deep dream and neural style transfer work. I just wanted to include these notebooks to show you two cool examples of what can be done with deep neural networks.


Deep learning for computer vision (residual networks)

The goal is to introduce more advanced architectures and concepts. This is based on the Keras documentation: CIFAR-10 ResNet.

The relevant research papers are:

Notebooks

I have made several changes to the code from the Keras documentation. In the above notebook, I had to change the number of epochs and the learning rate schedule because the model is trained on only 40k examples and validated on 10k, whereas the model in the Keras documentation is trained on 50k and not validated at all. I wanted to have a situation that is similar to the one in HW 2 so that we can better compare the performance of the ResNet and the (normal) CNN.
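The core idea of a ResNet is the residual block: instead of learning a mapping F(x) directly, a block learns a correction to the identity, output = relu(F(x) + x). A minimal functional-API sketch of a basic block with an identity shortcut (assumes the input already has `filters` channels; the Keras documentation code handles the general case with projection shortcuts):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Basic ResNet block: output = relu(F(x) + x), where F is two conv layers."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.add([y, shortcut])   # the skip connection
    return layers.Activation('relu')(y)
```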


TensorFlow datasets

TensorFlow datasets is a collection of nearly 100 ready-to-use datasets that can quickly help build high-performance input data pipelines for training TensorFlow models. Instead of downloading and manipulating datasets manually and then figuring out how to read their labels, TensorFlow datasets standardizes the data format so that it's easy to swap one dataset for another, often with just a single line of code change. As you will see later on, doing things like breaking the dataset down into training, validation, and test sets is also a matter of a single line of code. The high-performance input data pipelines make it possible to work on the data in parallel. For instance, while the GPU is working on a batch of data, the CPU is prefetching the next batch.
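A minimal sketch of such a pipeline (dataset choice, split percentages, and buffer sizes are illustrative):

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Load MNIST, splitting off the last 20% of the train split for validation
# with a single line of code.
train_ds, val_ds = tfds.load('mnist', split=['train[:80%]', 'train[80%:]'],
                             as_supervised=True)

# High-performance input pipeline: shuffle, batch, and prefetch so the CPU
# prepares the next batch while the GPU works on the current one.
train_ds = (train_ds.shuffle(10000)
                    .batch(32)
                    .prefetch(tf.data.experimental.AUTOTUNE))
```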


  • one-shot learning
  • image similarity, face-recognition

Visualizing high-dimensional data using t-SNE
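A minimal sketch using scikit-learn's TSNE on a small sample of MNIST digits (sample size and plot styling are illustrative):

```python
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.manifold import TSNE

# Embed 2000 MNIST digits (784-dimensional vectors) into 2D.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x = x_train[:2000].reshape(2000, 784).astype('float32') / 255

x_2d = TSNE(n_components=2).fit_transform(x)
plt.scatter(x_2d[:, 0], x_2d[:, 1], c=y_train[:2000], cmap='tab10', s=5)
plt.colorbar()
plt.show()
```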


Text

Character-based

Word-based

  • Word embeddings
  • Using 1D convnets (TO DO)
  • Word embeddings (TO DO: change notebook !!!)
  • Newsgroup classification with convolutional model using pretrained Glove embeddings (TO DO)
  • IMDB sentiment classification with LSTM model (TO DO)
  • ...

One-shot learning


Variational Autoencoder


GANs

Sequence-to-sequence models


Reinforcement learning ???


Tools, additional materials
