One-shot learning with Siamese networks

Typical CNN classification methods end in a fully-connected layer with one neuron per class. This is suboptimal when the number of classes is large or changes over time.

In Siamese CNNs, we extract features from an image and convert them into an n-dimensional vector. We compare this n-dimensional vector with that of another image, and the model is trained such that images of the same class produce similar vectors.

By comparing an unknown image against labelled sample images, we can find the labelled image that is most similar to the unknown image and use its label as the classification result. This lets Siamese networks learn classification tasks from few training samples and generalize to any number of classes.

Architecture

Much like a typical CNN, a Siamese CNN has several convolutional layers followed by fully-connected layers. The convolutional layers extract features from an image, which the fully-connected layers then convert into a vector for comparison.

Training

When training a Siamese CNN, we input two images and a binary label indicating whether the two images are of the same class. The last layer of the CNN is a fully-connected layer, which produces an n-dimensional vector. From here on, the terms "output layer" and "output vector" are used interchangeably, and both refer to this layer. Depending on the label, the model then tries to minimize or maximize the distance between the vectors produced by the two images.

Note that both images pass through the same network. This means that the weights and biases applied to both images are identical throughout the training process.
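The weight sharing described above can be illustrated with a minimal sketch in plain Python. The `embed` function and the toy `WEIGHTS` and `BIASES` values are hypothetical stand-ins for the CNN, not code from this project; the only point is that both images go through the same parameters.

```python
# Toy stand-in for the shared network: one linear layer followed by ReLU.
# WEIGHTS and BIASES are hypothetical values, not taken from the project.
WEIGHTS = [[0.5, -0.2, 0.1],
           [0.3, 0.8, -0.5]]  # 2 outputs x 3 inputs
BIASES = [0.0, 0.1]


def embed(x):
    """Map an input vector to an output vector using the shared parameters."""
    out = []
    for row, b in zip(WEIGHTS, BIASES):
        z = sum(w * xi for w, xi in zip(row, x)) + b
        out.append(max(0.0, z))  # ReLU
    return out


def siamese_forward(x1, x2):
    """Pass both inputs through the same function, and hence the same weights."""
    return embed(x1), embed(x2)


# Identical inputs must yield identical output vectors.
v1, v2 = siamese_forward([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Because there is only one set of parameters, any gradient update computed from a pair changes the mapping for both inputs at once.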

Loss

In this project we experiment with two different kinds of loss functions. The loss is calculated based on the L1- or L2-distance between the outputs of the CNN (fully-connected layers) from the two images.
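As a quick illustration of the two distances mentioned above, here is a small sketch in plain Python. The function names are chosen for this example and are not taken from the project's code.

```python
import math


def l1_distance(u, v):
    """Sum of absolute differences between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def l2_distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


u = [0.0, 3.0]
v = [4.0, 0.0]
d1 = l1_distance(u, v)  # |0 - 4| + |3 - 0| = 7.0
d2 = l2_distance(u, v)  # sqrt(16 + 9) = 5.0
```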

Loss with spring

The first loss function is the contrastive ("spring") loss described in Dimensionality Reduction by Learning an Invariant Mapping. The implementation of the loss function follows an existing GitHub project.
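For reference, the loss in that paper is usually written as below. Here Y = 0 marks a similar pair and Y = 1 a dissimilar pair, D_W is the Euclidean distance between the two output vectors G_W(X_1) and G_W(X_2), and m > 0 is a margin. This is the paper's formulation; the project's implementation may use a different label convention or distance.

```latex
D_W(X_1, X_2) = \lVert G_W(X_1) - G_W(X_2) \rVert_2

L(W, Y, X_1, X_2) = (1 - Y)\,\tfrac{1}{2}\,D_W^2 \;+\; Y\,\tfrac{1}{2}\,\bigl(\max(0,\; m - D_W)\bigr)^2
```

The first term pulls similar pairs together like a spring at rest length zero. The second term pushes dissimilar pairs apart, but only while they are closer than the margin m.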

Sigmoid loss

The paper Siamese Neural Networks for One-shot Image Recognition uses a sigmoid loss for image recognition on the Omniglot dataset. The model architecture used in that paper is also the basis for our CNN for the Omniglot task.
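A rough sketch of this kind of loss is given below, assuming the common formulation in which a sigmoid is applied to a weighted L1 distance between the two output vectors and the result is trained with binary cross-entropy. The `alpha` weights and the `bias` term are hypothetical illustration values; in practice they would be learned, and the project's actual code may differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def similarity(h1, h2, alpha, bias=0.0):
    """Sigmoid of a weighted L1 distance between two output vectors."""
    z = sum(a * abs(x - y) for a, x, y in zip(alpha, h1, h2)) + bias
    return sigmoid(z)


def sigmoid_loss(p, y, eps=1e-12):
    """Binary cross-entropy; y = 1 for a same-class pair, 0 otherwise."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


alpha = [-1.0, -1.0]  # hypothetical weights (assumption for this example)
p_same = similarity([0.2, 0.4], [0.2, 0.4], alpha, bias=2.0)
p_diff = similarity([0.2, 0.4], [3.0, -2.0], alpha, bias=2.0)
```

With negative weights, a larger distance lowers the predicted probability that the pair is of the same class, so `p_same` comes out higher than `p_diff`.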

MNIST

We start with MNIST to test our implementation. The model was trained with learning_rate=1e-4 over 20,000 iterations. The training results for several architectures are summarized below:

commit_hash	conv. kernel size	accuracy	description
`983a8a8`	`3x3`	0.9758	2 layer FC + 2-neuron out
`df5d2b9`	`5x5`	0.9844	2 layer conv + 2 layer FC + 2-neuron out
`df5d2b9`	`3x3`	0.9856	2 layer conv + 2 layer FC + 2-neuron out
`3757780`	`3x3`	0.9890	2 layer conv + 2 layer FC (out)

Transfer learning

We first train a CNN on an MNIST classification task, achieving 99.37% accuracy on the test set. We then transfer the weights from the convolutional layers to the Siamese CNN before training the Siamese model with learning_rate=1e-4 over 10,000 iterations. This achieved a test accuracy of 98.99%, higher than the best accuracy (98.90%) attained without transfer learning.

Testing

For each ground truth image, we obtain its output vector from the model. Then, for each image being evaluated, we obtain its output vector as well and find the closest ground truth vector by L1 or L2 distance. The label of that closest ground truth image is the prediction.
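The nearest-neighbour step can be sketched as follows. The `ground_truth` dictionary, its labels, and its vectors are made-up illustration values; in the project the vectors would come from the trained model.

```python
def l1_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def classify(query_vec, ground_truth, dist=l1_distance):
    """Return the label whose ground truth vector is closest to query_vec."""
    return min(ground_truth, key=lambda label: dist(query_vec, ground_truth[label]))


# Hypothetical output vectors for three ground truth images.
ground_truth = {
    "0": [0.0, 0.0],
    "1": [1.0, 1.0],
    "7": [5.0, -1.0],
}

pred = classify([0.9, 1.2], ground_truth)  # closest to the vector for "1"
```

Passing a different `dist` function switches between L1 and L2 distance without changing the rest of the evaluation.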

Omniglot

The Omniglot dataset is typically used for one-shot learning, as it contains a large number of classes, with few training samples per class.

While the training and testing classes were the same in MNIST, the Omniglot dataset allows us to test the model on completely different classes from the ones used in training.

A random seed of 0 was set for both Python's built-in random library and TensorFlow.

Data

Training

Images in the images_background folder were used for training. For each class (e.g. Alphabet_of_the_Magi/character01), all possible combinations of pairs were appended to a list. For example, a class with 20 images yielded 20 choose 2 == 190 pairs.

n_samples pairs were then chosen at random from the possible pairs to form the training data for similar images. Subsequently, for each similar pair, we add a dissimilar pair by choosing two different classes at random and picking one image from each. This ensures that the number of similar pairs equals the number of dissimilar pairs.
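The pair construction described above can be sketched with the standard library alone. The `make_pairs` function, the toy class and image names, and the choice to sample similar pairs without replacement are assumptions made for this example; the project's own data loader may differ.

```python
import itertools
import random


def make_pairs(images_by_class, n_samples, seed=0):
    """Build n_samples similar pairs and as many dissimilar pairs."""
    rng = random.Random(seed)

    # All within-class combinations, e.g. 20 images -> 190 pairs per class.
    similar = []
    for images in images_by_class.values():
        similar.extend(itertools.combinations(images, 2))
    similar = rng.sample(similar, n_samples)  # assumption: no replacement

    # One dissimilar pair per similar pair, drawn from two different classes.
    classes = list(images_by_class)
    dissimilar = []
    for _ in similar:
        c1, c2 = rng.sample(classes, 2)
        dissimilar.append((rng.choice(images_by_class[c1]),
                           rng.choice(images_by_class[c2])))

    # Label 1 = same class, 0 = different classes.
    return ([(a, b, 1) for a, b in similar] +
            [(a, b, 0) for a, b in dissimilar])


# Hypothetical toy data: 3 classes with 20 image names each.
images_by_class = {
    f"class{c:02d}": [f"class{c:02d}_img{i:02d}" for i in range(20)]
    for c in range(3)
}
pairs = make_pairs(images_by_class, n_samples=100)
n_per_class = len(list(itertools.combinations(images_by_class["class00"], 2)))
```

In a real run the combined list would typically be shuffled before being fed to the model.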

Testing

Images in the images_evaluation folder were used for testing. We use 20 classes (Angelic/character{01-20}) for testing, and measure accuracy as the fraction of correct predictions.

Results

Loss with spring

model_name	n_samples	n_iterations	learning_rate	dist	accuracy
`fc1`	20 000	50 000	1e-5	L1	0.4025
`fc1`	20 000	50 000	1e-5	L2	0.4150
`fc1`	40 000	50 000	1e-5	L1	0.4000
`fc1`	40 000	50 000	1e-5	L2	0.4000
`fc1_reg1`	20 000	50 000	1e-5	L1	0.2700
`fc1_reg1`	20 000	50 000	1e-5	L2	0.2725
`fc2`	20 000	50 000	1e-5	L1	0.2875
`fc2`	20 000	50 000	1e-5	L2	0.2800

`fc1`

Single fully-connected layer with 4096 neurons.

`fc1_reg1`

Regularization with 2e-4 for convolutional layers.

`fc2`

Two fully-connected layers with 2048 neurons each, with dropout=0.5 between fc1 and fc2. The number of neurons was reduced because of out-of-memory (OOM) errors.
