Contrastive Learning#
Unsupervised, Semi-Supervised, and Self-Supervised Learning#
As we discussed in the last section, unsupervised learning is a type of machine learning that involves training models on data without labels. This is in contrast to supervised learning, where models are trained on labeled data, and reinforcement learning, where models are trained through trial and error.
Learning with incomplete or no labels can be further divided into unsupervised, semi-supervised, and self-supervised learning:
In unsupervised learning, models are trained on data without labels.
In semi-supervised learning, models are trained on a combination of labeled and unlabeled data.
In self-supervised learning, models are trained on labels that are generated automatically from the data itself, rather than provided by humans.
In this lecture, we will focus on self-supervised learning, and in particular, contrastive learning.
Pretext Tasks#
In self-supervised learning, models are trained on pretext tasks: tasks designed to provide a supervision signal without requiring human-labeled data. In solving the pretext task, the model learns useful representations of the data that can be transferred to downstream tasks.
Pretext tasks can take many forms, such as predicting the rotation of an image, predicting the relative position of patches in an image, or predicting the color of a grayscale image. The key idea is that the pretext task should be designed so that solving it forces the model to learn useful representations of the data.
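As a hypothetical sketch of the first of these, a rotation-prediction pretext task can be set up in a few lines of PyTorch: the "labels" are generated automatically by rotating each unlabeled image and asking a classifier to recover the rotation angle (the details below are illustrative, not a reference implementation).

```python
import torch

def make_rotation_batch(images):
    """Create a rotation-prediction pretext batch from unlabeled images.

    images: tensor of shape (N, C, H, W).
    Returns the rotated images and the rotation index (0, 1, 2, 3),
    i.e. 0, 90, 180 or 270 degrees, as automatically generated labels.
    """
    rotated, labels = [], []
    for img in images:
        k = torch.randint(0, 4, (1,)).item()               # pick a rotation at random
        rotated.append(torch.rot90(img, k, dims=(1, 2)))    # rotate in the H/W plane
        labels.append(k)
    return torch.stack(rotated), torch.tensor(labels)

# A classifier trained to predict these four "free" labels learns useful features.
```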
Contrastive Learning#
Contrastive learning is a technique for learning representations of data by contrasting similar and dissimilar pairs of data points. The idea is to learn a representation that brings similar data points closer together in the embedding space and pushes dissimilar data points further apart.
It is typically used in unsupervised learning and self-supervised learning, where the goal is to learn representations of data without the need for labeled data, and often as a pretraining (or pretext) step for supervised learning tasks.
The idea of pretraining is to learn a good representation of the data in an unsupervised manner, and then fine-tune this representation on a supervised task. This can help to improve the performance of the model on the supervised task, especially when labeled data is scarce.
A key element of contrastive learning is choosing the augmentation strategy. The augmentation strategy is used to create positive and negative pairs of data points for the contrastive loss. Positive pairs are pairs of data points that are similar, while negative pairs are pairs of data points that are dissimilar.
In choosing the augmentation strategy we are providing the model with a way to learn the invariances in the data. For example, in the case of images, we might use random crops, rotations, flips, and color distortions as augmentations. These augmentations help the model to learn to recognize objects regardless of their position, orientation, or color.
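As a sketch of what such an augmentation strategy might look like for images, torchvision can be used to produce two independently augmented views of the same image, which then form a positive pair (the specific transforms and parameter values here are illustrative, not a prescription):

```python
import torchvision.transforms as T

# Two random "views" of the same image form a positive pair; views of
# different images form negative pairs. Parameters are illustrative.
augment = T.Compose([
    T.RandomResizedCrop(224),                                    # random crop + resize
    T.RandomHorizontalFlip(),                                    # left/right invariance
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),   # color distortion
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),                              # blur invariance
    T.ToTensor(),
])

def make_positive_pair(pil_image):
    """Return two independently augmented views of the same image."""
    return augment(pil_image), augment(pil_image)
```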
In this way we are implicitly teaching the model about the underlying structure of the data, without needing to provide explicit labels. This is the key idea behind contrastive learning. Let’s look at two different approaches for this.
SimCLR#
A Simple Framework for Contrastive Learning of Visual Representations (SimCLR) is a contrastive learning technique that learns visual representations of images by contrasting similar and dissimilar pairs of images. The technique is based on the idea of learning a representation of the image that captures the underlying structure of the image.
It was introduced by Chen et al. in 2020 and has nearly 17,000 citations at the time of writing: https://arxiv.org/abs/2002.05709
SimCLR works by training a neural network to predict the similarity of pairs of images. The network is trained using a contrastive loss function, which encourages the network to learn representations that are close together for similar images and far apart for dissimilar images.
The result is a set of visual representations of the images that capture the underlying structure of the images. These representations can be used for a variety of tasks, such as image retrieval, image classification, and image segmentation.

The key components of SimCLR are:
Data augmentation: SimCLR uses a variety of data augmentation techniques to generate pairs of similar and dissimilar images. This helps the network to learn representations that are invariant to small changes in the input data.
Contrastive loss function: SimCLR uses a contrastive loss function to train the network to learn representations that are close together for similar images and far apart for dissimilar images. The contrastive loss function encourages the network to learn a meaningful representation of the images that captures the underlying structure of the images.
Data Augmentation#
Choosing the right data augmentation techniques is crucial for the success of SimCLR. The choice of data augmentation techniques will depend on the characteristics of the data and the goals of the analysis.
Some common data augmentation techniques used in SimCLR include random cropping, random flipping, random color distortion, and random Gaussian blur.
As already discussed, these augmentations encode the invariances that we want the model to learn, such as invariance to translation, rotation, and color changes. These might not all apply to your data!
Contrastive Loss Function#
Several loss functions were compared in the original SimCLR paper. The NT-Xent (normalized temperature-scaled cross-entropy) loss, a variant of the InfoNCE loss, was found to perform the best and is the loss function used in the SimCLR implementation:
$$ L_{i,j} = -\log\left(\frac{\exp(z_i \cdot z_j / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(z_i \cdot z_k / \tau)}\right) $$
where $z_i$ and $z_j$ are the representations of the positive pair of images (in practice L2-normalized, so the dot product is a cosine similarity), $\tau$ is a temperature parameter that controls the sharpness of the distribution, and $\mathbb{1}_{[k \neq i]}$ is an indicator function that is 1 if $k \neq i$ and 0 otherwise.
The model includes a projection head that maps the high-dimensional representations to a lower-dimensional space. This helps to improve the quality of the learned representations and makes the model more efficient. The projection head is typically a small neural network that consists of one or more fully connected layers with a non-linear activation function and is thrown away after training.
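A minimal sketch of the projection head and the NT-Xent loss above, written in PyTorch, might look like the following. The layer sizes and temperature are illustrative, and the official SimCLR implementation differs in details (e.g. distributed large-batch handling):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Small MLP mapping encoder features to the space where the loss is applied."""
    def __init__(self, in_dim=2048, hidden_dim=512, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        return self.net(h)


def nt_xent_loss(z1, z2, tau=0.5):
    """NT-Xent loss for a batch of N positive pairs (z1[i], z2[i])."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # 2N unit-norm embeddings
    sim = z @ z.T / tau                                       # pairwise similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                # exclude the k == i terms
    # the positive for row i is row i + N (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                      # -log softmax at the positive
```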
Tile2Vec#
Tile2Vec is another technique for learning visual representations of imagery using unsupervised learning. The technique is based on the idea of learning a representation of the spatial context of the image tiles, rather than the content of the tiles themselves.
You can read more about the algorithm in the original 2018 paper: https://arxiv.org/abs/1805.02855
Tile2Vec works by training a convolutional neural network to predict the spatial context of image tiles. For each training tile, a nearby tile and a distant tile are chosen to act as similar and dissimilar examples. A triplet loss is then used to pull the nearby tile close to the anchor in the embedding space and push the distant tile far away.
The result is a set of visual representations of the image tiles that capture the spatial context of the tiles. These representations can be used for a variety of tasks, such as image retrieval, image classification, and image segmentation.
Tile2Vec is a powerful technique for learning visual representations of satellite imagery and has been shown to outperform other techniques for satellite image retrieval and classification.
The Tile2Vec algorithm uses a triplet loss, where the network is trained to minimize the distance between similar pairs of data points and maximize the distance between dissimilar pairs of data points:
$$ L = \sum_{i=1}^{N} \max(0, \alpha + d(f(x_i), f(x_i^+)) - d(f(x_i), f(x_i^-))) $$
where $ d $ is a distance function, $ f $ is a function that maps the data points to the embedding space, $ x_i $ is a data point, $ x_i^+ $ is a similar data point, $ x_i^- $ is a dissimilar data point, and $ \alpha $ is a margin that separates the similar and dissimilar pairs.
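A minimal sketch of this triplet loss in PyTorch, following the formula above (the margin value is illustrative):

```python
import torch
import torch.nn.functional as F

def triplet_loss(z_anchor, z_neighbor, z_distant, margin=0.1):
    """Triplet (hinge) loss as in the formula above, using Euclidean distance.

    z_anchor, z_neighbor, z_distant: (N, D) embeddings of the anchor tiles,
    their spatial neighbors, and distant tiles respectively.
    """
    d_pos = torch.norm(z_anchor - z_neighbor, dim=1)   # d(f(x_i), f(x_i^+))
    d_neg = torch.norm(z_anchor - z_distant, dim=1)    # d(f(x_i), f(x_i^-))
    return F.relu(margin + d_pos - d_neg).sum()        # hinge at the margin alpha
```

PyTorch also provides `torch.nn.TripletMarginLoss`, which implements essentially the same hinge-on-distances idea.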
Applying Contrastive Learning to Remote Sensing#
Let’s take a look at an example of applying tile2vec to satellite imagery, and see how it can be used to learn visual representations of the image tiles.
One of the main focuses of my Lab is better understanding the role of clouds in the climate system. Clouds come in many shapes and sizes, and understanding their properties is crucial for improving climate models and predicting future climate change.
You might be familiar with different types of clouds, such as cumulus clouds, stratus clouds, and cirrus clouds. Each type of cloud has its own unique properties, such as shape, size, and altitude.

Only recently, though, have we started to explore the mesoscale morphology of clouds - the structure of clouds at scales of a few kilometers to a few hundred kilometers. This is important because the mesoscale morphology of clouds can have a significant impact on the Earth’s energy balance and climate - and we don’t really understand what controls it.
Bjorn Stevens’ lab recently published a paper classifying cloud types based on their mesoscale morphology. They used a large dataset of satellite images of clouds and trained a convolutional neural network to classify the images into different cloud types (using a lot of labeled data).
The classes they used were Sugar, Flower, Fish, and Gravel. These are not the traditional cloud types you might be familiar with, but rather the mesoscale morphology of the clouds:

In our Lab we have been using tile2vec on satellite imagery to learn visual representations of the mesoscale morphology of clouds without labels. The idea is to learn a representation of the spatial context of the cloud tiles, rather than the content of the tiles themselves.
We want to capture the important morphological features of the clouds, across a broad range of conditions and scales, so we use a false color representation of the images to highlight the different temperatures (and hence altitudes) of the clouds.
Here is an animation of the kind of data we’re using:

We’re using a very simple ResNet architecture to learn the representations:
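The exact network isn’t important here, but as a rough sketch, an encoder along these lines can be built from torchvision’s ResNet-18 with the classification layer swapped for an embedding layer (the embedding size and channel handling below are assumptions for illustration, not our exact setup):

```python
import torch.nn as nn
import torchvision.models as models

class TileEncoder(nn.Module):
    """ResNet backbone mapping an image tile to a fixed-length embedding.

    The backbone choice (ResNet-18) and embedding size are illustrative.
    """
    def __init__(self, embedding_dim=128, in_channels=3):
        super().__init__()
        backbone = models.resnet18(weights=None)       # train from scratch
        # adapt the first convolution if the tiles have more than 3 channels
        if in_channels != 3:
            backbone.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7,
                                       stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, embedding_dim)
        self.backbone = backbone

    def forward(self, x):
        return self.backbone(x)
```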

The key ingredient is how we choose the triplet - we pick a random tile from the training set, a nearby tile from the same image, and a dissimilar tile from a different image. This encodes the fact that we expect spatially nearby tiles to be similar, and distant tiles to be different (in terms of the morphology of the clouds).
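A hypothetical sketch of this sampling strategy in NumPy (the tile size and neighborhood radius are illustrative, not the values we actually use):

```python
import numpy as np

def sample_triplet(scenes, tile_size=64, neighborhood=100, rng=np.random):
    """Sample an (anchor, neighbor, distant) triplet of tiles.

    scenes: list of (H, W, C) satellite scenes. The neighbor is cropped within
    `neighborhood` pixels of the anchor in the same scene; the distant tile
    comes from a different, randomly chosen scene.
    """
    def crop(scene, y, x):
        return scene[y:y + tile_size, x:x + tile_size]

    def random_origin(scene, near=None):
        h, w = scene.shape[:2]
        if near is None:
            return rng.randint(0, h - tile_size + 1), rng.randint(0, w - tile_size + 1)
        # jitter the anchor position by at most `neighborhood` pixels
        y = int(np.clip(near[0] + rng.randint(-neighborhood, neighborhood + 1), 0, h - tile_size))
        x = int(np.clip(near[1] + rng.randint(-neighborhood, neighborhood + 1), 0, w - tile_size))
        return y, x

    i = rng.randint(len(scenes))                                  # scene for anchor + neighbor
    j = rng.choice([k for k in range(len(scenes)) if k != i])     # a different scene
    anchor_yx = random_origin(scenes[i])
    neighbor_yx = random_origin(scenes[i], near=anchor_yx)
    distant_yx = random_origin(scenes[j])
    return crop(scenes[i], *anchor_yx), crop(scenes[i], *neighbor_yx), crop(scenes[j], *distant_yx)
```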
For example:

By training on over 900,000 such triplets (and testing on a further 100,000), we can learn a representation of the mesoscale morphology of clouds that can be used for a variety of tasks.
The first thing we looked at was clustering the representations to see if we could identify different cloud types based on the morphology alone. We used k-means clustering to cluster the representations into distinct clusters, exploring the number of clusters from 9 to 30.
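As a sketch, the clustering step amounts to a single scikit-learn call on the learned embeddings (the file name and array shapes here are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# `embeddings` is assumed to be an (n_tiles, embedding_dim) array of learned
# representations; the file name is hypothetical.
embeddings = np.load("tile_embeddings.npy")

kmeans = KMeans(n_clusters=9, random_state=0, n_init=10).fit(embeddings)
cluster_labels = kmeans.labels_            # one cluster id per tile
```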
Already with N=9 we can see distinct clusters emerging:

We can plot a t-SNE visualization of the representations to see how well the clusters are separated in the embedding space:
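A rough sketch of how such a plot can be produced with scikit-learn and matplotlib, reusing the `embeddings` and `cluster_labels` from the clustering sketch above:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the high-dimensional embeddings down to 2D for visualization,
# coloring each point by its k-means cluster from the previous step.
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=cluster_labels, s=2, cmap="tab10")
plt.show()
```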

We can also look at how the clusters are located geographically to see if they correspond to different regions of the world:

As you can see, these initial clusters are primarily based on the altitude of the cloud (inferred from its temperature) and the underlying surface - we need more clusters to capture the full range of cloud types.
Another cool thing we can do is to find the nearest neighbors of a given tile in the embedding space. This is like a reverse image search, where we search for similar images based on their visual content:
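A sketch of this lookup using scikit-learn’s NearestNeighbors on the same `embeddings` array (the query index is arbitrary):

```python
from sklearn.neighbors import NearestNeighbors

# `embeddings` as in the previous sketches; query_idx picks an arbitrary query tile.
query_idx = 0
nn_index = NearestNeighbors(n_neighbors=6).fit(embeddings)
distances, indices = nn_index.kneighbors(embeddings[query_idx:query_idx + 1])
# indices[0][0] is the query tile itself; indices[0][1:] are its 5 nearest neighbors
```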

And, even better - we can interpolate between any two tiles in the embedding space to see how the morphology of the clouds changes between them:
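A sketch of such an interpolation, again reusing the `embeddings` array: we step along the straight line between the two embeddings and, at each step, pick the real tile whose embedding is closest to the interpolated point:

```python
import numpy as np

def interpolate_tiles(idx_a, idx_b, steps=8):
    """Walk linearly between two tiles in embedding space, returning at each
    step the index of the real tile closest to the interpolated point."""
    path = []
    for t in np.linspace(0.0, 1.0, steps):
        point = (1 - t) * embeddings[idx_a] + t * embeddings[idx_b]
        path.append(int(np.argmin(np.linalg.norm(embeddings - point, axis=1))))
    return path
```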

Finally, we used the learnt model weights as the backbone of a classifier, fine-tuned on a few hundred labeled examples, to predict the cloud type from the morphology of the clouds.
The resulting classification accuracies were:

| Backbone | Frozen weights | Unfrozen weights |
|---|---|---|
| Our pre-trained model | 0.4874 | 0.7395 |
| Off-the-shelf pre-trained ResNet | 0.5462 | 0.7059 |
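A rough sketch of how the pre-trained encoder can be reused as a backbone, with the option to freeze or fine-tune its weights (this assumes an encoder like the TileEncoder sketched earlier; the number of classes and the 128-d embedding size are illustrative assumptions):

```python
import torch.nn as nn

def build_classifier(encoder, num_classes=4, freeze_backbone=True):
    """Reuse a pre-trained encoder as the backbone of a cloud-type classifier.

    With freeze_backbone=True only the new linear head is trained ("frozen
    weights"); with False the whole network is fine-tuned ("unfrozen weights").
    """
    if freeze_backbone:
        for p in encoder.parameters():
            p.requires_grad = False          # only the head receives gradient updates
    return nn.Sequential(encoder, nn.Linear(128, num_classes))
```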