# Note on using Apex with multiple models and one optimizer.

This post is just a quick note on how to use Nvidia's Apex for pytorch with multiple models that somehow using a single optimizer.

I am currently working with a  classifier using a pre-trained backbone feature extractor(which needs to be finetuned as well). I could encapsulate them into one pytorch nn.module. But for some reason, I want to do ad-hoc modifications on the features. In my training loop it will looks like this:

So the problem is how to set up one optimizer for these two separated nn.modules, and how to initialize these two models one optimizer combo with apex.

The solution is quite simple:

# Wait... 24GB GPU memory is not enough? How to accumulate gradients in PyTorch

I was training the Nasnet-A-Large network on a 4 channel 512 by 512 images using PyTorch. Even with my beast GPU RTX Titan, I could only use a batch size of 8. The training is very volatile with that batch size, and I believe one way to combat that is to accumulate gradients for a few batches and then do a bigger update. Luckily, with PyTorch, it is very simple.

So, let's say below is your training loop:

We would only need a small modification to accumulate gradients:

The latter training code will accumulate gradients for 8 batches and do an update. Note that the backward pass is done on individual small batches still, this is crucial.

I originally implemented it as follows:

which is not only uglier but also not gonna work, because the backward step will cause GPU OOM for that the backward pass is on batch_size * num_batches images.

# Writing your own loss function/module for PyTorch

Yes, I am switching to PyTorch, and I am so far very happy with it.

Recently, I am working on a multilabel classification problem, where the evaluation metric is the macro f1 score. So, ideally, we would want the loss function to be aligned with our evaluation metric, instead of using standard BCE.

Initially, I was using the following function:

It is perfectly usable for the purpose of a loss function, like your typical training code:

Better, we can make it a PyTorch module, so that the usage is more like your typical PyTorch loss:

That is simply to put the original f1_loss function on to the forward pass of a simple module. As a result, I can explicitly put the module to GPU.

# Label Embedding in Multi-label classification

In the recent Kaggle competition, inclusive images challenge  I tried out label embedding technique for training multilabel classifiers, outlined in this paper by François Chollet.

The basic idea here is to decompose the pointwise mutual information(PMI) matrix from the training labels and use that to guide the training of the neural network model. The steps are as follow:

1. Encode training labels as you would with multilabel classification settings. Let $M$ (of size n by m, ie n training example with m labels) denote the matrix constructed by vertically stacking the label vectors.
2. The PMI (of size m by m) is a matrix with $PMI_{i,j}=log(\frac{P(i,j)}{P(i)*P(j)})$, it can be easily implemented via vectorized operations thus very efficient in computing, even on large datasets. See more explanation of the PMI here.
3. The embedding matrix $E$ is obtained by computing the singular value decomposition on PMI matrix and then take the dot product between $U$ and the first k columns of $\sqrt{\Sigma}$.
4. We then can use the embedding matrix to transform the original sparse encoded labels into dense vectors.
5. During the training of deep learning model, instead of using m sigmoid activations together with BCE loss in the end, now we can use k linear activation with cosine proximity loss.
6. During inference time, we take the model prediction and search in the rows from the embedding matrix $E$ and select the top similar vectors and find their corresponding labels.

Below is a toy example calculation of the label embedding procedure. The two pictures are the pairwise cosine similarity between item labels in the embedding space and a 2d display of items in the embedding space.

In my own experiments, I find the model trained on label embeddings are a bit more robust to label noises, it is faster in convergence and returns higher top k precision compared with models with logistic outputs.

I believe it is due to the high number of labels in the competition (m ~= 7000) problem contrasted with the small batches the model is trained on. As this label embedding is obtained from matrix factorization, it is similar to PCA that we keep crucial information and throw out some unnecessary detail/noise, except we are doing so on the labels instead of the inputs.

# Random states in multiprocessing, learnt a lesson after wasted a weeks GPU time.

I was recently training CNN on the openimages dataset using Keras. I am using a custom batch generator together with the .fit_generator() method in Keras, and observed super slow training progress.

My code looks something like this:

I wasted a lot of time debugging the model structure, loss, and optimizer, but the problem is much simpler. I eventually found it by printing out the indices been sampled.

The problem with the code is that when the generator gets duplicated on multiple workers, the random states also get copied, so the 8 workers have the same random state. As a result, during training, the model will see the exact same batch 8 times before seeing a new batch. The fix is easy, just insert a np.random.seed() before sampling the indices.