# Label Embedding in Multi-label classification

In the recent Kaggle competition, inclusive images challenge  I tried out label embedding technique for training multilabel classifiers, outlined in this paper by François Chollet.

The basic idea here is to decompose the pointwise mutual information(PMI) matrix from the training labels and use that to guide the training of the neural network model. The steps are as follow:

1. Encode training labels as you would with multilabel classification settings. Let $M$ (of size n by m, ie n training example with m labels) denote the matrix constructed by vertically stacking the label vectors.
2. The PMI (of size m by m) is a matrix with $PMI_{i,j}=log(\frac{P(i,j)}{P(i)*P(j)})$, it can be easily implemented via vectorized operations thus very efficient in computing, even on large datasets. See more explanation of the PMI here.
3. The embedding matrix $E$ is obtained by computing the singular value decomposition on PMI matrix and then take the dot product between $U$ and the first k columns of $\sqrt{\Sigma}$.
4. We then can use the embedding matrix to transform the original sparse encoded labels into dense vectors.
5. During the training of deep learning model, instead of using m sigmoid activations together with BCE loss in the end, now we can use k linear activation with cosine proximity loss.
6. During inference time, we take the model prediction and search in the rows from the embedding matrix $E$ and select the top similar vectors and find their corresponding labels.

Below is a toy example calculation of the label embedding procedure. The two pictures are the pairwise cosine similarity between item labels in the embedding space and a 2d display of items in the embedding space.

In my own experiments, I find the model trained on label embeddings are a bit more robust to label noises, it is faster in convergence and returns higher top k precision compared with models with logistic outputs.

I believe it is due to the high number of labels in the competition (m ~= 7000) problem contrasted with the small batches the model is trained on. As this label embedding is obtained from matrix factorization, it is similar to PCA that we keep crucial information and throw out some unnecessary detail/noise, except we are doing so on the labels instead of the inputs.