# Label Embedding in Multi-label classification

In the recent Kaggle competition, the Inclusive Images Challenge, I tried out the label embedding technique for training multilabel classifiers outlined in this paper by François Chollet.

The basic idea is to decompose the pointwise mutual information (PMI) matrix of the training labels and use the result to guide the training of the neural network model. The steps are as follows:

1. Encode the training labels as you would in a standard multilabel classification setting. Let $M$ (of size n by m, i.e. n training examples with m labels) denote the matrix constructed by vertically stacking the label vectors.
2. The PMI matrix (of size m by m) has entries $PMI_{i,j}=\log\left(\frac{P(i,j)}{P(i)P(j)}\right)$. It can be implemented with vectorized operations, so it is efficient to compute even on large datasets. See more explanation of the PMI here.
3. The embedding matrix $E$ is obtained by computing the singular value decomposition of the PMI matrix and then taking the dot product between $U$ and the first k columns of $\sqrt{\Sigma}$.
4. We can then use the embedding matrix to transform the original sparse encoded labels into dense vectors.
5. When training the deep learning model, instead of ending with m sigmoid activations and a BCE loss, we can now use k linear activations with a cosine proximity loss.
6. At inference time, we take the model prediction, compare it against the rows of the embedding matrix $E$, select the most similar rows, and return their corresponding labels.
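The steps above can be sketched in a few lines of NumPy. This is only a rough illustration, not the code I used in the competition: the function names (`pmi_matrix`, `label_embedding`, `decode`), the tiny label matrix and the `eps` smoothing term (added so that the log stays finite for labels that never co-occur) are all assumptions of the sketch.

```python
import numpy as np

def pmi_matrix(M, eps=1e-12):
    """PMI between labels, computed from an (n, m) binary label matrix M."""
    M = np.asarray(M, dtype=float)
    joint = (M.T @ M) / M.shape[0]   # P(i, j): share of examples carrying both labels
    marginal = M.mean(axis=0)        # P(i): share of examples carrying label i
    # eps is an assumption of this sketch: it keeps log() finite for zero counts
    return np.log((joint + eps) / (np.outer(marginal, marginal) + eps))

def label_embedding(M, k):
    """E = U times the first k columns of sqrt(Sigma), shape (m, k)."""
    U, S, _ = np.linalg.svd(pmi_matrix(M))
    return U[:, :k] * np.sqrt(S[:k])

def decode(pred, E, top_n=1):
    """Indices of the top_n labels whose embedding is most cosine-similar to pred."""
    sims = (E @ pred) / (np.linalg.norm(E, axis=1) * np.linalg.norm(pred) + 1e-12)
    return np.argsort(-sims)[:top_n]

# Toy data: labels 0 and 1 always appear together, and so do labels 2 and 3.
M = np.array([[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3)
E = label_embedding(M, k=2)
targets = M @ E                      # dense training targets, shape (n, k)
print(decode(targets[0], E, top_n=2))
```

In a real model the network would regress onto `targets` with a cosine loss, and `decode` would be applied to its predictions instead of to the targets themselves.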

Below is a toy example calculation of the label embedding procedure. The two pictures are the pairwise cosine similarity between item labels in the embedding space and a 2d display of items in the embedding space.

In my own experiments, I found that the model trained on label embeddings is a bit more robust to label noise, converges faster, and returns higher top-k precision than models with logistic outputs.

I believe this is due to the high number of labels in the competition (m ~= 7000) compared with the small batches the model is trained on. As the label embedding is obtained from a matrix factorization, it is similar to PCA in that we keep the crucial information and throw out some unnecessary detail/noise, except that we are doing so on the labels instead of the inputs.

# Random states in multiprocessing: a lesson learnt after wasting a week's GPU time

I was recently training a CNN on the Open Images dataset using Keras. I was using a custom batch generator together with the `.fit_generator()` method in Keras, and observed super slow training progress.

My code looks something like this:

I wasted a lot of time debugging the model structure, loss, and optimizer, but the problem was much simpler. I eventually found it by printing out the indices being sampled.

The problem with the code is that when the generator gets duplicated across multiple workers, the random state also gets copied, so all 8 workers have the same random state. As a result, during training, the model sees the exact same batch 8 times before seeing a new one. The fix is easy: just insert a call to `np.random.seed()` before sampling the indices.
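A minimal sketch of the failure mode and the fix, assuming the generator samples indices from NumPy's global random state. To keep it runnable without real worker processes, the copied state is simulated with `np.random.get_state()` / `np.random.set_state()`; the helper names and the sizes are made up for illustration.

```python
import numpy as np

def sample_batch_indices(n_samples, batch_size, reseed=False):
    """Sample one batch of indices, optionally re-seeding first."""
    if reseed:
        np.random.seed()  # no argument: re-seed from fresh OS entropy
    return np.random.choice(n_samples, size=batch_size, replace=False)

def simulate_workers(n_workers, reseed):
    """Mimic workers that each start from a copy of the parent's random state."""
    parent_state = np.random.get_state()
    batches = []
    for _ in range(n_workers):
        np.random.set_state(parent_state)  # every worker inherits the same state
        batches.append(sample_batch_indices(100000, 32, reseed=reseed))
    return batches

duplicated = simulate_workers(8, reseed=False)
print(all(np.array_equal(duplicated[0], b) for b in duplicated))  # same batch 8 times

fixed = simulate_workers(8, reseed=True)
print(len({tuple(b) for b in fixed}))  # number of distinct batches after the fix
```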

# Repost of my KDD Cup 2018 summary

I finished in 30th place at this year's KDD Cup. I still remember back in 2015, when I was very rusty with coding and attempted that year's KDD Cup with my potato laptop, a Lenovo U310. I did not know what I was doing; all I did was throw data into XGBoost, and my performance then was a joke. I have seen myself become more and more capable of coming up with ideas and implementing them during these two years. Below is a repost of my summary for KDD Cup 2018.

Hooray~! Fellow KDD competitors, I entered this competition on day 1 and very quickly established a reasonable baseline. Due to some personal matters, I practically stopped improving my solutions at the beginning of May. Even though my methods did not work really well compared to many top players in phase 2, I think my solution may be worth sharing due to its relative simplicity. I did not touch the meo data at all, and one of my models is just calculating medians.

### Alternative data source

For new hourly air quality data, as shared in the forum, I am using this for London and this for Beijing instead of the API from the organizer.

### Handling missing data

I filled missing values in air quality data with 3 steps:

1. Fill missing values for a station-measure combo based on the values from other stations.
   To be specific: I trained 131 lightgbm regressors for this. If the PM2.5 reading at 2:00 on May 20th is missing for the Beijing aotizhongxin station, the regressor aotizhongxin_aq-PM2.5 will predict this value based on the known PM2.5 readings at 2:00 on May 20th from the 34 other stations in Beijing.
   I used thresholds to decide whether to do this imputation or not. If more than the threshold number of stations are also missing a reading, this step is skipped.
2. Fill the remaining missing values by looking forward and backward to find known values.
3. Finally, replace all remaining missing values by overall mean value.
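Steps 2 and 3 can be sketched with pandas as below. This is an illustration under assumptions rather than my original code: I assume one column per station-measure combo with rows in time order, read "looking forward and backward" as a forward fill followed by a backward fill, and take the "overall mean" over the whole table. The function name `fill_remaining` is made up.

```python
import numpy as np
import pandas as pd

def fill_remaining(df):
    """Forward/backward fill each column, then fall back to the overall mean."""
    overall_mean = np.nanmean(df.to_numpy(dtype=float))  # mean of all known values
    filled = df.ffill().bfill()          # step 2: nearest known value in time
    return filled.fillna(overall_mean)   # step 3: columns with no known value at all

# Toy frame: column "a" has gaps, column "b" has no readings at all.
readings = pd.DataFrame({
    "a": [np.nan, 1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan, np.nan],
})
print(fill_remaining(readings))
```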

### Approaches

#### 1. median of medians

This is a simple model that worked reasonably well in this Kaggle competition.

To predict the PM2.5 reading at 2:00 on May 20th for aotizhongxin, look back over a window of several days of history and calculate the median of the 2:00 PM2.5 readings from aotizhongxin in that window. Repeat this median calculation for a bunch of different window sizes to obtain a bunch of medians. The median of those medians is used as the prediction.

Intuitively this is just an aggregated "yesterday once more". The more large windows there are in the collection, the better the model memorizes the long-term trend. The more small windows you add, the quicker the model responds to recent events.
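A small sketch of the calculation for one station, one measure and one hour of the day. The function name, the toy readings and the window sizes are made up for illustration and are not the values I used.

```python
from statistics import median

def median_of_medians(history, windows):
    """history: one reading per day for a fixed hour, oldest first.
    windows: look-back sizes in days."""
    window_medians = [median(history[-w:]) for w in windows]
    return median(window_medians)

# Seven days of 2:00 readings, with one spike five days ago.
history = [10, 12, 50, 14, 16, 18, 20]
prediction = median_of_medians(history, windows=(3, 5, 7))
print(prediction)  # → 18
```

The spike at 50 barely moves the result, which is a large part of why this simple model holds up.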

#### 2. Prophet

This is practically even simpler than the median of medians. I treated the number of days of history I throw at it and the model parameters `changepoint_prior_scale` and `n_changepoints` as the main hyperparameters and tweaked them. I did a bit of work to parallelize the fitting across all the station-measure combos to speed things up; other than that, it is pretty much out of the box.

I tried using holiday indicators and tweaking other parameters of the model, and they all degraded the performance.

#### 3. neural network

My neural network is a simple feed-forward network with a single shortcut. I shamelessly copied the structure from a senior colleague's Kaggle solution and tweaked the hidden layer sizes.
The model looks like this:

The input to the neural network is the concatenation of (1) raw history readings, (2) median summary values from different window sizes, and (3) indicator variables for the city and the type of measure.

The output layer in the network is a dense layer with 48 units, each corresponding to an hourly reading in the next 48 hours.

The model is trained directly with SMAPE as the loss function and the Adam optimizer. I tried standardizing the inputs to zero mean and unit variance, but that causes a problem when used together with the SMAPE loss, so I tried switching to a clipped version of MAE loss, which produced results similar to raw inputs with SMAPE loss.
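For reference, a NumPy sketch of the SMAPE metric, assuming the usual definition in which the absolute error is divided by the mean of the absolute actual and predicted values. The `eps` guard against a zero denominator is an assumption of this sketch; the loss actually used for training would need to be written with the framework's tensor operations instead.

```python
import numpy as np

def smape(y_true, y_pred, eps=1e-8):
    """Symmetric mean absolute percentage error, ranging from 0 to 2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    error = np.abs(y_true - y_pred)
    scale = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # eps avoids 0/0 when both the actual and the predicted value are zero
    return float(np.mean(error / np.maximum(scale, eps)))

print(smape([100.0], [300.0]))  # → 1.0
```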

The model can be trained on a CPU-only machine in a very short time.

I tried out some CNN and RNN models but couldn't get them to work better than this simple model, and had to abandon them.

### Training and validation setup

This is pretty tricky, and I am still not quite sure if I have done it correctly or not.

#### For approach 1 and 2

I generated predictions for a few historical months and calculated daily SMAPE scores locally. I then sampled 25 days to calculate a mean SMAPE score, repeated this sample-and-score step a large number of times, and took the mean as the local validation score. I used this score to select parameters.
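A rough sketch of that scoring loop. Sampling without replacement, the number of rounds and the fixed seed are assumptions of the sketch, and the function name is made up.

```python
import numpy as np

def resampled_validation_score(daily_scores, n_days=25, n_rounds=1000, seed=0):
    """Mean over many rounds of the mean SMAPE of n_days randomly sampled days."""
    rng = np.random.default_rng(seed)
    daily_scores = np.asarray(daily_scores, dtype=float)
    round_scores = [
        rng.choice(daily_scores, size=n_days, replace=False).mean()
        for _ in range(n_rounds)
    ]
    return float(np.mean(round_scores))

# Toy input: 60 made-up daily SMAPE scores between 0.2 and 0.6.
daily = np.linspace(0.2, 0.6, 60)
print(resampled_validation_score(daily))
```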

#### For neural network

I split the history data into (X, y) pairs based on a splitting day, and then move the splitting day backward by 1 day to generate another (X, y) pair. Do this 60 times and vertically concatenate them to form my training data.
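The sliding split can be sketched for a single hourly series as follows. The amount of history per example (`n_history`) and the function name are made up for illustration; the 48-hour target matches the output layer described earlier, and moving back one day is taken to mean 24 hourly readings.

```python
import numpy as np

def make_training_pairs(series, n_history, n_forecast=48, n_splits=60, hours_per_day=24):
    """Build (X, y) rows by moving the split point back one day at a time."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for d in range(n_splits):
        split = len(series) - n_forecast - d * hours_per_day
        if split - n_history < 0:
            break  # not enough history left for another pair
        X.append(series[split - n_history:split])   # readings before the split
        y.append(series[split:split + n_forecast])  # the next 48 hours
    return np.vstack(X), np.vstack(y)

# Toy series: 10 days of hourly readings numbered 0..239.
X, y = make_training_pairs(np.arange(240), n_history=72, n_splits=3)
print(X.shape, y.shape)  # → (3, 72) (3, 48)
```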

I used a grouped CV split on the concatenated dataset for cross-validation, so that measures from one station don't end up in both the training and validation sets. During training, the batch size is chosen so that all data in a batch is based on the same splitting day. I did this to try to prevent information leakage.

I got average SMAPE scores of 0.40-0.44 for Beijing and 0.30-0.34 for London in my local validation setting, which I think is pretty well aligned with how it averaged out through May.

### Closing

Without utilizing any other weather information or integrating any sort of forecasts, all my models failed miserably for events like the sudden peak on May 27th in Beijing.

# Build Tensorflow from source with GPU support after the meltdown kernel patch

So my Google Cloud compute instance with Nvidia-docker, which I used to train deep learning models, suddenly stopped working a couple of days ago. The reason seems to be related to the recent Ubuntu kernel update that was intended to solve the Meltdown issue. I found that the solution is to install a different kernel and also build TensorFlow from source. As a reminder for myself, here are the steps:

2. Install the dependencies for building TensorFlow, and also install CUDA if it is not already installed. Remember to reboot after installing CUDA.
3. Might need to do the following
5. Install cuDNN: go to https://developer.nvidia.com/cudnn and download the files, scp them to the machine, and install:
6. Get libcupti and Bazel
7. Get the TensorFlow source, build it with GPU support, and finally install it
8. You are good to go

# Running tensorflow with GPU on GCP VM through docker.

Recently, I have been working on the speech command recognition competition on Kaggle (https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/) by Google, and got \$500 of Google Cloud Platform credits. I am writing down rough instructions on how I set up my VM to run experiments with deep learning models on GCP.

1. First, request a quota increase to use a GPU. I requested usage of an Nvidia Tesla K80 under Zone us-east-1.
2. Create a VM instance in the requested zone, customize the VM to use the GPU, and also configure SSH access. Oh, I used Ubuntu 16.04 as the OS.
3. Log in to the VM and install the CUDA driver
4. Install docker community edition.
5. Install Nvidia-docker
6. Fire up bash and you are pretty much good to go. Note that notebooks is the default landing directory, and here you would want to specify a directory on your GCP VM to share with the container so that it can access your training data and write results to your VM disk.

That's pretty much it. On a side note, I found it strange that my code actually ran slower on my GCP VM using docker than on my home PC with just a 1070 card. My GCP VM's CPU is an old Haswell one (I tried to provision a Skylake one, but the GCP portal kept telling me there were not enough resources to create one for me...), so I suspect the training is slow because of the data augmentation in my data generator: the more powerful Tesla K80 sits idle waiting for batches to come in, and it is totally the fault of my crappy code...

It has been really long since I last posted anything here, but I am thinking about getting back more often.

In case you are curious about the pricing, here is a screenshot of my current billing page. My VM instance has an 8-core CPU, 30GB RAM, a 128GB SSD disk and of course an Nvidia Tesla K80.