# Label Embedding in Multi-label classification

In the recent Kaggle competition, inclusive images challenge  I tried out label embedding technique for training multilabel classifiers, outlined in this paper by François Chollet.

The basic idea here is to decompose the pointwise mutual information(PMI) matrix from the training labels and use that to guide the training of the neural network model. The steps are as follow:

1. Encode training labels as you would with multilabel classification settings. Let $M$ (of size n by m, ie n training example with m labels) denote the matrix constructed by vertically stacking the label vectors.
2. The PMI (of size m by m) is a matrix with $PMI_{i,j}=log(\frac{P(i,j)}{P(i)*P(j)})$, it can be easily implemented via vectorized operations thus very efficient in computing, even on large datasets. See more explanation of the PMI here.
3. The embedding matrix $E$ is obtained by computing the singular value decomposition on PMI matrix and then take the dot product between $U$ and the first k columns of $\sqrt{\Sigma}$.
4. We then can use the embedding matrix to transform the original sparse encoded labels into dense vectors.
5. During the training of deep learning model, instead of using m sigmoid activations together with BCE loss in the end, now we can use k linear activation with cosine proximity loss.
6. During inference time, we take the model prediction and search in the rows from the embedding matrix $E$ and select the top similar vectors and find their corresponding labels.

Below is a toy example calculation of the label embedding procedure. The two pictures are the pairwise cosine similarity between item labels in the embedding space and a 2d display of items in the embedding space.

In my own experiments, I find the model trained on label embeddings are a bit more robust to label noises, it is faster in convergence and returns higher top k precision compared with models with logistic outputs.

I believe it is due to the high number of labels in the competition (m ~= 7000) problem contrasted with the small batches the model is trained on. As this label embedding is obtained from matrix factorization, it is similar to PCA that we keep crucial information and throw out some unnecessary detail/noise, except we are doing so on the labels instead of the inputs.

# D3: Visualizing Titanic Survivors by Gender, Age and Class

Titanic: Machine Learning from Disaster is the 101 type of machine learning competition hosted on Kaggle since it started. The task is to predict who would survive the disaster given information on individual's age, gender, socio-economic status(class) and various other features.

Recently, during the winter break, I have started learning the JavaScript library D3.js. The above graph is a screenshot of my first visualization project I created with D3.js. The address to the live version of the project is here.

How to read the graph:

• Each rectangle in the graph represents a passenger on Titanic, color yellow means that the passenger survived the disaster and the color blue indicates that he does not.
• There can be multiple people with the same age, gender, and class values, so I set the opacity of these rectangles to be 20%. So the place on the graph where you can see solid yellow shows that those passengers have a higher chance of surviving, whereas solid blue indicates danger.

Based on this visualization we can see that:

1. females (young or old, except around age 25) and young males(under age 15) from middle and upper class tend to survive.
2. the overall survivor rate for female passengers is higher than male passengers.

So, without all the drama shown in the classic movie, this visualization basically predicts that Jack will most likely not able to make it, but Rose will survive...

Update: Jan 14

I made a couple of changes to the visualization during last few days. Now a newer version is available here.

# Make an Animation with R

In a recent project for my optimization and simulation class, I was tackling a portfolio optimization problem. I was asked to find a portfolio with the same return rate but lower portfolio variance. Using quadratic programming to obtain the minimal variance portfolio is not an issue for me.

The target portfolio has a expected return of 2.22% and the associated portfolio standard deviation is 3.31%. Using optimization, I found a portfolio maintaining the same return rate but a lower portfolio standard deviation of 2.85%.

Here is a plot comparing the simulated return distributions for these two portfolios.

After I got this plot, I was thinking about a way to show the differences in a dynamic way, better if I can make an animation for it and I wish to do so in R.  I found that the slope diagram might be a good chart type for this task and I found an R package called animation for creating animation within R.

There is no default geom in ggplot for creating the slope diagram, however, I found that one can create slope diagram using geom_segment().

Here is an example of slope diagram.  Basically, the vertical position determines the portfolio returns and for each iteration in the simulation we add a line segment connecting the two different portfolio return values from both sides and coloring the line segment.

To use the animation package, one need to write a loop or a function that can plot all the plots needed to generate the animation in time sequence order.

You can define your own function to make all the plots needed for your animation. And once you have done this, you can use the saveVideo() function from the animation package to generate a .mp4 animation.

To do so, you just need to specify your operation system and the path to ffmpeg executable in your system and then calling saveVideo() by passing the plot making function along with some parameters for setting up the video quality.

Here is the animation created using the above code.

The animation dynamically showing the performance difference between two portfolios, while the minimal variance portfolio has a lower chance of losing a huge amount of money, it also gives up some chance to earning large returns. This is depicted by the difference between density of points on the lower and upper parts in two sides. Also, the amount of red color and green color are approximately equal, suggesting that half of the time the minimal variance will yield a lower return than the target portfolio.

There are many other options in the animation package. You can use it to generate GIF, HTML or SWF animations with similar method.

# Kaggle Digit Recognizer Revisited (Using R)

## About the competition

Digit Recognizer is a competition that has been hosted on Kaggle for years( almost three and a half years so far?). The data for this competition were taken from the famous MNIST dataset, which has been extensively studied within the machine learning community( here is more information on that).

I tried the competition back a year ago with very naive method( basically a wrong way of performing PCA and Random Forest with hand picked hyperparameter values), and recently during the thanksgiving break, I got some time and revisited the competition to see if I can improve on my past results.

## About the data set

The training and testing datasets are all .csv files. Not big, but using fread() function from data.table package can be a bit faster.

## Look at the dimensions of the data set

The number of observations in the training set is n = 42000 and the dimension of the feature space  is p = 784.

Looked at these numbers, I thought that the dimension of features is somewhat high and we might want to perform PCA on it.

Note, in the textbook, high dimensional problems refer to the situations when p is greater than n, so not the case here technically.

Each observation represents a 28 by 28 pixels grey scale image, and each cell value is an integer between 0 and 255. Our task is to build a model to predict what digit(0 to 9) each image is representing.

The labels are pretty balanced in the training set.

## Visualise the images in the dataset

We need to reshape the observations(each is a single row with 784 values) back into a 28 by 28 matrix in order to display the images.

We can do so by writing a simple function to perform that transformation.

Here is another observation:

## Try a K-NN model first

Before directly go and perform PCA, I tried a K-NN model on the dataset first.

The K-NN model is simply to look at the K points nearest to the one we want to make a prediction on by in some sense taking an average of the K nearest neighbors' value as the predicted value.

The model is usually used to demonstrate the problem called the curse of dimensionality, that is when the dimensionality of the feature space is high then the feature space is going to be very sparse and, as a result, the distances between any pair of data points are going to be very long. So for any given point in the feature space, the region(which should be a small local region ideally) that tightly bound the K nearest points to the given point is going to be very large and thus became a problem for K-NN model.

Ok, I didn't think about all these complicated stuff last year when I did it. I built the K-NN model simply because it is the easiest to code up one and I was lazy and always wanted to see how simple model works.

So it took two and a half hour(the runtime is recalculated using my newer laptop this year, the same to all the runtime in this post) to build the K-NN model on the full feature space with k = 50. And actually the result is not bad, this model will give a leader board score of 0.94743. But it is quite long to build such a single model. If we would use K-Folds Cross-Validation to choose the optimal k value, it will be days if not weeks for a handful pick of k values.

## PCA the wrong way

Last year, when I performed PCA on the dataset, I actually merged the training features and testing features into a single dataset, calculated PCA and transformed the merged data set onto the first 100 PC directions.

The percentage of variance explained by the first 100 principal components is about 91.5%, and I transform the merged data set down to 100 dimensions.

I know now this is actually cheating because I was implicitly using the information of the testing set for the model building process. I should only use the training set to get the principal component directions and then transform the training and testing data using the directions learned this way.

But anyway the result showed that we can use PCA to greatly reduce the dimension of the features without loss of much information.

Oh, I did make a plot to visualise the whole dataset using the first two principal components.

## A Random Forest fitted to the leaderboard

When building the Random Forest model last year, I actually tried the trial and error to make the leaderboard happy... That is to manually change the values of mtry and ntree and see how the Kaggle leaderboard result would change.

The score is 0.9510 which is not bad, and an improve over the K-NN model.

And this is how far I went last year( March 2014 to be specific).

## The right way to perform PCA

The right way to perform PCA on this problem is to learn PCs using the training set and project both the training set and testing set onto the first some number of principal component directions.

And things will become complicated when we do K-Folds Cross-Validation, since that we should perform PCA only on the K-1 folds to learn the directions and transform all the K folds on to that first some number of PC directions before we build model on the transformed K-1 folds data and finally gain the estimated out of sample error rate for this run, So the run time for K-Folds Cross-Validation now is even worse since we need to do PCA in each iteration.

But it maybe not that bad if training a Random Forest is not that slow, but how can we know how long it will take? We can analysis the runtime.

## Understand the runtime of Random Forest

This thanksgiving break, I wanted to know how long it will take to build a model on the full training set with a large number of trees in a Random Forest.

I did so doing some experiments on a grid of values of the number of training observations and the number of trees in the Random Forest model and tried to understand the time complexity for building a Random Forest model with the default mtry value.

I plotted two sets of plots

So based on this two plots, if we fix the number of training observations(of course mtry and depth of trees are also fixed because they are left to be the default values), then the training time is linear in the number of trees in the forest. And it is the same linear relationship between training time and the number of observations used in training if the number of trees is fixed.

I guess the runtime for the Random Forest model is O(n*ntree) if mtry and depth of trees are fixed.

Actually, I later on found a post on Quora which states that the running time is O(n*ntree*d*mtry).

So I built a regression model using the interaction term ntree*ntrain as the independent variable to predict the training time by heavily extrapolating...(which is bad for regression analysis, in general, but might be OK for runtime analysis purpose)

So the model predicts that if I want to build a Random Forest with 5000 trees on full features, it will take roughly two and a half hours. Which is about as slow as building the K-NN model, tuning the parameters using K-Folds Cross-Validation may be too slow.

## Feature Engineering

This year I also tried to create some features to help the models, I made them based on my intuition about how a human - if he does not know how to distinguish them- would find useful for this task if he is trained by other people showing them examples of digits.

firstHorizontalPix and lastHorizontalPix are to capture the row number where the first or last non-background color pixel appears in the image. Similarily, firstVerticalPix and lastVerticalPix are to capture the column number where the first or last non-background color pixel appears in the image. After randomly plotted many digits in the dataset I got the intuition that digit tends to be shorter than digit 1 in this dataset, and, of course, digit 7 tends to be fatter than digit 1. So I also counted the number of non-empty rows with numberHorizontalLines and the number of non-empty columns with numberVerticalLines.

densityUpperLeftdensityUpperRightdensityBottomLeft and densityBottomRight are created to capture how dense the pixel values are in each part of the image, And densityUpper, densityBottom, densityLeft, densityRight are just some partial sum of them. My intuition is that for digits like 8,0,1,3, the pixel value density on the upper area is going to be close to the bottom part, but it will be different for digits like 7,6,2,5,4 and 9. And it is the same kind of logic behind these sets of features I created.

symmetryHorizontal and symmetryVertical are measuring the dregree of symmetry in the image.

horizontalDiffs and verticalDiffs describes how tall and how fat each digit is.

## Feature Importance

I got the feature importance using Random Forest and here is a plot of that.

It seems that the features I created are somewhat important. To see it clearly, here is the list of the top 20 features.

## Cross-Validation with PCA

Even we are down to a little bit more than 100 features only, but tuning mtry and ntree or even depth with K-Folds Cross-Validation together with PCA in each iteration is time consuming. I coded it up, but never really run it.

Here is the code for choosing the mtry value for a fixed number of trees using such a method.

## Choosing mtry with OOB Error estimate

What I actually did is to look at the OOB Error estimates.

When each of the decision tree is built in the Random Forest, the training data for that decision is a bootstrap sample of the original dataset, on average it has 2/3 unique values from the original training set. So we can estimate the out-of-sample error rate with the 1/3 out-of-bag data. This is a valid way of measuring the out-of-sample error rate, and is much faster to do for Random Forest when compared with K-Folds Cross-Validation.

I searched a grid of mtry and ntree values using the OOB method, and generated a plot.

It seems that for higher mtry value the ntree  value required for the OOB Error starts to converge is lower. It looks to me that picking mtry = 8, ntree = 1000 is a reasonable choice.

I am not sure if the one-standard error rule would apply here or not, but after the calculation, I found that for mtry = 8 the minimum number of trees needed so that the OOB Error is no higher than one standard error from the minimum value is 500.

So in a much longer searching time, I found the combination of mtry = 8 and ntree = 500, which is almost exactly the same as the ones I hand picked last year....

The hyper parameter values are almost identical, however I made some new features which I didn't have last year, would these looked promising features really help in terms of making the leaderboard happier?

Not really... the leaderboard score drops to 0.94329.

## Closing

It may seem that I did a lot things in vain : < , however I think they are worth trying. No one can be lucky in every model he builds.

# VIDEO 3 - A Basic Line Plot

## make the x values as ordered categorical data type won't help, it is

probably due to the difference in implementation ggplot in Python.

# VIDEO 4 - Adding the Hour of the Day

## Create out plot/Change the colors

ggplot in python does not support aes(group=Var) yet

Also the ggplot version I got using pip does not seem to plot legend correctly.

## Make a heatmap:

Struggled to plot heatmap using ggplot in Python without success. Turning to the old friend matplotlib

## VIDEO 5 - Maps

Given up on it...