Kaggle Digit Recognizer Revisited (Using R)

About the competition

Digit Recognizer is a competition that has been hosted on Kaggle for years( almost three and a half years so far?). The data for this competition were taken from the famous MNIST dataset, which has been extensively studied within the machine learning community( here is more information on that).

I tried the competition back a year ago with very naive method( basically a wrong way of performing PCA and Random Forest with hand picked hyperparameter values), and recently during the thanksgiving break, I got some time and revisited the competition to see if I can improve on my past results.

About the data set

The training and testing datasets are all .csv files. Not big, but using fread() function from data.table package can be a bit faster.

Look at the dimensions of the data set

The number of observations in the training set is n = 42000 and the dimension of the feature space  is p = 784.

Looked at these numbers, I thought that the dimension of features is somewhat high and we might want to perform PCA on it.

Note, in the textbook, high dimensional problems refer to the situations when p is greater than n, so not the case here technically.

Each observation represents a 28 by 28 pixels grey scale image, and each cell value is an integer between 0 and 255. Our task is to build a model to predict what digit(0 to 9) each image is representing.

The labels are pretty balanced in the training set.

Visualise the images in the dataset

We need to reshape the observations(each is a single row with 784 values) back into a 28 by 28 matrix in order to display the images.

We can do so by writing a simple function to perform that transformation.


Here is another observation:


Try a K-NN model first

Before directly go and perform PCA, I tried a K-NN model on the dataset first.

The K-NN model is simply to look at the K points nearest to the one we want to make a prediction on by in some sense taking an average of the K nearest neighbors' value as the predicted value.

The model is usually used to demonstrate the problem called the curse of dimensionality, that is when the dimensionality of the feature space is high then the feature space is going to be very sparse and, as a result, the distances between any pair of data points are going to be very long. So for any given point in the feature space, the region(which should be a small local region ideally) that tightly bound the K nearest points to the given point is going to be very large and thus became a problem for K-NN model.

Ok, I didn't think about all these complicated stuff last year when I did it. I built the K-NN model simply because it is the easiest to code up one and I was lazy and always wanted to see how simple model works.

So it took two and a half hour(the runtime is recalculated using my newer laptop this year, the same to all the runtime in this post) to build the K-NN model on the full feature space with k = 50. And actually the result is not bad, this model will give a leader board score of 0.94743. But it is quite long to build such a single model. If we would use K-Folds Cross-Validation to choose the optimal k value, it will be days if not weeks for a handful pick of k values.

PCA the wrong way

Last year, when I performed PCA on the dataset, I actually merged the training features and testing features into a single dataset, calculated PCA and transformed the merged data set onto the first 100 PC directions.


The percentage of variance explained by the first 100 principal components is about 91.5%, and I transform the merged data set down to 100 dimensions.

I know now this is actually cheating because I was implicitly using the information of the testing set for the model building process. I should only use the training set to get the principal component directions and then transform the training and testing data using the directions learned this way.

But anyway the result showed that we can use PCA to greatly reduce the dimension of the features without loss of much information.

Oh, I did make a plot to visualise the whole dataset using the first two principal components.


A Random Forest fitted to the leaderboard

When building the Random Forest model last year, I actually tried the trial and error to make the leaderboard happy... That is to manually change the values of mtry and ntree and see how the Kaggle leaderboard result would change.

The score is 0.9510 which is not bad, and an improve over the K-NN model.

And this is how far I went last year( March 2014 to be specific).

The right way to perform PCA

The right way to perform PCA on this problem is to learn PCs using the training set and project both the training set and testing set onto the first some number of principal component directions.

And things will become complicated when we do K-Folds Cross-Validation, since that we should perform PCA only on the K-1 folds to learn the directions and transform all the K folds on to that first some number of PC directions before we build model on the transformed K-1 folds data and finally gain the estimated out of sample error rate for this run, So the run time for K-Folds Cross-Validation now is even worse since we need to do PCA in each iteration.

But it maybe not that bad if training a Random Forest is not that slow, but how can we know how long it will take? We can analysis the runtime.

Understand the runtime of Random Forest

This thanksgiving break, I wanted to know how long it will take to build a model on the full training set with a large number of trees in a Random Forest.

I did so doing some experiments on a grid of values of the number of training observations and the number of trees in the Random Forest model and tried to understand the time complexity for building a Random Forest model with the default mtry value.

I plotted two sets of plots



So based on this two plots, if we fix the number of training observations(of course mtry and depth of trees are also fixed because they are left to be the default values), then the training time is linear in the number of trees in the forest. And it is the same linear relationship between training time and the number of observations used in training if the number of trees is fixed.

I guess the runtime for the Random Forest model is O(n*ntree) if mtry and depth of trees are fixed.

Actually, I later on found a post on Quora which states that the running time is O(n*ntree*d*mtry).

So I built a regression model using the interaction term ntree*ntrain as the independent variable to predict the training time by heavily extrapolating...(which is bad for regression analysis, in general, but might be OK for runtime analysis purpose)

So the model predicts that if I want to build a Random Forest with 5000 trees on full features, it will take roughly two and a half hours. Which is about as slow as building the K-NN model, tuning the parameters using K-Folds Cross-Validation may be too slow.

Feature Engineering

This year I also tried to create some features to help the models, I made them based on my intuition about how a human - if he does not know how to distinguish them- would find useful for this task if he is trained by other people showing them examples of digits.

firstHorizontalPix and lastHorizontalPix are to capture the row number where the first or last non-background color pixel appears in the image. Similarily, firstVerticalPix and lastVerticalPix are to capture the column number where the first or last non-background color pixel appears in the image. After randomly plotted many digits in the dataset I got the intuition that digit tends to be shorter than digit 1 in this dataset, and, of course, digit 7 tends to be fatter than digit 1. So I also counted the number of non-empty rows with numberHorizontalLines and the number of non-empty columns with numberVerticalLines.

densityUpperLeftdensityUpperRightdensityBottomLeft and densityBottomRight are created to capture how dense the pixel values are in each part of the image, And densityUpper, densityBottom, densityLeft, densityRight are just some partial sum of them. My intuition is that for digits like 8,0,1,3, the pixel value density on the upper area is going to be close to the bottom part, but it will be different for digits like 7,6,2,5,4 and 9. And it is the same kind of logic behind these sets of features I created.

symmetryHorizontal and symmetryVertical are measuring the dregree of symmetry in the image.

horizontalDiffs and verticalDiffs describes how tall and how fat each digit is.

Feature Importance

I got the feature importance using Random Forest and here is a plot of that.


It seems that the features I created are somewhat important. To see it clearly, here is the list of the top 20 features.

Cross-Validation with PCA

Even we are down to a little bit more than 100 features only, but tuning mtry and ntree or even depth with K-Folds Cross-Validation together with PCA in each iteration is time consuming. I coded it up, but never really run it.

Here is the code for choosing the mtry value for a fixed number of trees using such a method.

Choosing mtry with OOB Error estimate

What I actually did is to look at the OOB Error estimates.

When each of the decision tree is built in the Random Forest, the training data for that decision is a bootstrap sample of the original dataset, on average it has 2/3 unique values from the original training set. So we can estimate the out-of-sample error rate with the 1/3 out-of-bag data. This is a valid way of measuring the out-of-sample error rate, and is much faster to do for Random Forest when compared with K-Folds Cross-Validation.

I searched a grid of mtry and ntree values using the OOB method, and generated a plot.


It seems that for higher mtry value the ntree  value required for the OOB Error starts to converge is lower. It looks to me that picking mtry = 8, ntree = 1000 is a reasonable choice.

I am not sure if the one-standard error rule would apply here or not, but after the calculation, I found that for mtry = 8 the minimum number of trees needed so that the OOB Error is no higher than one standard error from the minimum value is 500.

So in a much longer searching time, I found the combination of mtry = 8 and ntree = 500, which is almost exactly the same as the ones I hand picked last year....

The hyper parameter values are almost identical, however I made some new features which I didn't have last year, would these looked promising features really help in terms of making the leaderboard happier?

Not really... the leaderboard score drops to 0.94329.


It may seem that I did a lot things in vain : < , however I think they are worth trying. No one can be lucky in every model he builds.

Leave a Reply

Your email address will not be published. Required fields are marked *