About the competition
Digit Recognizer is a competition that has been hosted on Kaggle for years (almost three and a half years so far). The data for this competition come from the famous MNIST dataset, which has been studied extensively within the machine learning community (here is more information on that).
I tried the competition about a year ago with a very naive method (basically a wrong way of performing PCA, plus a Random Forest with hand-picked hyperparameter values). Recently, during the Thanksgiving break, I found some time to revisit the competition and see if I could improve on my past results.
About the data set
The training and testing datasets are both .csv files. They are not big, but using the fread() function from the data.table package is still a bit faster than the base read.csv().

library(data.table)
train <- as.data.frame(fread("data/train.csv"))
train$label <- as.factor(as.character(train$label))
test <- as.data.frame(fread("data/test.csv"))
Look at the dimensions of the data set

dim(train)
dim(test)
table(train$label)

## [1] 42000   785
## [1] 28000   784
##
##    0    1    2    3    4    5    6    7    8    9
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188
The number of observations in the training set is n = 42000 and the dimension of the feature space is p = 784.
Looking at these numbers, I thought the feature dimension was somewhat high, and that we might want to perform PCA on it.
Note that in the textbook, high-dimensional problems refer to situations where p is greater than n, so technically that is not the case here.
Each observation represents a 28 by 28 pixel greyscale image, and each cell value is an integer between 0 and 255. Our task is to build a model that predicts which digit (0 to 9) each image represents.
The labels are pretty balanced in the training set.
Visualise the images in the dataset
We need to reshape each observation (a single row of 784 values) back into a 28 by 28 matrix in order to display the image.
We can do so by writing a simple function to perform that transformation.

rotate <- function(x) t(apply(x, 2, rev))
m <- rotate(matrix(unlist(train[1, -1]), nrow = 28, byrow = T))
image(m, col = grey.colors(255))
Here is another observation:

m <- rotate(matrix(unlist(train[40462, -1]), nrow = 28, byrow = T))
image(m, col = grey.colors(255))
Try a KNN model first
Before going straight to PCA, I first tried a KNN model on the dataset.
The KNN model simply looks at the K points nearest to the one we want to predict and, in some sense, takes an average of those K nearest neighbors' values as the prediction.
The model is often used to demonstrate the curse of dimensionality: when the dimensionality of the feature space is high, the feature space becomes very sparse and, as a result, the distances between any pair of data points become large. So for any given point, the region that tightly bounds its K nearest neighbors (which ideally should be a small local region) becomes very large, and that is a problem for the KNN model.
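This distance concentration is easy to see numerically. Here is a minimal sketch (the 100 sample points and the dimensions 2, 100 and 784 are arbitrary choices for illustration): as p grows, the nearest pair of points is barely closer than an average pair.

```r
# As dimension p grows, pairwise distances between random points concentrate:
# the minimum distance approaches the mean distance, so "nearest" neighbours
# are no longer meaningfully local.
set.seed(1)
for (p in c(2, 100, 784)) {
  X <- matrix(runif(100 * p), nrow = 100)   # 100 random points in [0,1]^p
  d <- dist(X)                              # all pairwise Euclidean distances
  cat(sprintf("p = %3d: min/mean distance ratio = %.2f\n", p, min(d) / mean(d)))
}
```

For p = 2 the ratio is close to zero, while for p = 784 it gets close to one, which is exactly the sparsity problem described above.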
OK, I didn't think about any of this complicated stuff last year. I built the KNN model simply because it was the easiest one to code up, I was lazy, and I always want to see how a simple model does.

library(class)
t0 <- Sys.time()
knn.model <- knn(train[, 2:785], test, cl = train[, 1], k = 50)
print(Sys.time() - t0)
submit <- data.frame(ImageId = seq(1, 28000), Label = knn.model)
write.csv(submit, file = "simpleKnn.csv", row.names = F)

## Time difference of 2.56604 hours 
So it took about two and a half hours (the runtime was remeasured on my newer laptop this year, as were all the runtimes in this post) to build the KNN model on the full feature space with k = 50. The result is actually not bad: this model gives a leaderboard score of 0.94743. But that is a long time for a single model. If we wanted to use K-fold cross-validation to choose the optimal k, it would take days if not weeks even for a handful of candidate k values.
PCA the wrong way
Last year, when I performed PCA on this dataset, I actually merged the training and testing features into a single dataset, computed the PCA, and projected the merged data onto the first 100 principal component directions.

data <- rbind.data.frame(train[, 2:ncol(train)], test)
dim(data)
data.pca <- prcomp(data)
pve <- data.pca$sdev^2 / sum(data.pca$sdev^2)
cumsum(pve)[100]
plot(pve, pch = 19, type = "l")
The proportion of variance explained by the first 100 principal components is about 91.5%, so I projected the merged dataset down to 100 dimensions.

nt <- nrow(train)   # number of training observations
Y <- train$label
smallTrain <- as.data.frame(data.pca$x[1:nt, 1:100])
smallTest <- as.data.frame(data.pca$x[(nt + 1):nrow(data), 1:100])
smallTrain$label <- Y
I now know this is actually cheating, because I was implicitly using information from the testing set during model building. I should have used only the training set to learn the principal component directions, and then transformed both the training and testing data with those directions.
But in any case, the result showed that PCA can greatly reduce the feature dimension without losing much information.
Oh, and I made a plot to visualise the whole dataset using the first two principal components.

library(ggplot2)
ggplot(data = smallTrain, aes(x = PC1, y = PC2)) +
  geom_point(aes(color = label), size = 1, alpha = 0.5) +
  xlab("first principal component") +
  ylab("second principal component")
A Random Forest fitted to the leaderboard

library(randomForest)
rf <- randomForest(label ~ ., data = smallTrain, mtry = 9, ntree = 500)
submit <- data.frame(ImageId = seq(1, nrow(smallTest)),
                     Label = predict(rf, smallTest, type = "class"))
write.csv(submit, file = "PCA_randomForest.csv", row.names = F)
When building the Random Forest model last year, I basically used trial and error to make the leaderboard happy... That is, I manually changed the values of mtry and ntree and watched how the Kaggle leaderboard score changed.
The score was 0.9510, which is not bad and an improvement over the KNN model.
And that is as far as I went last year (March 2014, to be specific).
The right way to perform PCA
The right way to perform PCA on this problem is to learn the principal components from the training set only, and then project both the training set and the testing set onto the first several principal component directions.

train.pca <- prcomp(train[, -1])
smallTrain <- as.data.frame(train.pca$x[, 1:100])
smallTest <- as.data.frame(predict(train.pca, test)[, 1:100])
smallTrain$label <- train$label
Things get more complicated when we do K-fold cross-validation: we should perform PCA only on the K-1 training folds to learn the directions, project all K folds onto the first several PC directions, build the model on the transformed K-1 folds, and only then obtain the estimated out-of-sample error rate for that run. So the runtime of K-fold cross-validation is now even worse, since we need to redo the PCA in each iteration.
It may not be that bad if training a Random Forest is fast enough, but how can we know how long it will take? We can analyse the runtime.
Understand the runtime of Random Forest
Over this Thanksgiving break, I wanted to know how long it would take to build a model on the full training set with a large number of trees in the Random Forest.
I ran some experiments over a grid of values for the number of training observations and the number of trees, trying to understand the time complexity of building a Random Forest model with the default mtry value.

library(randomForest)
dt <- ntrain <- ntrees <- vector()
numTrain <- c(1:10) * 100
numTrees <- c(1:10) * 10
for (i in 1:length(numTrain)) {
  for (j in 1:length(numTrees)) {
    t0 <- Sys.time()
    rf <- randomForest(label ~ ., data = train[1:numTrain[i], ],
                       ntree = numTrees[j])
    dt <- c(dt, Sys.time() - t0)
    ntrain <- c(ntrain, numTrain[i])
    ntrees <- c(ntrees, numTrees[j])
  }
}
df <- cbind.data.frame(ntrain, ntrees, dt)
I plotted two sets of plots.

library(ggplot2)
library(ggthemes)
theme_set(theme_minimal(11))
ggplot(data = df, aes(x = ntrees, y = dt)) +
  geom_line() +
  facet_wrap(~ntrain)

ggplot(data = df, aes(x = ntrain, y = dt)) +
  geom_line() +
  facet_wrap(~ntrees)
Based on these two plots: if we fix the number of training observations (mtry and tree depth are of course also fixed, since they are left at their default values), the training time is linear in the number of trees in the forest. Likewise, with the number of trees fixed, the training time is linear in the number of observations used in training.
So I guessed that the runtime of the Random Forest model is O(n * ntree) when mtry and tree depth are fixed.
Actually, I later found a post on Quora stating that the running time is O(n * ntree * d * mtry).
So I built a regression model with the interaction term ntrain*ntrees as an independent variable to predict the training time, by heavily extrapolating... (which is bad practice in regression analysis in general, but might be OK for a rough runtime estimate).

runTimeModel <- lm(dt ~ ntrain * ntrees, data = df)
summary(runTimeModel)
predict(runTimeModel, data.frame(ntrain = 42000, ntrees = 5000)) / 60 / 60

## Call:
## lm(formula = dt ~ ntrain * ntrees, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.57748 -0.19535 -0.02504  0.14021  1.47002
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    1.777e+00  6.347e-02  27.990   <2e-16 ***
## ntrain         8.917e-05  5.299e-05   1.683   0.0932 .
## ntrees        -6.885e-03  5.299e-04 -12.995   <2e-16 ***
## ntrain:ntrees  4.232e-05  4.423e-07  95.682   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2941 on 396 degrees of freedom
## Multiple R-squared:  0.9937, Adjusted R-squared:  0.9936
## F-statistic: 2.081e+04 on 3 and 396 DF,  p-value: < 2.2e-16

##       1
## 2.46076
So the model predicts that building a Random Forest with 5000 trees on the full feature set would take roughly two and a half hours, which is about as slow as building the KNN model. Tuning the parameters with K-fold cross-validation may be too slow.
Feature Engineering
This year I also tried to create some features to help the models. I based them on my intuition about what a human who does not yet know how to distinguish digits would find useful for this task, if they were taught by other people showing them examples of digits.

firstHorizontalPix <- vector()
lastHorizontalPix <- vector()
firstVerticalPix <- vector()
lastVerticalPix <- vector()
densityUpperLeft <- vector()
densityUpperRight <- vector()
densityBottomLeft <- vector()
densityBottomRight <- vector()
numberHorizontalLines <- vector()
numberVerticalLines <- vector()
symmetryHorizontal <- vector()
symmetryVertical <- vector()

t0 <- Sys.time()
data <- rbind.data.frame(train[, -1], test)
for (i in 1:nrow(data)) {
  m <- matrix(unlist(data[i, ]), nrow = 28, byrow = T)
  csum <- colSums(m)
  rsum <- rowSums(m)
  firstHorizontalPix <- c(firstHorizontalPix, min(which(rsum > 0)))
  lastHorizontalPix <- c(lastHorizontalPix, max(which(rsum > 0)))
  firstVerticalPix <- c(firstVerticalPix, min(which(csum > 0)))
  lastVerticalPix <- c(lastVerticalPix, max(which(csum > 0)))
  densityUpperLeft <- c(densityUpperLeft, sum(m[1:14, 1:14]) / sum(m))
  densityUpperRight <- c(densityUpperRight, sum(m[1:14, 15:28]) / sum(m))
  densityBottomLeft <- c(densityBottomLeft, sum(m[15:28, 1:14]) / sum(m))
  densityBottomRight <- c(densityBottomRight, sum(m[15:28, 15:28]) / sum(m))
  numberHorizontalLines <- c(numberHorizontalLines, sum(rsum > 0))
  numberVerticalLines <- c(numberVerticalLines, sum(csum > 0))
  symmetryHorizontal <- c(symmetryHorizontal, sum((m[, 28:15] > 0) * (m[, 1:14] > 0)))
  symmetryVertical <- c(symmetryVertical, sum((m[28:15, ] > 0) * (m[1:14, ] > 0)))
}

engFeatures <- cbind.data.frame(firstHorizontalPix, lastHorizontalPix,
                                firstVerticalPix, lastVerticalPix,
                                densityUpperLeft, densityUpperRight,
                                densityBottomLeft, densityBottomRight,
                                numberHorizontalLines, numberVerticalLines,
                                symmetryHorizontal, symmetryVertical)
engFeatures$densityUpper <- engFeatures$densityUpperLeft + engFeatures$densityUpperRight
engFeatures$densityBottom <- engFeatures$densityBottomLeft + engFeatures$densityBottomRight
engFeatures$densityLeft <- engFeatures$densityUpperLeft + engFeatures$densityBottomLeft
engFeatures$densityRight <- engFeatures$densityUpperRight + engFeatures$densityBottomRight
engFeatures$horizontalDiffs <- engFeatures$lastHorizontalPix - engFeatures$firstHorizontalPix
engFeatures$verticalDiffs <- engFeatures$lastVerticalPix - engFeatures$firstVerticalPix
engFeatures$firstHorizontalPix <- as.factor(engFeatures$firstHorizontalPix)
engFeatures$lastHorizontalPix <- as.factor(engFeatures$lastHorizontalPix)
engFeatures$firstVerticalPix <- as.factor(engFeatures$firstVerticalPix)
engFeatures$lastVerticalPix <- as.factor(engFeatures$lastVerticalPix)
firstHorizontalPix and lastHorizontalPix capture the row number where the first or last non-background pixel appears in the image. Similarly, firstVerticalPix and lastVerticalPix capture the column number where the first or last non-background pixel appears. After randomly plotting many digits in the dataset, I got the intuition that digit 0 tends to be shorter than digit 1 in this dataset and, of course, digit 7 tends to be fatter than digit 1. So I also counted the number of non-empty rows with numberHorizontalLines and the number of non-empty columns with numberVerticalLines.
densityUpperLeft, densityUpperRight, densityBottomLeft and densityBottomRight capture how dense the pixel values are in each quadrant of the image, and densityUpper, densityBottom, densityLeft and densityRight are partial sums of them. My intuition is that for digits like 8, 0, 1 and 3, the pixel density in the upper area will be close to that in the bottom part, while it will differ for digits like 7, 6, 2, 5, 4 and 9. The same kind of logic is behind the other sets of features I created.
symmetryHorizontal and symmetryVertical measure the degree of symmetry in the image.
horizontalDiffs and verticalDiffs describe how tall and how fat each digit is.
Feature Importance
I computed the feature importance using the Random Forest, and here is a plot of it.

library(dplyr)
featureImportance <- cbind.data.frame(featureName = names(engTest),
                                      MeanDecreaseGini = rf$importance[, "MeanDecreaseGini"])
featureImportance <- arrange(featureImportance, desc(MeanDecreaseGini))
featureImportance$featureName <- factor(featureImportance$featureName,
                                        levels = as.character(featureImportance$featureName))
ggplot(data = featureImportance, aes(x = featureName, y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "cyan3", width = 0.618) +
  coord_flip() +
  ggtitle("Feature Importance From Random Forest")
ggsave("featureImportance.png")
It seems that the features I created are somewhat important. To see this more clearly, here is the list of the top 20 features.

head(featureImportance,20) 

##            featureName MeanDecreaseGini
## 1                  PC2        2195.8956
## 2                  PC1        1853.3829
## 3    lastHorizontalPix        1818.2293
## 4   firstHorizontalPix        1786.4713
## 5                  PC5        1452.5578
## 6                  PC6        1173.8716
## 7        densityBottom        1136.1091
## 8         densityUpper        1114.4566
## 9                  PC8        1016.7097
## 10                 PC4        1013.9901
## 11                 PC7         955.1968
## 12   densityUpperRight         842.0081
## 13    densityUpperLeft         795.8877
## 14 numberVerticalLines         777.9377
## 15   densityBottomLeft         714.5263
## 16                 PC3         682.7365
## 17       verticalDiffs         666.8906
## 18                PC13         660.1781
## 19    firstVerticalPix         590.2418
## 20  densityBottomRight         581.4745
Cross-Validation with PCA
Even though we are down to only a bit more than 100 features, tuning mtry and ntree (or even tree depth) with K-fold cross-validation, with the PCA redone in each iteration, is time consuming. I coded it up, but never actually ran it.
Here is the code for choosing the mtry value for a fixed number of trees using such a method.

library(caret)
library(randomForest)
k <- 10
Y <- train$label
folds <- createFolds(Y, k = k)
mtryGrid <- c(1, 3, 5, 7, 9)
accuracy <- vector()
testFolds <- vector()
mtry <- vector()
for (i in 1:k) {
  for (j in 1:length(mtryGrid)) {
    testingFold <- folds[[i]]
    cvTrain <- train[-testingFold, 2:785]
    cvTrainLabel <- Y[-testingFold]
    cvTestLabel <- Y[testingFold]
    cvTest <- train[testingFold, 2:785]
    pca <- prcomp(cvTrain)
    cvPCTrain <- cbind.data.frame(pca$x[, 1:100], cvTrainLabel)
    cvPCTest <- predict(pca, cvTest)[, 1:100]
    rf <- randomForest(cvTrainLabel ~ ., data = cvPCTrain,
                       ntree = 100, mtry = mtryGrid[j])
    rf.pred <- predict(rf, cvPCTest)
    t <- table(cvTestLabel, rf.pred)
    acc <- sum(diag(t)) / sum(t)
    accuracy <- c(accuracy, acc)
    testFolds <- c(testFolds, i)
    mtry <- c(mtry, mtryGrid[j])
  }
}
Choosing mtry with OOB Error estimate
What I actually did was look at the OOB error estimates.
When each decision tree in the Random Forest is built, its training data is a bootstrap sample of the original dataset, which on average contains about two thirds of the unique observations from the original training set. So we can estimate the out-of-sample error rate using the remaining one third of out-of-bag data. This is a valid way of measuring the out-of-sample error rate, and for a Random Forest it is much faster than K-fold cross-validation.
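The "about two thirds" figure can be checked directly: a bootstrap sample of size n contains on average a fraction 1 - (1 - 1/n)^n, which approaches 1 - 1/e, of the distinct original observations. A quick sketch (n = 42000 matches the training set size; the 20 replicates are an arbitrary choice):

```r
# Fraction of distinct original observations that appear in a bootstrap sample;
# the remaining ~36.8% are "out of bag" and usable for error estimation.
set.seed(42)
n <- 42000
frac <- replicate(20, length(unique(sample(n, n, replace = TRUE))) / n)
mean(frac)        # close to 1 - exp(-1), i.e. about 0.632
```

So "two thirds" is a slight overstatement: the in-bag fraction is closer to 63.2%, but the idea is the same.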
I searched a grid of mtry and ntree values using the OOB method and generated a plot.

forests <- list()
ntreeTry <- c(1:40) * 100
for (i in 1:length(ntreeTry)) {
  print(paste("start to try ntreeTry ", ntreeTry[i]))
  set.seed(123)
  rf <- tuneRF(x = engTrain[, -1], y = engTrain$label,
               mtryStart = 1, ntreeTry = ntreeTry[i], plot = F)
  forests[[i]] <- rf
}
# tuneRF returns a matrix with an OOBError column; with mtryStart = 1 and the
# default stepFactor, each run tries mtry = 1, 2, 4, 8
df <- cbind.data.frame(ntrees = sort(rep(1:40, 4) * 100),
                       mtry = rep(c(1, 2, 4, 8), 40),
                       OOBError = unlist(lapply(forests, function(f) f[, "OOBError"])))
qplot(data = df, x = ntrees, y = OOBError, color = mtry, group = mtry, geom = "line")
It seems that for higher mtry values, fewer trees are needed before the OOB error starts to converge. Picking mtry = 8 and ntree = 1000 looks like a reasonable choice to me.
I am not sure whether the one-standard-error rule applies here, but after the calculation I found that for mtry = 8, the minimum number of trees needed so that the OOB error is no more than one standard error above the minimum is 500.
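For concreteness, here is a hypothetical sketch of that one-standard-error calculation. The oneSERule helper, the standard error value, and the decaying error curve are all made up for illustration; they are not the actual tuneRF output from above.

```r
# One-standard-error rule: take the smallest ntree whose OOB error is within
# one standard error of the minimum OOB error over the grid.
oneSERule <- function(ntreeGrid, oobErrors, se) {
  threshold <- min(oobErrors) + se
  ntreeGrid[min(which(oobErrors <= threshold))]
}

ntreeGrid <- c(1:40) * 100
# Made-up decaying OOB error curve, standing in for the real mtry = 8 results
oobErrors <- 0.05 + 0.02 * exp(-ntreeGrid / 300)
oneSERule(ntreeGrid, oobErrors, se = 0.005)
```

With this illustrative curve the rule picks a small ntree near the start of the plateau, which is exactly the kind of saving over the converged value that motivates the rule.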
So, after a much longer search, I arrived at the combination of mtry = 8 and ntree = 500, which is almost exactly the same as the values I hand-picked last year...
The hyperparameter values are almost identical, but I now have some new features I didn't have last year. Would these promising-looking features actually make the leaderboard happier?

set.seed(0306)
t0 <- Sys.time()
rf <- randomForest(label ~ ., data = engTrain, ntree = 500, mtry = 8)
print(Sys.time() - t0)
submit <- data.frame(ImageId = seq(1, nrow(engTest)),
                     Label = predict(rf, engTest, type = "class"))
write.csv(submit, file = "500Tree8MtryRFwith100PCAAndEngineeredFeatures.csv", row.names = F)
Not really... the leaderboard score dropped to 0.94329.
Closing
It may seem that I did a lot of things in vain :< but I think they were worth trying. No one can be lucky with every model they build.