## About the competition

Digit Recognizer is a competition that has been hosted on *Kaggle* for years (almost three and a half years so far). The data for this competition come from the famous *MNIST* dataset, which has been studied extensively within the machine learning community.

I tried the competition a year ago with a very naive method (basically a wrong way of performing *PCA*, plus a *Random Forest* with hand-picked hyperparameter values). Recently, during the Thanksgiving break, I got some time and revisited the competition to see if I could improve on my past results.

## About the data set

The training and testing datasets are both *.csv* files. They are not big, but reading them with the fread() function from the *data.table* package is still a bit faster than read.csv().

```r
library(data.table)
train <- as.data.frame(fread("data/train.csv"))
train$label <- as.factor(as.character(train$label))
test <- as.data.frame(fread("data/test.csv"))
```

## Look at the dimensions of the data set

```r
dim(train)
dim(test)
table(train$label)
```

```
## [1] 42000   785
## [1] 28000   784
##
##    0    1    2    3    4    5    6    7    8    9
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188
```

The number of observations in the training set is *n = 42000* and the dimension of the feature space is *p = 784*.

Looking at these numbers, I thought the dimension of the feature space was somewhat high and that we might want to perform *PCA* on it.

Note that in textbooks, *high dimensional* problems refer to situations where *p is greater than n*, so technically that is not the case here.

Each observation represents a 28 by 28 pixel greyscale image, and each cell value is an integer between 0 and 255. Our task is to build a model to predict which digit (0 to 9) each image represents.

The labels are pretty balanced in the training set.

## Visualise the images in the dataset

We need to reshape the observations (each is a single row of 784 values) back into 28 by 28 matrices in order to display the images.

We can do so by writing a simple function to perform that transformation.

```r
rotate <- function(x) t(apply(x, 2, rev))
m <- rotate(matrix(unlist(train[1, -1]), nrow = 28, byrow = T))
image(m, col = grey.colors(255))
```

Here is another observation:

```r
m <- rotate(matrix(unlist(train[40462, -1]), nrow = 28, byrow = T))
image(m, col = grey.colors(255))
```

## Try a K-NN model first

Before going straight to PCA, I tried a K-NN model on the dataset first.

The *K-NN* model simply looks at the K points nearest to the one we want to make a prediction for and, in some sense, takes an average of those K nearest neighbours' values as the predicted value.
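To make the idea concrete, here is a minimal from-scratch sketch of that prediction step on a toy dataset (added for illustration; the helper `knn_predict` and the toy points are made up, and this is not the model used below):

```r
# Toy K-NN: classify a query point by majority vote of its k nearest neighbours.
knn_predict <- function(X, y, query, k) {
  d <- sqrt(rowSums(sweep(X, 2, query)^2))  # Euclidean distance to the query
  neighbours <- order(d)[1:k]               # indices of the k closest points
  names(which.max(table(y[neighbours])))    # majority vote over their labels
}

X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
y <- c("a", "a", "a", "b", "b", "b")
knn_predict(X, y, c(0.5, 0.5), k = 3)  # "a": all three neighbours are class "a"
```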

The model is often used to demonstrate the problem called *the curse of dimensionality*: when the dimensionality of the feature space is high, the feature space becomes very sparse and, as a result, the distances between any pair of data points become very long. So for any given point, the region (ideally a small local region) that tightly bounds its K nearest points is going to be very large, which becomes a problem for the *K-NN* model.
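A quick simulation can show this distance concentration (a sketch I am adding for illustration, not part of the original analysis): for uniformly random points, the ratio between the smallest and the average pairwise distance creeps toward 1 as *p* grows, so the "nearest" neighbours stop being meaningfully near.

```r
set.seed(1)
for (p in c(2, 100, 784)) {
  X <- matrix(runif(100 * p), nrow = 100)  # 100 random points in [0,1]^p
  d <- dist(X)                             # all pairwise Euclidean distances
  cat(sprintf("p = %4d  min/mean distance ratio = %.3f\n", p, min(d) / mean(d)))
}
```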

OK, I didn't think about all this complicated stuff last year when I did it. I built the *K-NN* model simply because it was the easiest one to code up, and I was lazy and always wanted to see how a simple model would do.

```r
library(class)
t0 <- Sys.time()
knn.model <- knn(train[, 2:785], test, cl = train[, 1], k = 50)
print(Sys.time() - t0)
submit <- data.frame(ImageId = seq(1, 28000), Label = knn.model)
write.csv(submit, file = "simpleKnn.csv", row.names = F)
```

```
## Time difference of 2.56604 hours
```

So it took two and a half hours (the runtime was re-measured on my newer laptop this year, as were all the runtimes in this post) to build the *K-NN* model on the full feature space with *k = 50*. The result is actually not bad: this model gives a leaderboard score of *0.94743*. But that is a long time for a single model. If we were to use *K-Folds Cross-Validation* to choose the optimal k value, it would take days if not weeks for even a handful of candidate k values.

## PCA the wrong way

Last year, when I performed *PCA* on the dataset, I actually merged the training features and testing features into a single dataset, ran *PCA*, and projected the merged dataset onto the first *100* PC directions.

```r
data <- rbind.data.frame(train[, 2:ncol(train)], test)
dim(data)
data.pca <- prcomp(data)
pve <- data.pca$sdev^2 / sum(data.pca$sdev^2)
cumsum(pve)[100]
plot(pve, pch = 19, type = "l")
```

```
## [1] 0.9149666
```

The percentage of variance explained by the first *100* principal components is about *91.5%*, so I transformed the merged dataset down to *100* dimensions.

```r
nt <- nrow(train)  # number of training observations
Y <- train$label   # training labels
smallTrain <- as.data.frame(data.pca$x[1:nt, 1:100])
smallTest <- as.data.frame(data.pca$x[(nt + 1):nrow(data), 1:100])
smallTrain$label <- Y
```
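As a side note, once the per-component variance fractions are computed, the smallest number of components hitting any variance target can be read off their cumulative sum. A self-contained sketch (with a simulated `pve` as a stand-in, since this snippet doesn't load the actual data):

```r
# Simulated stand-in for the real pve vector computed above.
set.seed(7)
pve <- sort(runif(784), decreasing = TRUE)
pve <- pve / sum(pve)

# Smallest number of PCs explaining at least 95% of the variance.
nPC95 <- which(cumsum(pve) >= 0.95)[1]
```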

I now know this is actually *cheating*, because I was implicitly using information from the testing set in the model building process. I should have used only the training set to learn the principal component directions, and then transformed both the training and testing data using the directions learned that way.

But anyway, the result showed that we can use *PCA* to greatly reduce the dimension of the features without losing much information.

Oh, I did make a plot to visualise the whole dataset using the first two principal components.

```r
library(ggplot2)
ggplot(data = smallTrain, aes(x = PC1, y = PC2)) +
  geom_point(aes(color = label), size = 1, alpha = 0.5) +
  xlab("first principal component") +
  ylab("second principal component")
```

## A Random Forest fitted to the leaderboard

```r
library(randomForest)
rf <- randomForest(label ~ ., data = smallTrain, mtry = 9, ntree = 500)
submit <- data.frame(ImageId = seq(1, nrow(smallTest)),
                     Label = predict(rf, smallTest, type = "class"))
write.csv(submit, file = "PCA_randomFroest.csv", row.names = F)
```

When building the *Random Forest* model last year, I basically used trial and error to make the leaderboard happy... that is, manually changing the values of *mtry* and *ntree* and watching how the Kaggle leaderboard score changed.

The score is *0.9510*, which is not bad and an improvement over the *K-NN* model.

And this is as far as I went last year (March 2014, to be specific).

## The right way to perform PCA

The right way to perform PCA on this problem is to learn the PCs using the training set only, then project both the training set and the testing set onto the leading principal component directions.

```r
train.pca <- prcomp(train[, -1])
smallTrain <- as.data.frame(train.pca$x[1:nt, 1:100])
smallTest <- as.data.frame(predict(train.pca, test)[, 1:100])
smallTrain$label <- Y
```

And things get more complicated when we do *K-Folds Cross-Validation*: we should perform *PCA* only on the *K-1* training folds to learn the directions, transform *all K folds* onto the leading PC directions, build the model on the transformed *K-1* folds, and finally obtain the estimated out-of-sample error rate for that run. So the runtime of *K-Folds Cross-Validation* is now even worse, since we need to redo the *PCA* in each iteration.

But it may not be that bad if training a *Random Forest* is reasonably fast. How can we know how long it will take? We can analyse the runtime.

## Understand the runtime of Random Forest

This Thanksgiving break, I wanted to know how long it would take to build a model on the full training set with a large number of trees in a *Random Forest*.

I did so by running experiments on a grid of values for the number of training observations and the number of trees, trying to understand the time complexity of building a *Random Forest* model with the default *mtry* value.

```r
library(randomForest)
dt <- ntrain <- ntrees <- vector()
numTrain <- c(1:10) * 100
numTrees <- c(1:10) * 10
for (i in 1:length(numTrain)) {
  for (j in 1:length(numTrees)) {
    t0 <- Sys.time()
    rf <- randomForest(label ~ ., data = train[1:numTrain[i], ],
                       ntree = numTrees[j])
    dt <- c(dt, Sys.time() - t0)
    ntrain <- c(ntrain, numTrain[i])
    ntrees <- c(ntrees, numTrees[j])
  }
}
df <- cbind.data.frame(ntrain, ntrees, dt)
```

I plotted two sets of plots:

```r
library(ggplot2)
library(ggthemes)
theme_set(theme_minimal(11))
ggplot(data = df, aes(x = ntrees, y = dt)) +
  geom_line() +
  facet_wrap(~ntrain)
```

```r
ggplot(data = df, aes(x = ntrain, y = dt)) +
  geom_line() +
  facet_wrap(~ntrees)
```

Based on these two plots: if we fix the number of training observations (with *mtry* and tree depth also fixed, since they are left at their default values), the training time is linear in the number of trees in the forest. And the relationship between training time and the number of training observations is likewise linear when the number of trees is fixed.

My guess is that the runtime of the *Random Forest* model is *O(n × ntree)* when *mtry* and tree depth are fixed.

Actually, I later found a post on Quora stating that the running time is *O(n × ntree × d × mtry)*.

So I built a regression model using the interaction term *ntrain × ntrees* as the independent variable to predict the training time by heavily extrapolating... (which is bad for regression analysis in general, but might be OK for runtime analysis purposes).

```r
runTimeModel <- lm(dt ~ ntrain * ntrees, data = df)
summary(runTimeModel)
predict(runTimeModel, data.frame(ntrain = 42000, ntrees = 5000)) / 60 / 60
```

```
Call:
lm(formula = dt ~ ntrain * ntrees, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.57748 -0.19535 -0.02504  0.14021  1.47002 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.777e+00  6.347e-02  27.990   <2e-16 ***
ntrain        8.917e-05  5.299e-05   1.683   0.0932 .  
ntrees       -6.885e-03  5.299e-04 -12.995   <2e-16 ***
ntrain:ntrees 4.232e-05  4.423e-07  95.682   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2941 on 396 degrees of freedom
Multiple R-squared:  0.9937, Adjusted R-squared:  0.9936 
F-statistic: 2.081e+04 on 3 and 396 DF,  p-value: < 2.2e-16

      1 
2.46076 
```

So the model predicts that building a *Random Forest* with 5000 trees on the full feature set would take roughly two and a half hours, which is about as slow as building the *K-NN* model. Tuning the parameters with *K-Folds Cross-Validation* may then be too slow.

## Feature Engineering

This year I also tried to create some features to help the models. I based them on my intuition about what a human would find useful for this task if, not knowing how to distinguish the digits beforehand, he were trained by other people showing him examples of them.

```r
firstHorizontalPix <- vector()
lastHorizontalPix <- vector()
firstVerticalPix <- vector()
lastVerticalPix <- vector()
densityUpperLeft <- vector()
densityUpperRight <- vector()
densityBottomLeft <- vector()
densityBottomRight <- vector()
numberHorizontalLines <- vector()
numberVerticalLines <- vector()
symmetryHorizontal <- vector()
symmetryVertical <- vector()
t0 <- Sys.time()
data <- rbind.data.frame(train[, -1], test)
for (i in 1:nrow(data)) {
  m <- matrix(unlist(data[i, ]), nrow = 28, byrow = T)
  csum <- colSums(m)
  rsum <- rowSums(m)
  firstHorizontalPix <- c(firstHorizontalPix, min(which(rsum > 0)))
  lastHorizontalPix <- c(lastHorizontalPix, max(which(rsum > 0)))
  firstVerticalPix <- c(firstVerticalPix, min(which(csum > 0)))
  lastVerticalPix <- c(lastVerticalPix, max(which(csum > 0)))
  densityUpperLeft <- c(densityUpperLeft, sum(m[1:14, 1:14]) / sum(m))
  densityUpperRight <- c(densityUpperRight, sum(m[1:14, 15:28]) / sum(m))
  densityBottomLeft <- c(densityBottomLeft, sum(m[15:28, 1:14]) / sum(m))
  densityBottomRight <- c(densityBottomRight, sum(m[15:28, 15:28]) / sum(m))
  numberHorizontalLines <- c(numberHorizontalLines, sum(rsum > 0))
  numberVerticalLines <- c(numberVerticalLines, sum(csum > 0))
  symmetryHorizontal <- c(symmetryHorizontal, sum((m[, 28:15] > 0) * (m[, 1:14] > 0)))
  symmetryVertical <- c(symmetryVertical, sum((m[28:15, ] > 0) * (m[1:14, ] > 0)))
}
engFeatures <- cbind.data.frame(firstHorizontalPix, lastHorizontalPix,
                                firstVerticalPix, lastVerticalPix,
                                densityUpperLeft, densityUpperRight,
                                densityBottomLeft, densityBottomRight,
                                numberHorizontalLines, numberVerticalLines,
                                symmetryHorizontal, symmetryVertical)
engFeatures$densityUpper <- engFeatures$densityUpperLeft + engFeatures$densityUpperRight
engFeatures$densityBottom <- engFeatures$densityBottomLeft + engFeatures$densityBottomRight
engFeatures$densityLeft <- engFeatures$densityUpperLeft + engFeatures$densityBottomLeft
engFeatures$densityRight <- engFeatures$densityUpperRight + engFeatures$densityBottomRight
engFeatures$horizontalDiffs <- engFeatures$lastHorizontalPix - engFeatures$firstHorizontalPix
engFeatures$verticalDiffs <- engFeatures$lastVerticalPix - engFeatures$firstVerticalPix
engFeatures$firstHorizontalPix <- as.factor(engFeatures$firstHorizontalPix)
engFeatures$lastHorizontalPix <- as.factor(engFeatures$lastHorizontalPix)
engFeatures$firstVerticalPix <- as.factor(engFeatures$firstVerticalPix)
engFeatures$lastVerticalPix <- as.factor(engFeatures$lastVerticalPix)
```

*firstHorizontalPix* and *lastHorizontalPix* capture the row number where the first or last non-background pixel appears in the image. Similarly, *firstVerticalPix* and *lastVerticalPix* capture the column number where the first or last non-background pixel appears. After randomly plotting many digits in the dataset, I got the intuition that digit *0* tends to be shorter than digit *1* in this dataset and, of course, digit *7* tends to be fatter than digit *1*. So I also counted the number of non-empty rows with *numberHorizontalLines* and the number of non-empty columns with *numberVerticalLines*.

*densityUpperLeft*, *densityUpperRight*, *densityBottomLeft* and *densityBottomRight* capture how dense the pixel values are in each quarter of the image, and *densityUpper*, *densityBottom*, *densityLeft* and *densityRight* are just partial sums of them. My intuition is that for digits like *8*, *0*, *1* and *3*, the pixel density in the upper half will be close to that of the bottom half, but it will differ for digits like *7*, *6*, *2*, *5*, *4* and *9*. The same kind of logic is behind the rest of these features.

*symmetryHorizontal* and *symmetryVertical* measure the degree of symmetry in the image.

*horizontalDiffs* and *verticalDiffs* describe how tall and how fat each digit is.

## Feature Importance

I computed the *feature importance* using a *Random Forest*, and here is a plot of it.

```r
library(dplyr)
featureImportance <- cbind.data.frame(featureName = names(engTest),
                                      MeanDecreaseGini = rf$importance[, "MeanDecreaseGini"])
featureImportance <- arrange(featureImportance, desc(MeanDecreaseGini))
featureImportance$featureName <- factor(featureImportance$featureName,
                                        levels = as.character(featureImportance$featureName))
ggplot(data = featureImportance, aes(x = featureName, y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "cyan3", width = 0.618) +
  coord_flip() +
  ggtitle("Feature Importance From Random Forest")
ggsave("featureImportance.png")
```

It seems that the features I created are somewhat important. To see it clearly, here is the list of the top 20 features.

```r
head(featureImportance, 20)
```

```
           featureName MeanDecreaseGini
1                  PC2        2195.8956
2                  PC1        1853.3829
3    lastHorizontalPix        1818.2293
4   firstHorizontalPix        1786.4713
5                  PC5        1452.5578
6                  PC6        1173.8716
7        densityBottom        1136.1091
8         densityUpper        1114.4566
9                  PC8        1016.7097
10                 PC4        1013.9901
11                 PC7         955.1968
12   densityUpperRight         842.0081
13    densityUpperLeft         795.8877
14 numberVerticalLines         777.9377
15   densityBottomLeft         714.5263
16                 PC3         682.7365
17       verticalDiffs         666.8906
18                PC13         660.1781
19    firstVerticalPix         590.2418
20  densityBottomRight         581.4745
```

## Cross-Validation with PCA

Even though we are down to only a little more than 100 features, tuning *mtry* and *ntree* (or even tree depth) with *K-Folds Cross-Validation*, with *PCA* redone in each iteration, is still time consuming. I coded it up, but never actually ran it.

Here is the code for choosing the *mtry* value for a fixed number of trees using such a method.

```r
library(caret)
library(randomForest)
k <- 10
folds <- createFolds(Y, k = k)
mtryGrid <- c(1, 3, 5, 7, 9)
accuracy <- vector()
testFolds <- vector()
mtry <- vector()
for (i in 1:k) {
  for (j in 1:length(mtryGrid)) {
    testingFold <- folds[[i]]
    cvTrain <- train[-testingFold, 2:785]
    cvTrainLabel <- Y[-testingFold]
    cvTestLabel <- Y[testingFold]
    cvTest <- train[testingFold, 2:785]  # the held-out fold comes from the training set
    pca <- prcomp(cvTrain)               # learn the PCs on the K-1 training folds only
    cvPCTrain <- cbind.data.frame(pca$x[, 1:100], cvTrainLabel)
    cvPCTest <- predict(pca, cvTest)[, 1:100]
    rf <- randomForest(cvTrainLabel ~ ., data = cvPCTrain,
                       ntree = 100, mtry = mtryGrid[j])
    rf.pred <- predict(rf, cvPCTest)
    t <- table(cvTestLabel, rf.pred)
    acc <- sum(diag(t)) / sum(t)
    accuracy <- c(accuracy, acc)
    testFolds <- c(testFolds, i)
    mtry <- c(mtry, mtryGrid[j])
  }
}
```

## Choosing mtry with OOB Error estimate

What I actually did was look at the *OOB Error estimates*.

When each *decision tree* in the *Random Forest* is built, the training data for that tree is a *bootstrap sample* of the original dataset, which on average contains about *2/3* of the unique observations from the original training set. So we can estimate the out-of-sample error rate with the remaining roughly 1/3 of *out-of-bag* data. This is a valid way of measuring the out-of-sample error rate, and for a *Random Forest* it is much faster than *K-Folds Cross-Validation*.
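The *2/3* figure itself is easy to verify with a tiny simulation (added here for illustration): the expected fraction of distinct observations in a bootstrap sample of size n is 1 − (1 − 1/n)^n, which approaches 1 − 1/e ≈ 0.632, leaving roughly 1/3 out of bag.

```r
set.seed(42)
n <- 10000
# Average fraction of distinct original observations across 50 bootstrap samples.
frac <- mean(replicate(50, length(unique(sample(n, n, replace = TRUE))) / n))
round(frac, 3)  # close to 1 - exp(-1), i.e. about 0.632
```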

I searched a grid of *mtry* and *ntree* values using the OOB method, and generated a plot.

```r
forests <- list()
ntreeTry <- c(1:40) * 100
for (i in 1:length(ntreeTry)) {
  print(paste("start to try ntreeTry ", ntreeTry[i]))
  set.seed(123)
  rf <- tuneRF(x = engTrain[, -1], y = engTrain$label,
               mtryStart = 1, ntreeTry = ntreeTry[i], plot = F)
  forests[[i]] <- rf
}
# Each tuneRF result holds the OOBError for the mtry values it tried (1, 2, 4, 8 here).
oob <- do.call(rbind, forests)
df <- cbind.data.frame(ntrees = sort(rep(1:40, 4) * 100),
                       mtry = rep(c(1, 2, 4, 8), 40),
                       OOBError = oob[, "OOBError"])
qplot(data = df, x = ntrees, y = OOBError, color = mtry, group = mtry, geom = "line")
```

It seems that for higher *mtry* values, the *ntree* value at which the *OOB Error* starts to converge is *lower*. Picking *mtry = 8* and *ntree = 1000* looks like a reasonable choice to me.

I am not sure whether the *one-standard-error rule* applies here, but after the calculation I found that for *mtry = 8*, the minimum number of trees needed so that the OOB Error is no more than one standard error above the minimum is *500*.
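For what it's worth, here is a sketch of how such a one-standard-error calculation could look. The OOB error curve and the standard-error estimate below are made-up stand-ins, not the actual numbers from my tuneRF runs:

```r
# Hypothetical OOB errors for mtry = 8 at increasing ntree values.
ooberr <- c(0.060, 0.052, 0.048, 0.046, 0.045, 0.045, 0.044, 0.044)
ntrees <- c(100, 200, 300, 400, 500, 600, 800, 1000)

se <- sd(ooberr) / sqrt(length(ooberr))  # crude stand-in for a per-point SE
threshold <- min(ooberr) + se            # "within one SE of the minimum"
smallest <- min(ntrees[ooberr <= threshold])
smallest  # the smallest ntree whose OOB error is within one SE of the best
```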

So, after a much longer search, I arrived at the combination *mtry = 8* and *ntree = 500*, which is almost exactly the same as the values I hand-picked last year...

The hyperparameter values are almost identical; however, I now have some new features that I didn't have last year. Would these promising-looking features really help make the leaderboard happier?

```r
set.seed(0306)
t0 <- Sys.time()
rf <- randomForest(label ~ ., data = engTrain, ntree = 500, mtry = 8)
print(Sys.time() - t0)
submit <- data.frame(ImageId = seq(1, nrow(engTest)),
                     Label = predict(rf, engTest, type = "class"))
write.csv(submit, file = "500Tree8MtryRFwith100PCAAndEngineeredFeatures.csv",
          row.names = F)
```

Not really... the leaderboard score drops to *0.94329*.

## Closing

It may seem that I did a lot of things in vain *:<*, but I think they were all worth trying. No one can be lucky with every model he builds.