Kaggle Digit Recognizer Revisited (Using R)

Digit Recognizer is a competition that has been hosted on Kaggle for years (almost three and a half years so far?). The data for this competition were taken from the famous MNIST dataset, which has been extensively studied within the machine learning community (here is more information on that).

I tried the competition a year ago with a very naive method (basically a wrong way of performing PCA, plus a Random Forest with hand-picked hyperparameter values). Recently, during the Thanksgiving break, I found some time and revisited the competition to see if I could improve on my past results.

The training and testing datasets are both .csv files. They are not big, but using the fread() function from the data.table package can be a bit faster.

Look at the dimensions of the data set

The number of observations in the training set is n = 42000 and the dimension of the feature space is p = 784.

Looking at these numbers, I thought that the dimensionality of the features was somewhat high and that we might want to perform PCA.

Note: in textbooks, high-dimensional problems refer to situations where p is greater than n, so technically that is not the case here.

Each observation represents a 28 by 28 pixel grayscale image, and each cell value is an integer between 0 and 255. Our task is to build a model that predicts which digit (0 to 9) each image represents.

The labels are pretty balanced in the training set.

Visualise the images in the dataset

We need to reshape the observations (each is a single row with 784 values) back into a 28 by 28 matrix in order to display the images.
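The post does this with a small R function, but as a hedged illustration, the same reshape is a one-liner in Python/NumPy (the function name below is mine, not from the original code):

```python
import numpy as np

def row_to_image(row):
    """Reshape a flat vector of 784 pixel values into a 28x28 matrix.

    MNIST pixels are stored row-major, so a plain reshape recovers the image.
    Assumes the row really has 28 * 28 = 784 values.
    """
    pixels = np.asarray(row)
    assert pixels.size == 784, "expected a 784-value row"
    return pixels.reshape(28, 28)

# Quick check: pixel 0 lands at (0, 0), pixel 28 starts row 1.
img = row_to_image(np.arange(784))
print(img.shape)   # (28, 28)
print(img[1, 0])   # 28
```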

We can do so by writing a simple function to perform that transformation. Here is another observation:

Try a K-NN model first

Before going straight to PCA, I tried a K-NN model on the dataset first.

The K-NN model simply looks at the K points nearest to the one we want to predict and, in some sense, takes an average of those K nearest neighbors' values as the predicted value.

The model is often used to demonstrate the problem called the curse of dimensionality: when the dimensionality of the feature space is high, the feature space becomes very sparse and, as a result, the distance between any pair of data points becomes very large. So for any given point, the region (which ideally should be a small local region) that tightly bounds its K nearest points becomes very large, and that is a problem for the K-NN model.
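To make the K-NN idea concrete, here is a minimal sketch in Python (the post's actual model was built in R; the function name and toy data below are mine): classify a point by majority vote among its K nearest training points.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two tight clusters as toy training data.
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.0]), k=3))  # 0: two of the three nearest are class 0
```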

Ok, I didn't think about any of this complicated stuff last year when I did it. I built the K-NN model simply because it is the easiest one to code up, I was lazy, and I always want to see how a simple model performs.

So it took two and a half hours (this runtime, like all runtimes in this post, was recalculated on my newer laptop this year) to build the K-NN model on the full feature space with k = 50. The result is actually not bad: this model gives a leaderboard score of 0.94743. But that is a long time to build a single model. If we were to use K-fold cross-validation to choose the optimal k value, it would take days if not weeks for even a handful of candidate k values.

PCA the wrong way

Last year, when I performed PCA on this dataset, I actually merged the training and testing features into a single dataset, computed the principal components, and projected the merged data onto the first 100 PC directions, bringing it down to 100 dimensions. The first 100 principal components explain about 91.5% of the variance.

I now know this is actually cheating, because I was implicitly using information from the testing set during model building. I should have used only the training set to learn the principal component directions, and then transformed both the training and testing data using those directions.

But in any case, the result showed that we can use PCA to greatly reduce the dimensionality of the features without losing much information.
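The correct protocol is easy to express in code. Here is a sketch using scikit-learn's PCA on random stand-in data (the post itself uses R; the shapes here are just illustrative): fit on the training features only, then apply the same learned directions to both sets.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 784))   # stand-in for the 42000 x 784 training pixels
X_test = rng.normal(size=(50, 784))     # stand-in for the test pixels

pca = PCA(n_components=100)
pca.fit(X_train)                  # directions learned from the training set ONLY
Z_train = pca.transform(X_train)  # both sets projected with the SAME directions
Z_test = pca.transform(X_test)

print(Z_train.shape, Z_test.shape)  # (200, 100) (50, 100)
```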

Oh, I did make a plot to visualise the whole dataset using the first two principal components.

A Random Forest fitted to the leaderboard

When building the Random Forest model last year, I used trial and error to make the leaderboard happy... That is, I manually changed the values of mtry and ntree and watched how the Kaggle leaderboard score changed.

The score was 0.9510, which is not bad, and an improvement over the K-NN model.

And that is how far I got last year (March 2014, to be specific).

The right way to perform PCA

The right way to perform PCA on this problem is to learn the principal components using only the training set, and then project both the training set and the testing set onto the first however-many principal component directions.

And things get more complicated when we do K-fold cross-validation: we should perform PCA only on the K-1 training folds to learn the directions, transform all K folds onto those PC directions, build the model on the transformed K-1 folds, and finally obtain the estimated out-of-sample error rate for that run. So the runtime for K-fold cross-validation is now even worse, since we need to redo PCA in each iteration.
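One way to automate that bookkeeping (a Python sketch, not the post's R code) is to put PCA and the model in a pipeline, so every cross-validation iteration refits PCA on its own K-1 training folds and the held-out fold is only ever transformed, never fit on:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)  # small 8x8 digits as a stand-in for MNIST

pipe = Pipeline([
    ("pca", PCA(n_components=30)),  # refit inside every fold
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# cross_val_score fits the whole pipeline on each set of K-1 folds,
# so PCA never sees the held-out fold before transforming it.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```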

It might not be that bad if training a Random Forest is reasonably fast, but how can we know how long it will take? We can analyze the runtime.

Understand the runtime of Random Forest

This Thanksgiving break, I wanted to know how long it would take to build a model on the full training set with a large number of trees in a Random Forest.

I did so by running experiments on a grid of values for the number of training observations and the number of trees in the Random Forest, trying to understand the time complexity of building a Random Forest model with the default mtry value.

I plotted two sets of plots.

Based on these two plots, if we fix the number of training observations (mtry and the depth of the trees are also fixed, since they are left at their default values), then the training time is linear in the number of trees in the forest. The same linear relationship holds between training time and the number of training observations when the number of trees is fixed.

My guess was that the runtime for the Random Forest model is O(n*ntree) when mtry and the tree depth are fixed.

Actually, I later found a post on Quora stating that the running time is O(n*ntree*d*mtry).

So I built a regression model using the interaction term ntree*ntrain as the independent variable to predict the training time by heavily extrapolating... (which is bad for regression analysis in general, but might be OK for runtime-analysis purposes).
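The extrapolation itself is just a one-variable linear fit on the product ntree * ntrain. A Python sketch with made-up timings (the numbers below are illustrative, not my actual measurements):

```python
import numpy as np

# Hypothetical (ntrain, ntree, seconds) measurements on a small grid.
ntrain = np.array([5000, 5000, 10000, 10000, 20000, 20000])
ntree  = np.array([100,  200,  100,   200,   100,   200])
secs   = np.array([20.0, 39.0, 42.0,  82.0,  81.0,  164.0])

x = ntrain * ntree  # the interaction term is the single predictor
slope, intercept = np.polyfit(x, secs, 1)

# Heavy extrapolation: full training set (42000 rows) with 5000 trees.
pred = slope * (42000 * 5000) + intercept
print(pred / 3600)  # predicted training time in hours
```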

The model predicts that building a Random Forest with 5000 trees on the full feature set would take roughly two and a half hours. That is about as slow as building the K-NN model, so tuning the parameters using K-fold cross-validation may be too slow.

Feature Engineering

This year I also tried creating some features to help the models. I based them on my intuition about what a human, if he did not already know how to distinguish digits, would find useful for this task after being shown examples of digits by other people.

firstHorizontalPix and lastHorizontalPix capture the row number where the first or last non-background pixel appears in the image. Similarly, firstVerticalPix and lastVerticalPix capture the column number where the first or last non-background pixel appears. After randomly plotting many digits in the dataset, I got the intuition that some digits tend to be shorter than digit 1 in this dataset, and, of course, digit 7 tends to be fatter than digit 1. So I also counted the number of non-empty rows with numberHorizontalLines and the number of non-empty columns with numberVerticalLines.

densityUpperLeft, densityUpperRight, densityBottomLeft, and densityBottomRight are created to capture how dense the pixel values are in each quadrant of the image, and densityUpper, densityBottom, densityLeft, and densityRight are just partial sums of them. My intuition is that for digits like 8, 0, 1, and 3, the pixel density in the upper area will be close to that of the bottom part, but it will differ for digits like 7, 6, 2, 5, 4, and 9. The same kind of logic is behind the other feature sets I created.

symmetryHorizontal and symmetryVertical measure the degree of symmetry in the image.

horizontalDiffs and verticalDiffs describe how tall and how fat each digit is.
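As a hedged sketch of a few of these features (the post computes them in R; the Python function and helper names below are mine), given a 28 by 28 image, the bounding-box, line-count, and quadrant-density features can be computed like this:

```python
import numpy as np

def ink_features(img, background=0):
    """Compute a few bounding-box and density features for a 28x28 image.

    Assumes the image contains at least one non-background pixel.
    """
    mask = img != background
    rows = np.where(mask.any(axis=1))[0]  # rows containing any ink
    cols = np.where(mask.any(axis=0))[0]  # columns containing any ink
    half = img.shape[0] // 2
    total = img.sum() or 1                # guard against division by zero
    return {
        "firstHorizontalPix": int(rows[0]),
        "lastHorizontalPix": int(rows[-1]),
        "firstVerticalPix": int(cols[0]),
        "lastVerticalPix": int(cols[-1]),
        "numberHorizontalLines": int(len(rows)),  # count of non-empty rows
        "numberVerticalLines": int(len(cols)),    # count of non-empty columns
        "densityUpperLeft": img[:half, :half].sum() / total,
        "densityBottomRight": img[half:, half:].sum() / total,
    }

# A toy "image": a vertical bar of ink in columns 10-12, rows 4-24.
img = np.zeros((28, 28), dtype=int)
img[4:25, 10:13] = 255
feats = ink_features(img)
print(feats["firstHorizontalPix"], feats["lastHorizontalPix"])  # 4 24
```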

Feature Importance

I got the feature importance using Random Forest, and here is a plot of that. It seems the features I created are somewhat important. To see it more clearly, here is the list of the top 20 features.

Cross-Validation with PCA

Even though we are down to only a little more than 100 features, tuning mtry and ntree (or even tree depth) with K-fold cross-validation, with PCA in each iteration, is time consuming. I coded it up, but never really ran it.

Here is the code for choosing the mtry value for a fixed number of trees using such a method.

Choosing mtry with OOB Error estimate

What I actually did was look at the OOB error estimates.

When each decision tree in the Random Forest is built, its training data is a bootstrap sample of the original dataset, which on average contains about 2/3 of the unique observations from the original training set. So we can estimate the out-of-sample error rate with the remaining roughly 1/3 out-of-bag data. This is a valid way of measuring the out-of-sample error rate, and for Random Forests it is much faster than K-fold cross-validation.
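The same OOB shortcut exists outside R's randomForest package; for example, here is a sketch with scikit-learn's equivalent option (small built-in digits data as a stand-in):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

# oob_score=True scores each observation using only the trees whose
# bootstrap sample did NOT contain it -- no separate held-out set needed.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(1 - rf.oob_score_)  # OOB error estimate
```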

I searched a grid of mtry and ntree values using the OOB method and generated a plot. It seems that for higher mtry values, the ntree value at which the OOB error starts to converge is lower. Picking mtry = 8 and ntree = 1000 looks like a reasonable choice to me.

I am not sure whether the one-standard-error rule applies here, but after the calculation I found that for mtry = 8, the minimum number of trees needed for the OOB error to be no more than one standard error above the minimum value is 500.

So, after a much longer search, I found the combination of mtry = 8 and ntree = 500, which is almost exactly what I hand-picked last year....

The hyperparameter values are almost identical; however, I made some new features that I didn't have last year. Would these promising-looking features really help make the leaderboard happier?

Not really... the leaderboard score drops to 0.94329.

Closing

It may seem that I did a lot of things in vain :< , but I think they were worth trying. No one can be lucky with every model he builds.

R Workshops I hosted recently

I have been hosting a small weekly workshop at Bentley University for more than two months now.

Basically, I have been introducing machine learning and data analytics using R. Each workshop runs about an hour and a half, and I usually spend 7-8 hours preparing materials. I find the process extremely helpful for me as well.

The thing I am most concerned about is how to use examples that are easy enough for people without much technical or math background to build sufficient intuition and understanding. Some of the people attending my workshop actually seldom do any coding in R (or any programming language).

I guess the fact that they still come every week is the best proof that I have been doing a decent job explaining procedures like cross-validation and bootstrapping, the algorithms behind K-means and decision trees, and the intuition behind SVM and PCA.

If you are interested in taking a look at the materials for my workshops, you can find them here. Here is an example slide for our workshop.

We will start doing Kaggle competitions in November, or at least... I will start doing that again in November and share my experiences.

My Solution to the Analytic Edge Kaggle Competition

The Kaggle competition for MIT's course The Analytic Edge on edX is now over.

Here is my solution to the competition. It ranked 244 on the final leaderboard, which is by no means great, but still within the top 10%.

My approach was to minimize the effort of doing feature engineering by hand, instead using a couple of automated methods to extract features from the text in the dataset. For the modeling process, I used the xgboost package, using cross-validation to pick the number of iterations so as to avoid overfitting to the training set.

0 Housekeeping

Set up the working directory

Function Definition

Function for dummy encoding

1 Data Preparation

Imputing missing categorical data
I used a really simple approach, after inspecting the correlations among them.

Remove data entries whose NewsDesk value does not appear in the testing data

Change "" to "Other"; this has no effect on modeling, I just don't like seeing ""

Log-transform "WordCount"

2 Feature Extraction

QR = question mark in the title?

Extract Hour and day in week

Extract all headline and abstract to form a corpus

Corpus processing

Document ~ TF-IDF matrix and Document ~ TF matrix

Get frequent terms matrix as feature

Clustering

PCA

LDA

Dummy Encoding

3 Model Fitting

Using cross-validation to pick the number of rounds for xgboost

Week 2 VIDEO 4

Linear Regression (one variable)

I will try to use scikit-learn to do the modeling in Python.

Notice that scikit-learn expects X to be a 2-D matrix rather than a 1-D array when fitting the model, so simply using clf.fit(wine['AGST'], wine['Price']) won't work.

Or equivalently

There is no simple way to output a detailed summary of the model in scikit-learn like summary(model) in R.

You can get the intercept and coefficients of the fitted model like this:
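Since the wine data isn't shown in this excerpt, here is a minimal sketch with a made-up stand-in for wine[['AGST']] and wine['Price'] (note that X is 2-D):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data constructed so that y = 0.5 * x - 1.5 exactly.
X = np.array([[15.0], [16.0], [17.0], [18.0]])  # 2-D, one column
y = np.array([6.0, 6.5, 7.0, 7.5])

clf = LinearRegression()
clf.fit(X, y)

print(clf.intercept_)  # fitted intercept, approximately -1.5
print(clf.coef_)       # one coefficient per column of X, approximately [0.5]
```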

Sum of Squared Errors

If you really want a detailed statistical summary of the model, take a look at statsmodels rather than scikit-learn.

Linear Regression (all variables)

Store the predictors in a list

I guess I should produce a nicer printing...

VIDEO 6

Correlations

Pearson correlation coefficient between two variables

Correlation Matrix
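Both computations can be sketched in Python with NumPy (the vectors below are made-up stand-ins for the course's wine variables):

```python
import numpy as np

agst = np.array([15.0, 16.0, 17.0, 18.0])
price = np.array([6.1, 6.4, 7.1, 7.4])

# Pearson correlation coefficient between two variables.
r = np.corrcoef(agst, price)[0, 1]
print(r)

# Correlation matrix of several variables, stacked as rows.
mat = np.corrcoef(np.vstack([agst, price, -price]))
print(mat.shape)  # (3, 3)
```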

Is There A Relationship Between Voter Gender And Preference Among Candidates?

Introduction:

I'm interested in whether there is a relationship between voters' gender and their preference among candidates. To be specific, I want to test whether there is a statistically significant difference in voting preference between male and female voters. The result of this research might raise questions such as whether voting is biased by voters' gender in general, which would be an interesting question for further research.

Data:

I am using the American National Election Studies (ANES) [http://www.electionstudies.org/] dataset for this research. The dataset is collected by conducting surveys of voters in the United States before and after every presidential election. Even though the surveys are carried out on a random sample of all U.S. voters, since there is no experiment involved in the data collection stage, my analysis of this dataset is an observational study. My research results will only generalize to the people who intended to vote in the 2008 election; they cannot be generalized to all Americans, nor to the 2012 election. Also, I cannot draw a causal relationship between voters' gender and their voting preference.

Each case in the dataset represents a voter surveyed by ANES. I will be using voters' gender and whom they were interested in voting for in the 2008 election as the two variables to work with. The voter's gender is coded as "gender_respondent" in the dataset, and it is a categorical variable taking values "Female" or "Male". The candidate the voter is interested in voting for is coded as "interest_whovote2008", and it is also a categorical variable, taking values "Barack Obama", "John Mccain", and "Other".

Exploratory data analysis:

Since the dataset also includes people who didn't vote in the 2008 election, I first created a subset of the original dataset containing only the people who actually voted in 2008, with the two variables of interest. Among those who reported voting in 2008, those who didn't report the candidate they voted for were then excluded. anesVot is the name of the resulting dataset used in the following analysis.

The resulting dataset anesVot contains 4520 observations; each case represents an individual who not only voted in the 2008 election, but also reported whom he or she voted for.

Overall, there are 2342 females and 2178 males in the dataset. 2704 voters voted for Barack Obama, 1702 voters voted for John Mccain, and 114 voters voted for candidates other than the previous two.

63% of female voters voted for Barack Obama, versus 56% of male voters.

The percentage of voters who voted for John Mccain is the same 38% for both genders.

3% of female voters and 2% of male voters voted for candidates other than the two major competitors. So it seems from these calculations that female voters slightly prefer Barack Obama over John Mccain and other candidates.

To see it graphically, the plot above shows the difference between gender and preferred candidates, as described before.

This difference could be real; however, it could also simply be due to chance. If we drew another sample from the people who voted in the 2008 election, this difference might not appear at all. So I want to carry out a Chi-square test of independence.

Inference:

Since I want to test for dependence between two categorical variables, I will use the Chi-square test of independence.

The null hypothesis ${H}_{0}$ for my Chi-square independence test is: voter gender and preference among candidates are independent. The alternative hypothesis ${H}_{A}$ is: voter gender and preference among candidates are dependent.

Let's check the conditions for carrying out the desired Chi-square independence test. The data are collected from a random sample of U.S. voters without replacement. According to the official Federal Election Commission report [http://www.fec.gov/pubrec/fe2008/federalelections2008.pdf], 131,313,820 voters voted in the 2008 election.

So the sample is less than 10% of the population.

The expected count is calculated using the formula

$\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}}$

Calculating the expected value for each cell will produce a table as follows:

So the required condition that each cell have an expected count of at least 5 is satisfied.

Recall that the formula for the Chi-square statistic is:

$\chi^2 = \sum_{i=1}^{k} \frac{(O - E)^2}{E}$

and the formula for the degrees of freedom is:

$df = (R - 1) \times (C - 1)$

There is a built-in function in R that can do this calculation for us.

The output suggests that the calculation carried out earlier is correct.

The resulting $p$-value is less than $0.05$, which suggests that we should reject the null hypothesis ${H}_{0}$ and conclude that there is a dependent relationship between voter gender and preference among candidates.
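For completeness, the same kind of test can be sketched in Python with scipy.stats.chi2_contingency; note that the contingency table below is hypothetical, not the actual ANES counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical gender x candidate contingency table (rows: Female, Male;
# columns: Obama, McCain, Other) -- for illustration only.
observed = np.array([
    [1470, 890, 70],
    [1230, 820, 45],
])

chi2, p, dof, expected = chi2_contingency(observed)
print(dof)        # (2 - 1) * (3 - 1) = 2
print(p < 0.05)   # reject H0 at the 5% level?
```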

Conclusion:

This research finds that voters' gender does affect whom they are likely to vote for in the 2008 election. It seems that Barack Obama was more successful than John Mccain at winning women's favour.

As for future studies, I guess I will try to find data to see whether this kind of dependence can also be found in previous and follow-up elections.

Reference

http://www.electionstudies.org/ "The American National Election Studies"

http://www.fec.gov/pubrec/fe2008/federalelections2008.pdf "Official Federal Election Commission report"