Category Archives: Analytic

Kaggle Digit Recognizer Revisited (Using Convolutional NN with Keras)

Almost a year ago, I revisited the Kaggle version of the handwritten digit recognition problem; the link to that post is here. At that time, my go-to language was R, since the majority of friends around me used R as well. This summer, I switched back to Python as my primary language for almost everything, because it is just so efficient.

So, here is a convolutional neural network built with Keras to tackle this problem again. In less than 100 lines of code you can build a convolutional neural network and obtain 99% accuracy on the Kaggle leaderboard.

A quick note about training time: it took close to 9 minutes to train on my laptop with a GeForce GTX 970M GPU. You can increase the number of epochs and run it yourself, which may well lead to better results.
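For readers who want a starting point, here is a minimal sketch of such a network. It is not the original script from this post: the layer sizes, the names `row_to_image` and `build_model`, and the use of the Keras API that ships with TensorFlow are all assumptions made for illustration. It relies on the Kaggle file layout of one label column followed by 784 pixel columns per image.

```python
def row_to_image(pixels):
    """Turn one row of 784 pixel values (0-255) into a 28x28x1 nested list in [0, 1]."""
    if len(pixels) != 784:
        raise ValueError("expected 784 pixel values")
    return [[[pixels[r * 28 + c] / 255.0] for c in range(28)] for r in range(28)]


def build_model():
    """Build a small CNN. Layer sizes are hypothetical, not the original ones."""
    # Imported here so that row_to_image can be used without TensorFlow installed.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),           # 28x28 grayscale images
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),                      # regularization before the output layer
        layers.Dense(10, activation="softmax"),   # one output per digit 0-9
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",  # labels are integers 0-9
                  metrics=["accuracy"])
    return model
```

Training would then be a call to `build_model().fit(...)` on an array of images shaped `(n, 28, 28, 1)` and their integer labels, with the number of epochs left for you to experiment with.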


A Logistic Regression Benchmark for Red Hat Customer Business Value Prediction Problem

Red Hat put out a competition on Kaggle asking people to build models to predict customer potential. It is a simple binary classification problem, and the metric Red Hat uses to rank the models is the AUC score.
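As a reminder of what the AUC score measures, here is a small sketch in plain Python. It uses the pairwise definition: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The function name `auc_score` and the toy labels and scores are made up for illustration; for real data a library implementation is the better choice, since this version is quadratic in the number of examples.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 (positive, negative) pairs are ordered correctly.
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```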

I am rather late in joining this competition, and there are only 7 days to go. I sketched a rather simple logistic regression model, and it ranks somewhere in the middle among 2,200 teams in total. I was kind of surprised to see that a simple logistic regression can beat half of the participants.

My model uses all the features, and I found that the regularization parameter C works best at a value of 10.

Below is my code:

A quick post on Decision Trees

Today is my one-year anniversary of studying at Bentley University in MA, US. This short post on this special day is devoted to decision trees, a simple but often misused method. During my short time here, and in my limited conversations with marketing analysts, I have seen cases where they take the rules directly from a decision tree, throw them into slide decks, and present them to executives as "insights". I think they should be a bit more careful, since a decision tree can be rather unstable, i.e. small changes in the dataset can result in completely different trees.

Below are some Python scripts that I used to build classification trees on the iris data. Each time, I draw a random sample and feed that sample to the algorithm to build a tree. You can look at the differences yourself.

I will just upload pictures of a few of these trees.

[Images: iris_dt_entropy_70_percent_train_seed_36 and iris_dt_entropy_70_percent_train_seed_1106]

The two trees above are built on different random samples of the iris dataset. From the first one you get the rule: if petal length is less than or equal to 2.45 cm, then the flower is a setosa. From the second decision tree you get a different rule: if petal width is less than or equal to 0.8 cm, then the flower is a setosa. If you take a quick look at the scatter plot, you will see that both are pretty good rules, and if you only build one decision tree, you will miss one of them.


Below are two more trees built on different samples of the iris dataset. I personally would not try to interpret the splitting rules from the third level of the tree onward (taking the root node as the first level).

[Images: iris_dt_entropy_num_per_class_15_45_5 and iris_dt_entropy_num_per_class_15_45_10]

Anyway, the takeaway is this: try different algorithms, try different samples as input, build many trees and take what they have in common, rather than simply building one tree.
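The resampling idea can be sketched without any libraries. The code below is not the script used for the pictures above. It only finds the root split (by information gain, i.e. the entropy criterion) on repeated 70% samples and tallies which feature and threshold come out on top. The data is a small made-up table with two columns loosely named after petal length and petal width. It is not the iris dataset, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_root_split(rows, labels):
    """Return (feature_index, threshold) with the highest information gain."""
    base, n = entropy(labels), len(rows)
    best_feature, best_threshold, best_gain = None, None, -1.0
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold halfway between neighbours
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
            if gain > best_gain + 1e-12:  # ties keep the earlier feature
                best_feature, best_threshold, best_gain = f, t, gain
    return best_feature, best_threshold


def root_splits_across_samples(rows, labels, seeds, fraction=0.7):
    """Find the root split on one random sample (without replacement) per seed."""
    k = max(2, int(len(rows) * fraction))
    results = []
    for seed in seeds:
        idx = random.Random(seed).sample(range(len(rows)), k)
        results.append(best_root_split([rows[i] for i in idx], [labels[i] for i in idx]))
    return results


# Made-up measurements (length, width); two rows are deliberately noisy.
ROWS = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (1.6, 0.4), (3.2, 0.3),
        (4.5, 1.4), (4.0, 1.3), (4.7, 1.5), (4.4, 0.25),
        (5.8, 2.1), (6.0, 2.3), (5.5, 1.9), (5.1, 2.0)]
LABELS = ["a"] * 5 + ["b"] * 4 + ["c"] * 4

# Tally the root splits over 20 different samples and compare them.
print(Counter(root_splits_across_samples(ROWS, LABELS, seeds=range(20))))
```

If the tally is spread over several features or thresholds, a rule read off one single tree deserves some suspicion. If nearly all samples agree, the rule is more likely to be worth showing.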

D3: Visualizing Titanic Survivors by Gender, Age and Class


Titanic: Machine Learning from Disaster has been the 101-type machine learning competition on Kaggle since it started. The task is to predict who would survive the disaster, given information on each individual's age, gender, socio-economic status (class) and various other features.

Recently, during the winter break, I started learning the JavaScript library D3.js. The graph above is a screenshot of the first visualization project I created with D3.js. The live version of the project is here.

How to read the graph:

  • Each rectangle in the graph represents a passenger on the Titanic. Yellow means that the passenger survived the disaster, and blue means that the passenger did not.
  • There can be multiple people with the same age, gender, and class values, so I set the opacity of these rectangles to 20%. Places on the graph where you see solid yellow therefore show passengers with a higher chance of surviving, whereas solid blue indicates danger.
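To see why overlapping rectangles turn "solid", here is a small worked example. It assumes the usual alpha blending that browsers apply when semi-transparent shapes are drawn on top of each other, and that all stacked rectangles have the same color. The function name `stacked_opacity` is made up for this illustration.

```python
def stacked_opacity(n_layers, alpha=0.2):
    """Combined opacity of n same-colored layers, each drawn with opacity alpha."""
    # Each extra layer lets through only (1 - alpha) of what was still visible.
    return 1.0 - (1.0 - alpha) ** n_layers


for n in (1, 2, 5, 10, 20):
    print(n, round(stacked_opacity(n), 3))
```

One passenger alone is drawn at 20% opacity, two identical passengers already reach 36%, and the value approaches 100% as more passengers share the same spot.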

Based on this visualization we can see that:

  1. females (young or old, except around age 25) and young males (under age 15) from the middle and upper classes tend to survive.
  2. the overall survival rate for female passengers is higher than for male passengers.

So, without all the drama shown in the classic movie, this visualization basically predicts that Jack will most likely not be able to make it, but Rose will survive...


Update: Jan 14

I made a couple of changes to the visualization during the last few days. A newer version is now available here.


Make an Animation with R

In a recent project for my optimization and simulation class, I was tackling a portfolio optimization problem. I was asked to find a portfolio with the same return rate but lower portfolio variance. Using quadratic programming to obtain the minimal variance portfolio is not an issue for me.

The target portfolio has an expected return of 2.22% and an associated portfolio standard deviation of 3.31%. Using optimization, I found a portfolio that maintains the same return rate but has a lower portfolio standard deviation of 2.85%.
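For reference, the two quantities being compared can be computed as follows. This is only a sketch: the weights, mean returns and covariance matrix are made-up numbers, not the data from my project, and the function names are hypothetical. A typical formulation of the optimization is to minimize the variance w'Σw while holding the expected return w'μ at the target value. The exact constraints depend on the assignment.

```python
import math


def portfolio_return(weights, mean_returns):
    """Expected portfolio return: the weighted sum of the asset returns (w'mu)."""
    return sum(w * m for w, m in zip(weights, mean_returns))


def portfolio_std(weights, cov):
    """Portfolio standard deviation: sqrt(w' Sigma w)."""
    n = len(weights)
    variance = sum(weights[i] * cov[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    return math.sqrt(variance)


# Hypothetical two-asset example with uncorrelated assets.
mean_returns = [0.02, 0.01]
cov = [[0.04, 0.00],
       [0.00, 0.01]]
weights = [0.5, 0.5]

print(portfolio_return(weights, mean_returns))
print(portfolio_std(weights, cov))
```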

Here is a plot comparing the simulated return distributions for these two portfolios.


After I got this plot, I started thinking about a way to show the differences dynamically, ideally as an animation, and I wanted to do it in R. I found that the slope diagram might be a good chart type for this task, and I found an R package called animation for creating animations within R.

There is no default geom in ggplot for creating a slope diagram; however, I found that one can create a slope diagram using geom_segment().

Here is an example of a slope diagram. Basically, the vertical position represents the portfolio return. For each iteration of the simulation, we add a line segment connecting the return values of the two portfolios on the two sides, and we color the line segment.
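The data behind such a diagram is just one row per simulated iteration. Below is a sketch in Python of how that table could be built. The keys x, y, xend and yend mirror the aesthetics that geom_segment() expects. Drawing each return independently from a normal distribution, the green/red coloring rule and the function name are my assumptions for illustration, not necessarily what the original simulation did.

```python
import random


def simulate_segments(n_iter, mean=0.0222, sd_target=0.0331, sd_minvar=0.0285, seed=0):
    """Build one slope-diagram segment per simulated iteration."""
    rng = random.Random(seed)
    segments = []
    for i in range(n_iter):
        y_target = rng.gauss(mean, sd_target)   # left side: target portfolio
        y_minvar = rng.gauss(mean, sd_minvar)   # right side: minimal variance portfolio
        # Assumed coloring: green when the minimal variance portfolio does at least as well.
        color = "green" if y_minvar >= y_target else "red"
        segments.append({"iteration": i, "x": 0, "xend": 1,
                         "y": y_target, "yend": y_minvar, "color": color})
    return segments


segments = simulate_segments(1000)
share_green = sum(s["color"] == "green" for s in segments) / len(segments)
print(share_green)
```

Plotting the first k rows for k = 1, 2, 3, ... gives the frames of the animation, one more segment per frame.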


To use the animation package, one needs to write a loop or a function that draws all the plots needed for the animation in time order.

You can define your own function to make all the plots needed for your animation. Once you have done this, you can use the saveVideo() function from the animation package to generate an .mp4 animation.

To do so, you just need to specify your operating system and the path to the ffmpeg executable on your system, and then call saveVideo(), passing the plot-making function along with some parameters that set the video quality.

Here is the animation created using the above code.

The animation dynamically shows the performance difference between the two portfolios: while the minimal variance portfolio has a lower chance of losing a huge amount of money, it also gives up some chance of earning large returns. This is depicted by the difference in the density of points in the lower and upper parts of the two sides. Also, the amounts of red and green are approximately equal, suggesting that about half of the time the minimal variance portfolio yields a lower return than the target portfolio.

There are many other options in the animation package. You can use it to generate GIF, HTML or SWF animations with a similar method.