Category Archives: Analytic

Repost of my KDD Cup 2018 summary

I finished 30th place at this year's KDD CUP. I still remember back to 2015, when I was very rusty with coding and tried to attempt that years' KDD cup with my potato laptop Lenovo U310. I did not know what I was doing, all I did is trying to throw data into XGBoost and my performance then is a joke. I see myself became more and more capable of comming up with ideas and implement them out during these two years. And below is a repost of my summary to KDD 2018.

Hooray~! fellow KDD competitors. I entered this competition on day 1 and very quickly established a reasonable baseline. Due to some personal side of things, I practically stopped improving my solutions since the beginning of May. Even though my methods did not work really well compared to many top players in phase 2, but I think my solution may worth sharing due to it is relative simplicity. I did not touch the meo data at all, and one of my models is just calculating medians.

Alternative data source

For new hourly air quality data, as shared in the forum, I am using this for London and this for Beijing instead of the API from the organizer.

Handling missing data

I filled missing values in air quality data with 3 steps:

  1. Fill missing values for a station-measure combo based on the values from other stations.
    To be specific: I trained 131 lightgbm regressors for this. If PM2.5 reading on 2:00 May 20th is missing for Beijing aotizhongxin station, the regressor aotizhongxin_aq-PM2.5 will predict this value based on known PM2.5 readings on 2:00 May 20th from 34 other stations in Beijing.
    I used thresholds to decide whether to do this imputation or not. If more than the threshold number of stations also don't have a reading, then skip this step.
  2. Fill the remaining missing values by looking forward and backward to find known values.
  3. Finally, replace all remaining missing values by overall mean value.


1. median of medians

This is a simple model that worked reasonably well in this Kaggle competition.

To predict PM2.5 reading on 2:00 May 20th for aotizhongxin, look back for a window of days history, calculating the median 2:00 PM2.5 readings from aotizhongxin in that window. You do this median calculation exercise for a bunch of different window sizes to obtain a bunch medians. The median value of those medians is used as the prediction.

Intuitively this is just an aggregated yesterday once more. With more larger windows in the collection, the model memorizes the long-term trend better. The more you add in smaller windows, the quicker the model would respond to recent events.

2. facebooks' prophet

This is practically even simpler than the median of medians. I treated the number of days history I throw at it and the model parameters changepoint_prior_scalen_changepoints as main hyperparameters and tweaked them. I did a bit work to parallelizing the fitting process for all the station-measure combos to speed up the fitting process, other than that, it is pretty much out of the box.

I tried to use holiday indicator or tweaking other parameters of the model and they all degrade the performance of the model.

3. neural network

My neural network is a simple feed-forward network with a single shortcut, shamelessly copied the structure from a senior colleague's Kaggle solution with tweaked hidden layer sizes.
The model looks like this:

The input to the neural network are concatenated (1) raw history readings, (2) median summary values from different window_sizes, and (3) indicator variables for the city, type of measure.

The output layer in the network is a dense layer with 48 units, each corresponding to an hourly reading in the next 48 hours.

The model is trained directly using smape as loss function with Adam optimizer. I tried standardizing inputs into zero mean and unit variance, but it will cause a problem when used together with smape loss, thus I tried switching to a clipped version MAE loss, which produced similar results compared to raw input with smape loss.

The model can be trained on CPU only machine in very short time.

I tried out some CNN, RNN models but couldn't get them working better than this simple model, and had to abandon them.

Training and validation setup

This is pretty tricky, and I am still not quite sure if I have done it correctly or not.

For approach 1 and 2

I tried to generate predictions for a few historical months, calculating daily smape scores locally. Then sample 25 days out to calculate a mean smape score. Do this sample-scoring a large number of times and take mean as local validation score. I used this score to select parameters.

For neural network

I split the history data into (X, y) pairs based on a splitting day, and then move the splitting day backward by 1 day to generate another (X, y) pair. Do this 60 times and vertically concatenate them to form my training data.

I used groupedCV split on the concatenated dataset to do cross-validation so that measures from one station don't end up in both training and validation set. During training, the batch size is specified so that data in the batch all based on the same splitting day. I did this trying to preventing information leaking.

I got average smape scores 0.40~44 for Beijing and 0.30-0.34 for London in my local validation setting. Which I think is pretty aligned with how it averages out through May.


Without utilizing any other weather information or integrating any sort of forecasts, all my models failed miserably for events like the sudden peak on May 27th in Beijing.

Kaggle Digit Recognizer Revisited (Using Convolutional NN with Keras)

Almost a year ago, I revisited the Kaggle version of the Hand Written Digit Recognition problem, the link to that post is here. At that time, my go to language is R, since the majority of friends around me use R as well. This summer, I evidently switched back to use python as my primary language to do almost everything, because it is just so efficient.

So, here is a convolutional neural network using Keras to tackle this problem again, in less than 100 lines of code you can get a convolutional neural network and obtain 99% accuracy on the Kaggle leaderboard.

A quick note about training time, it took close to 9 minutes to be trained on my laptop with GeForce GTX 970M chip. You can increase the number of epochs and run it by yourself, it should be able to lead to better results.


A Logistic Regression Benchmark for Red Hat Customer Business Value Prediction Problem

Red Hat put out a competition on Kaggle asking people to build models to predict customer potential. It is a simple binary classification problem and the metric to this problem that Red Hat wanted to determine which model rank best is the AUC score.

I am sort of late in participating in this competition, and there are only 7 days to go. I sketched a rather simple logistic regression model, and it ranks somewhere in the middle among 2,200 teams in total. Kind of surprised to see that a simple logistic regression can beat half of the participants.

My model uses all the features and I find out the penalty strength parameter C should take on value 10.

Below is my code:

A quick post on Decision Trees

Today is my one year anniversary studying at Bentley University in MA, US. And this short post on this special day is devoted to Decision Trees, a simple but often misused method. During my short time here, and limited times talking to marketing analysts, I saw some cases when they just take the rules directly from decision trees and throw them in decks and say that they will show that to executives as "insights". I think they should be a bit more careful since the decision tree can be rather unstable, i.e. small changes in the dataset can result in completely different trees.

Below are some python scripts, which i used to build classification trees on the iris data. But each time, I do some sort of random sampling and feed the sample to the algorithm to build trees. You can look at the difference  yourself.

I will just upload pictures of a few of these trees.

iris_dt_entropy_70_percent_train_seed_36 iris_dt_entropy_70_percent_train_seed_1106

The two trees above are built on different random samples of the iris dataset. From the first one you can get the rule: if petal length is less than or equal to 2.45 cm then the flower is a setosa. Wheras from the second decision tree you get the rule: if petal width is less than or equal to 0.8 cm then the flower is a setosa. Take a quick look at the scatter plot you will see that they are both pretty good rules, and if you only build one decision tree, you will miss one of the very good rules.


Below are two trees built on different samples of the iris dataset again. And I personly would not try to interpret the splitting rules from the third level in the tree and on (take root node as the first level).

iris_dt_entropy_num_per_class_15_45_5 iris_dt_entropy_num_per_class_15_45_10

Anyway, the takeaway is that: try different algorithms, try different samples as input, build many trees and take what's common from them, rather than simply build one tree.

D3: Visualizing Titanic Survivors by Gender, Age and Class


Titanic: Machine Learning from Disaster is the 101 type of machine learning competition hosted on Kaggle since it started. The task is to predict who would survive the disaster given information on individual's age, gender, socio-economic status(class) and various other features.

Recently, during the winter break, I have started learning the JavaScript library D3.js. The above graph is a screenshot of my first visualization project I created with D3.js. The address to the live version of the project is here.

How to read the graph:

  • Each rectangle in the graph represents a passenger on Titanic, color yellow means that the passenger survived the disaster and the color blue indicates that he does not.
  • There can be multiple people with the same age, gender, and class values, so I set the opacity of these rectangles to be 20%. So the place on the graph where you can see solid yellow shows that those passengers have a higher chance of surviving, whereas solid blue indicates danger.

Based on this visualization we can see that:

  1. females (young or old, except around age 25) and young males(under age 15) from middle and upper class tend to survive.
  2. the overall survivor rate for female passengers is higher than male passengers.

So, without all the drama shown in the classic movie, this visualization basically predicts that Jack will most likely not able to make it, but Rose will survive...


Update: Jan 14

I made a couple of changes to the visualization during last few days. Now a newer version is available here.