Tag Archives: Analytic

Repost of my KDD Cup 2018 summary

I finished in 30th place in this year's KDD Cup. I still remember 2015, when I was very rusty with coding and attempted that year's KDD Cup on my potato laptop, a Lenovo U310. I did not know what I was doing; all I did was throw data into XGBoost, and my performance then was a joke. Over these two years I have seen myself become more and more capable of coming up with ideas and implementing them. Below is a repost of my summary for KDD Cup 2018.

Hooray, fellow KDD competitors! I entered this competition on day 1 and very quickly established a reasonable baseline. Due to some personal matters, I practically stopped improving my solutions at the beginning of May. Even though my methods did not work really well compared to many top players in phase 2, I think my solution may be worth sharing due to its relative simplicity. I did not touch the meo data at all, and one of my models just calculates medians.

Alternative data source

For new hourly air quality data, as shared in the forum, I am using this for London and this for Beijing instead of the API from the organizer.

Handling missing data

I filled missing values in the air quality data in 3 steps:

  1. Fill missing values for a station-measure combo based on the values from other stations.
    To be specific: I trained 131 lightgbm regressors for this. If the PM2.5 reading at 2:00 on May 20th is missing for the Beijing aotizhongxin station, the regressor aotizhongxin_aq-PM2.5 will predict this value based on the known PM2.5 readings at 2:00 on May 20th from the 34 other stations in Beijing.
    I used thresholds to decide whether to do this imputation or not. If more than the threshold number of other stations are also missing a reading, this step is skipped.
  2. Fill the remaining missing values by looking forward and backward to find known values.
  3. Finally, replace all remaining missing values with the overall mean value.
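
A minimal sketch of the threshold check from step 1 and the fallbacks in steps 2 and 3, using only the Python standard library. The function names, the `max_missing` parameter, and the choice to look backward first and forward second are assumptions made for illustration; the trained regressors from step 1 are not reproduced here.

```python
def enough_neighbours(other_station_readings, max_missing):
    """Step 1 gate: only impute from other stations if few of them are missing too."""
    missing = sum(r is None for r in other_station_readings)
    return missing <= max_missing


def fill_forward(values):
    """Replace each None with the most recent known value before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def impute(values, overall_mean):
    """Steps 2 and 3: forward fill, then backward fill, then a constant fallback."""
    filled = fill_forward(values)
    # A backward fill is a forward fill on the reversed series.
    filled = fill_forward(filled[::-1])[::-1]
    # Only a series with no known values at all still has gaps at this point.
    return [overall_mean if v is None else v for v in filled]


readings = [None, 12.0, None, None, 20.0, None]
print(impute(readings, overall_mean=15.0))
```

In this sketch the overall mean only matters for a series that has no known value at all, because the two fill passes cover every other gap.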


1. Median of medians

This is a simple model that worked reasonably well in this Kaggle competition.

To predict the PM2.5 reading at 2:00 on May 20th for aotizhongxin, look back over a window of days of history and calculate the median of the 2:00 PM2.5 readings from aotizhongxin in that window. Repeat this median calculation for a number of different window sizes to obtain a set of medians. The median of those medians is used as the prediction.

Intuitively, this is just an aggregated "yesterday once more". The more large windows there are in the collection, the better the model memorizes the long-term trend. The more small windows you add, the quicker the model responds to recent events.
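
The calculation can be sketched in a few lines of Python. This is an illustration under assumptions: `daily_values` is taken to hold one reading per day for a single station, measure, and hour of day (oldest first), and the window sizes are made up rather than the ones used in the competition.

```python
from statistics import median


def median_of_medians(daily_values, window_sizes):
    """Median of the medians taken over the last w days, for each window size w."""
    medians = [median(daily_values[-w:]) for w in window_sizes]
    return median(medians)


# 2:00 PM2.5 readings for one station over the last 7 days (made-up numbers).
history = [30, 32, 31, 80, 85, 90, 88]
print(median_of_medians(history, window_sizes=(3, 5, 7)))  # prints 85
```

Here the three window medians are 88, 85, and 80, so the short windows pull the prediction toward the recent high readings while the 7-day window holds it back.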

2. Facebook's Prophet

This is practically even simpler than the median of medians. I treated the number of days of history I throw at it and the model parameters changepoint_prior_scale and n_changepoints as the main hyperparameters and tweaked them. I did a bit of work to parallelize the fitting across all the station-measure combos to speed it up; other than that, it is pretty much out of the box.

I tried using a holiday indicator and tweaking other parameters of the model, but they all degraded the performance of the model.

3. Neural network

My neural network is a simple feed-forward network with a single shortcut. I shamelessly copied the structure from a senior colleague's Kaggle solution and tweaked the hidden layer sizes.
The model looks like this:

The input to the neural network is the concatenation of (1) raw history readings, (2) median summary values from different window sizes, and (3) indicator variables for the city and the type of measure.

The output layer in the network is a dense layer with 48 units, each corresponding to an hourly reading in the next 48 hours.

The model is trained directly with SMAPE as the loss function, using the Adam optimizer. I tried standardizing the inputs to zero mean and unit variance, but that causes a problem when used together with the SMAPE loss, so I tried switching to a clipped version of MAE loss, which produced results similar to raw inputs with SMAPE loss.
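
For reference, here is a plain-Python sketch of SMAPE. It assumes the common definition, the mean of 2|f - a| / (|a| + |f|), and treats terms with a zero denominator as 0; the exact formula used by the competition scorer is not restated in this post, so treat both choices as assumptions.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, ranging from 0 to 2."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        if denom > 0:  # skip terms where both values are zero
            total += 2.0 * abs(f - a) / denom
    return total / len(actual)


# Raw readings: a 10-unit miss on a reading of 100 is a small relative error.
print(smape([100.0, 50.0], [110.0, 50.0]))

# Values near zero: a tiny absolute miss already gives the maximum score of 2.
print(smape([0.01], [-0.01]))
```

The second call hints at one plausible reason standardized values and SMAPE do not mix well: once values are centered around zero, the denominator can become very small and the sign can flip, so the loss is dominated by readings close to the mean.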

The model can be trained on a CPU-only machine in a very short time.

I tried out some CNN and RNN models but couldn't get them to work better than this simple model, so I had to abandon them.

Training and validation setup

This is pretty tricky, and I am still not quite sure if I have done it correctly or not.

For approaches 1 and 2

I generated predictions for a few historical months and calculated daily SMAPE scores locally. I then sampled 25 of those days and calculated their mean SMAPE score. I repeated this sample-scoring a large number of times and took the mean as my local validation score. I used this score to select parameters.
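
The sample-scoring loop can be sketched as follows. The number of rounds, the fixed seed, and sampling without replacement are assumptions; the post only fixes the sample size of 25 days.

```python
import random
from statistics import mean


def resampled_score(daily_scores, sample_size=25, n_rounds=1000, seed=0):
    """Average, over many rounds, of the mean score of a random sample of days."""
    rng = random.Random(seed)
    round_means = [
        mean(rng.sample(daily_scores, sample_size))  # one simulated 25-day period
        for _ in range(n_rounds)
    ]
    return mean(round_means)


# Made-up daily SMAPE scores for 60 historical days.
scores = [0.30 + 0.005 * (day % 20) for day in range(60)]
print(round(resampled_score(scores), 4))
```

With many rounds the result approaches the plain mean of all daily scores; the spread of the individual round means is what shows how much a single 25-day score can vary.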

For neural network

I split the history data into (X, y) pairs based on a splitting day, and then move the splitting day backward by 1 day to generate another (X, y) pair. Do this 60 times and vertically concatenate them to form my training data.
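
A sketch of this sliding split for a single station-measure series, under assumptions: the series is hourly and ordered oldest first, the input is a fixed-length slice of raw history (`history_len` is a hypothetical parameter), and the target is the next 48 hours.

```python
def make_pairs(series, n_splits=60, history_len=120, horizon=48):
    """Build (X, y) rows by moving the splitting point back one day at a time."""
    X, y = [], []
    for k in range(n_splits):
        split = len(series) - horizon - 24 * k  # k days before the latest split
        if split < history_len:
            break  # not enough history left for another pair
        X.append(series[split - history_len:split])  # readings before the split
        y.append(series[split:split + horizon])      # the 48 hours after it
    return X, y


hourly = list(range(24 * 10))  # ten days of made-up hourly readings
X, y = make_pairs(hourly, n_splits=3, history_len=72)
print(len(X), X[0][-1], y[0][0])  # prints 3 191 192
```

Rows produced from neighbouring splitting days overlap heavily, which is why the way they are divided between training and validation matters.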

I used a grouped CV split on the concatenated dataset for cross-validation, so that measures from one station don't end up in both the training and the validation set. During training, the batch size is chosen so that all data in a batch is based on the same splitting day. I did this to try to prevent information leakage.
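
The idea of a grouped split can be sketched without any library. This is an illustration only: the round-robin assignment of groups to folds is an assumption, and in practice scikit-learn's `GroupKFold` provides this kind of split.

```python
def grouped_folds(groups, n_folds):
    """Yield (train_idx, valid_idx) pairs where no group appears on both sides."""
    unique = sorted(set(groups))
    for fold in range(n_folds):
        valid_groups = set(unique[fold::n_folds])  # every n_folds-th group
        valid = [i for i, g in enumerate(groups) if g in valid_groups]
        train = [i for i, g in enumerate(groups) if g not in valid_groups]
        yield train, valid


# One group label per row, here the station each row came from (made-up labels).
stations = ["a", "a", "b", "b", "c", "c", "d"]
for train_idx, valid_idx in grouped_folds(stations, n_folds=2):
    print(train_idx, valid_idx)
```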

I got average SMAPE scores of 0.40-0.44 for Beijing and 0.30-0.34 for London in my local validation setting, which I think is pretty well aligned with how it averaged out through May.


Without utilizing any other weather information or integrating any sort of forecasts, all my models failed miserably for events like the sudden peak on May 27th in Beijing.

My Solution to the Analytic Edge Kaggle Competition

The Kaggle competition for MIT's course The Analytic Edge on edX is now over.

Here is my solution to the competition. It ranked 244th on the final leaderboard, which is by no means great but is still within the top 10%.

My approach is to minimize the effort spent on feature engineering by hand, and instead to use a couple of automated methods to extract features from the text in the dataset. For the modeling process, I used the XGBoost package, with cross-validation to pick the number of iterations and avoid overfitting to the training set.

0 House Keeping

Set up working directory

Load Libraries

Function Definition

Function for dummy encoding

1 Data Preparation

Loading Data

Imputing missing categorical data
I used a really simple approach, after inspecting the correlation among them.

Remove data entries that have a NewsDesk value that does not appear in the testing data

Change "" to "Other". This has no effect on modeling; I just don't like to see ""

Log Transform "WordCount"

2 Feature Extraction

QR = Question Mark in the title?

Extract hour and day of week
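
A minimal Python sketch of the question-mark flag and the hour and day-of-week features, using only the standard library. The function name, the dictionary keys other than QR, and the timestamp format are assumptions for illustration.

```python
from datetime import datetime


def basic_features(headline, pub_date, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the question-mark flag plus hour and weekday of publication."""
    ts = datetime.strptime(pub_date, fmt)
    return {
        "QR": int("?" in headline),  # 1 if the title contains a question mark
        "Hour": ts.hour,             # 0 to 23
        "Weekday": ts.weekday(),     # Monday is 0, Sunday is 6
    }


print(basic_features("Is This the End?", "2014-09-01 22:00:54"))
```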

Extract all headline and abstract to form a corpus

Corpus processing

Document ~ TF-IDF matrix and Document ~ TF matrix

Get frequent terms matrix as feature

Clustering

PCA (principal component analysis)

LDA (latent Dirichlet allocation)

Dummy Encoding

3 Model Fitting

Using cross-validation to pick the number of rounds for XGBoost



The Analytic Edge Lecture code in Python Week8 Crime

VIDEO 3 - A Basic Line Plot

Load our data:

Convert the Date variable to time format

Extract the hour and the day of the week:

Let's take a look at the structure of our data again:

Create a simple line plot - need the total number of crimes on each day of the week.

We can get this information by creating a table:

Save this table as a data frame:
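
The counting step can be sketched with the standard library alone, as an alternative to a pandas table. The date format string and the function name are assumptions; adjust `fmt` to match the Date column in the data.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def crimes_per_weekday(date_strings, fmt="%m/%d/%y %H:%M"):
    """Count records per weekday and return (day, count) pairs in calendar order."""
    counts = Counter(
        DAY_NAMES[datetime.strptime(s, fmt).weekday()] for s in date_strings
    )
    return [(day, counts[day]) for day in DAY_NAMES]


sample = ["12/31/12 23:15", "12/31/12 22:00", "1/1/13 1:00"]
print(crimes_per_weekday(sample)[:2])  # [('Monday', 2), ('Tuesday', 1)]
```

Returning the pairs in calendar order keeps the weekdays from being sorted alphabetically when they are plotted.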

Create our plot



Make the "Weekdays" variable an ORDERED categorical

Try again:



Making the x values an ordered categorical data type doesn't help; this is probably due to differences in the ggplot implementation in Python.



Change our x and y labels:



VIDEO 4 - Adding the Hour of the Day

Create a counts table for the weekday and hour:

Save this to a data frame:

Get variables Hour and Weekdays

Create our plot / change the colors

ggplot in python does not support aes(group=Var) yet

Also, the ggplot version I got using pip does not seem to plot the legend correctly.


Redo our plot, this time coloring by Type:


Make the lines a little transparent:


Make a heatmap:

I tried to plot a heatmap using ggplot in Python without success, so I turned to my old friend matplotlib.



Change the color scheme and legend label


VIDEO 5 - Maps

Given up on it...

The Analytic Edge Lecture code in Python Week6 Netflix

Video 6

After following the steps in the video, load the data into R

Add column names

Remove unnecessary variables

Remove duplicates and then take a look at our data again:

There is a drop_duplicates function in pandas. The unique function in pandas is for Series rather than DataFrame.

Video 7

Compute distances

Hierarchical clustering

Plot the dendrogram