# Repost of my KDD Cup 2018 summary

I finished in 30th place at this year's KDD Cup. I still remember back in 2015, when I was very rusty with coding and attempted that year's KDD Cup on my potato laptop, a Lenovo U310. I did not know what I was doing; all I did was throw data into XGBoost, and my performance then was a joke. I have seen myself become more and more capable of coming up with ideas and implementing them in the years since. Below is a repost of my summary of KDD 2018.

Hooray~! Fellow KDD competitors. I entered this competition on day 1 and very quickly established a reasonable baseline. Due to some things on the personal side, I practically stopped improving my solution at the beginning of May. Even though my methods did not work that well compared to many top players in phase 2, I think my solution may be worth sharing because of its relative simplicity. I did not touch the meo data at all, and one of my models is just calculating medians.

### Alternative data source

For the new hourly air quality data, as shared in the forum, I used this for London and this for Beijing instead of the API from the organizer.

### Handling missing data

I filled missing values in the air quality data in 3 steps:

1. Fill missing values for a station-measure combo based on the values from other stations (a sketch follows below this list).
To be specific, I trained 131 LightGBM regressors for this. If the PM2.5 reading at 2:00 on May 20th is missing for Beijing's aotizhongxin station, the regressor aotizhongxin_aq-PM2.5 predicts this value based on the known PM2.5 readings at 2:00 on May 20th from the 34 other stations in Beijing.
I used thresholds to decide whether to do this imputation or not: if more than the threshold number of other stations are also missing a reading, skip this step.
2. Fill the remaining missing values by looking forward and backward in time to find known values.
3. Finally, replace any values still missing with the overall mean.
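
To make step 1 concrete, here is a minimal sketch of one such regressor, assuming an hourly DataFrame with one column of readings per station for a single measure (the frame, column names, and threshold value are hypothetical):

```python
import lightgbm as lgb

def impute_station(aq, target, min_known_stations=25):
    """aq: hourly DataFrame with one column per station for a single measure
    (e.g. PM2.5); fill missing values of `target` from the other stations."""
    others = [c for c in aq.columns if c != target]
    known = aq[target].notna()

    # LightGBM handles NaNs in the feature columns natively
    model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
    model.fit(aq.loc[known, others], aq.loc[known, target])

    # impute only where the target is missing and enough other stations are known
    rows = ~known & (aq[others].notna().sum(axis=1) >= min_known_stations)
    aq.loc[rows, target] = model.predict(aq.loc[rows, others])
    return aq
```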

### Approaches

#### 1. median of medians

This is a simple model that worked reasonably well in this Kaggle competition.

To predict the PM2.5 reading at 2:00 on May 20th for aotizhongxin, look back over a window of past days and calculate the median of the 2:00 PM2.5 readings from aotizhongxin within that window. You do this median calculation for a bunch of different window sizes to obtain a bunch of medians. The median of those medians is used as the prediction.

Intuitively, this is just an aggregated "yesterday once more". The more large windows there are in the collection, the better the model memorizes the long-term trend; the more small windows you add, the quicker the model responds to recent events.
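
A minimal sketch of this median-of-medians rule (the window sizes here are placeholders, not the ones I actually used):

```python
import numpy as np

def median_of_medians(past_readings, windows=(3, 7, 14, 28, 56)):
    """past_readings: readings at the same hour on previous days, most recent last."""
    arr = np.asarray(past_readings, dtype=float)
    window_medians = [np.nanmedian(arr[-w:]) for w in windows]
    return float(np.nanmedian(window_medians))

# e.g. predict 2:00 PM2.5 for aotizhongxin from the 2:00 readings of past days
# prediction = median_of_medians(past_2am_pm25)
```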

#### 2. prophet

This is practically even simpler than the median of medians. I treated the number of days of history thrown at the model and the parameters changepoint_prior_scale and n_changepoints as the main hyperparameters and tweaked them. I did a bit of work to parallelize the fitting across all the station-measure combos to speed it up; other than that, it is pretty much out of the box.

I tried using a holiday indicator and tweaking other parameters of the model, but they all degraded performance.
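
Assuming approach 2 refers to Facebook Prophet (the changepoint_prior_scale and n_changepoints parameters above are Prophet's), a per-series fit looks roughly like this sketch; the hyperparameter values shown are placeholders:

```python
from fbprophet import Prophet  # the package is now published as `prophet`

# df: one station-measure series with columns ds (hourly timestamp) and y (reading)
m = Prophet(changepoint_prior_scale=0.05, n_changepoints=25)
m.fit(df)

future = m.make_future_dataframe(periods=48, freq="H")
forecast = m.predict(future)
prediction = forecast["yhat"].tail(48)   # the next 48 hourly predictions
```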

#### 3. neural network

My neural network is a simple feed-forward network with a single shortcut; I shamelessly copied the structure from a senior colleague's Kaggle solution and tweaked the hidden layer sizes.
The model looks like this:

The input to the neural network is the concatenation of (1) raw history readings, (2) median summary values from different window sizes, and (3) indicator variables for the city and type of measure.

The output layer in the network is a dense layer with 48 units, each corresponding to an hourly reading in the next 48 hours.

The model is trained directly with SMAPE as the loss function, using the Adam optimizer. I tried standardizing the inputs to zero mean and unit variance, but that causes a problem when used together with the SMAPE loss (the SMAPE denominator can get close to zero for standardized values), so I tried switching to a clipped version of the MAE loss, which produced results similar to raw inputs with the SMAPE loss.
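
A minimal Keras-style sketch of such a network; the layer sizes and input width are made up here, and the SMAPE loss gets a small epsilon so the denominator never hits zero:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def smape_loss(y_true, y_pred):
    denom = tf.abs(y_true) + tf.abs(y_pred) + 1e-6   # epsilon avoids division by zero
    return tf.reduce_mean(2.0 * tf.abs(y_pred - y_true) / denom)

inputs = layers.Input(shape=(256,))            # concatenated history, medians, indicators
h = layers.Dense(256, activation="relu")(inputs)
h = layers.Dense(256, activation="relu")(h)
h = layers.concatenate([h, inputs])            # the single shortcut connection
outputs = layers.Dense(48)(h)                  # one unit per hour of the next 48 hours

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss=smape_loss)
```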

The model can be trained on a CPU-only machine in a very short time.

I tried some CNN and RNN models but couldn't get them to work better than this simple model, so I had to abandon them.

### Training and validation setup

This is pretty tricky, and I am still not quite sure if I have done it correctly or not.

#### For approaches 1 and 2

I generated predictions for a few historical months and calculated daily SMAPE scores locally. Then I sampled 25 days and calculated the mean SMAPE score over them. I repeated this sampling and scoring a large number of times and took the mean as the local validation score, which I used to select parameters.
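
Roughly, the local scoring loop is just repeated sampling and averaging; a sketch, where daily_smape holds the per-day scores computed beforehand:

```python
import numpy as np

def local_validation_score(daily_smape, n_rounds=1000, days_per_round=25, seed=0):
    """daily_smape: array of daily SMAPE scores over a few historical months."""
    rng = np.random.default_rng(seed)
    scores = [rng.choice(daily_smape, size=days_per_round, replace=False).mean()
              for _ in range(n_rounds)]
    return float(np.mean(scores))
```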

#### For neural network

I split the historical data into (X, y) pairs based on a splitting day, then moved the splitting day backward by 1 day to generate another (X, y) pair. I did this 60 times and vertically concatenated the pairs to form my training data.

I used a grouped CV split on the concatenated dataset for cross-validation, so that measures from one station do not end up in both the training and validation sets. During training, the batch size is specified so that all the data in a batch are based on the same splitting day. I did this to try to prevent information leakage.
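
The grouped split itself is straightforward with scikit-learn's GroupKFold; a sketch with placeholder arrays, grouping rows by station:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# X, y: the concatenated (X, y) pairs; station: which station each row comes from
# (random placeholders here, just to make the sketch runnable)
X, y = np.random.rand(1000, 64), np.random.rand(1000, 48)
station = np.random.randint(0, 48, size=1000)

gkf = GroupKFold(n_splits=5)
for train_idx, valid_idx in gkf.split(X, y, groups=station):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_va, y_va = X[valid_idx], y[valid_idx]
    # fit the network on (X_tr, y_tr), evaluate SMAPE on (X_va, y_va)
```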

I got average SMAPE scores of 0.40-0.44 for Beijing and 0.30-0.34 for London in my local validation setting, which I think is pretty well aligned with how the scores averaged out through May.

### Closing

Without utilizing any other weather information or integrating any sort of forecasts, all my models failed miserably on events like the sudden peak on May 27th in Beijing.

# A quick post on Decision Trees

Today is my one-year anniversary of studying at Bentley University in MA, US. This short post on this special day is devoted to decision trees, a simple but often misused method. During my short time here, and in my limited conversations with marketing analysts, I have seen cases where they take rules directly from decision trees, throw them into decks, and present them to executives as "insights". I think they should be a bit more careful, since decision trees can be rather unstable: small changes in the dataset can result in completely different trees.

Below are some Python scripts I used to build classification trees on the iris data. Each time, I do some random sampling and feed the sample to the algorithm to build a tree. You can look at the differences yourself.
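
The original scripts are not reproduced here; a minimal scikit-learn sketch of the same idea (subsample the data, fit a tree, print its rules) would look like this:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
rng = np.random.default_rng(42)

for i in range(3):
    # draw a random subsample and fit a tree on it
    idx = rng.choice(len(iris.data), size=100, replace=False)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(iris.data[idx], iris.target[idx])
    print(f"--- tree {i} ---")
    print(export_text(tree, feature_names=iris.feature_names))
```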

I will just upload pictures of a few of these trees.

The two trees above are built on different random samples of the iris dataset. From the first one you can extract the rule: if petal length is less than or equal to 2.45 cm, then the flower is a setosa. Whereas from the second decision tree you get the rule: if petal width is less than or equal to 0.8 cm, then the flower is a setosa. Take a quick look at the scatter plot and you will see that both are pretty good rules; if you only build one decision tree, you will miss one of them.

Below are two more trees built on different samples of the iris dataset. I personally would not try to interpret the splitting rules from the third level of the tree onward (taking the root node as the first level).

Anyway, the takeaway is this: try different algorithms, try different samples as input, build many trees and take what is common among them, rather than simply building one tree.

# Kaggle Digit Recognizer Revisited (Using R)

Digit Recognizer is a competition that has been hosted on Kaggle for years (almost three and a half years so far?). The data for this competition were taken from the famous MNIST dataset, which has been studied extensively within the machine learning community (here is more information on that).

I tried the competition about a year ago with a very naive method (basically a wrong way of performing PCA, plus a Random Forest with hand-picked hyperparameter values). Recently, during the Thanksgiving break, I found some time and revisited the competition to see if I could improve on my past results.

The training and testing datasets are both .csv files. They are not big, but using the fread() function from the data.table package can be a bit faster.

## Look at the dimensions of the data set

The number of observations in the training set is n = 42000 and the dimension of the feature space is p = 784.

Looking at these numbers, I thought the dimension of the feature space was somewhat high and that we might want to perform PCA on it.

Note that in the textbook, high-dimensional problems refer to situations where p is greater than n, so this is technically not the case here.

Each observation represents a 28 by 28 pixel grayscale image, and each cell value is an integer between 0 and 255. Our task is to build a model to predict which digit (0 to 9) each image represents.

The labels are pretty balanced in the training set.

## Visualise the images in the dataset

We need to reshape the observations (each is a single row with 784 values) back into a 28 by 28 matrix in order to display the images.

We can do so by writing a simple function to perform that transformation.
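
The post's function was written in R; an equivalent sketch of the transformation in Python looks like this:

```python
import numpy as np
import matplotlib.pyplot as plt

def show_digit(row):
    """row: a length-784 vector of pixel values; display it as a 28x28 image."""
    img = np.asarray(row, dtype=float).reshape(28, 28)
    plt.imshow(img, cmap="gray")
    plt.axis("off")
    plt.show()
```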

Here is another observation:

## Try a K-NN model first

Before going straight to PCA, I tried a K-NN model on the dataset first.

The K-NN model simply looks at the K points nearest to the one we want to predict and, in some sense, takes an average of those K nearest neighbors' values as the predicted value.

The model is often used to demonstrate the problem called the curse of dimensionality: when the dimensionality of the feature space is high, the feature space becomes very sparse and, as a result, the distances between any pair of data points become very large. So for any given point, the region (which ideally should be a small local region) that tightly bounds its K nearest points becomes very large, and this is a problem for the K-NN model.

OK, I didn't think about all this complicated stuff last year when I did it. I built the K-NN model simply because it is the easiest one to code up, I was lazy, and I always want to see how a simple model performs.

It took two and a half hours (the runtime was recalculated on my newer laptop this year, as were all the runtimes in this post) to build the K-NN model on the full feature space with k = 50. Actually the result is not bad: this model gives a leaderboard score of 0.94743. But that is a long time to build a single model. If we were to use K-fold cross-validation to choose the optimal k value, it would take days if not weeks for even a handful of k values.

## PCA the wrong way

Last year, when I performed PCA on the dataset, I actually merged the training features and testing features into a single dataset, computed the PCA, and projected the merged dataset onto the first 100 PC directions.

The percentage of variance explained by the first 100 principal components is about 91.5%, so I transformed the merged dataset down to 100 dimensions.

I now know this is actually cheating, because I was implicitly using information from the testing set in the model building process. I should have used only the training set to learn the principal component directions and then transformed both the training and testing data using those directions.

But anyway, the result showed that we can use PCA to greatly reduce the dimension of the features without losing much information.

Oh, I did make a plot to visualise the whole dataset using the first two principal components.

## A Random Forest fitted to the leaderboard

When building the Random Forest model last year, I basically used trial and error to make the leaderboard happy... that is, manually changing the values of mtry and ntree and seeing how the Kaggle leaderboard score would change.

The score was 0.9510, which is not bad and an improvement over the K-NN model.

And this is how far I got last year (March 2014, to be specific).

## The right way to perform PCA

The right way to perform PCA on this problem is to learn the principal components using only the training set, and then project both the training set and the testing set onto the first however many principal component directions.
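
In scikit-learn terms (the post itself did this in R), the right way is a sketch like this, with X_train and X_test assumed to be the pixel matrices:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=100)
pca.fit(X_train)                     # learn the directions on the training set only

X_train_pc = pca.transform(X_train)  # project both sets with the learned directions
X_test_pc = pca.transform(X_test)
```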

Things become more complicated when we do K-fold cross-validation, since we should perform PCA only on the K-1 training folds to learn the directions, transform all K folds onto those leading PC directions, build the model on the transformed K-1 folds, and finally obtain the estimated out-of-sample error rate for that run. So the runtime for K-fold cross-validation is now even worse, since we need to redo the PCA in each iteration.

It may not be that bad if training a Random Forest is not too slow, but how can we know how long it will take? We can analyze the runtime.

## Understand the runtime of Random Forest

This Thanksgiving break, I wanted to know how long it would take to build a model on the full training set with a large number of trees in the Random Forest.

I did this by running some experiments over a grid of values for the number of training observations and the number of trees in the Random Forest model, trying to understand the time complexity of building a Random Forest with the default mtry value.

I plotted two sets of plots.

Based on these two plots, if we fix the number of training observations (mtry and the depth of the trees are of course also fixed, since they are left at their default values), then the training time is linear in the number of trees in the forest. The same linear relationship holds between training time and the number of observations used in training when the number of trees is fixed.

I guess the runtime for the Random Forest model is O(n * ntree) if mtry and the depth of the trees are fixed.

Actually, I later found a post on Quora which states that the running time is O(n * ntree * d * mtry).

So I built a regression model using the interaction term ntree * ntrain as the independent variable to predict the training time, and extrapolated heavily... (which is bad for regression analysis in general, but might be OK for the purpose of runtime estimation).
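
A sketch of that regression, with illustrative placeholder measurements rather than my actual timings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# illustrative placeholders, not the actual measurements from my experiments
ntrain = np.array([5000, 10000, 20000, 40000])
ntree = np.array([50, 100, 150, 200])
seconds = np.array([10.0, 40.0, 120.0, 320.0])

# regress training time on the interaction term ntree * ntrain
X = (ntrain * ntree).reshape(-1, 1)
reg = LinearRegression().fit(X, seconds)

# heavy extrapolation: 5000 trees on the full 42000-row training set
print(reg.predict(np.array([[42000 * 5000]])))
```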

The model predicts that if I want to build a Random Forest with 5000 trees on the full set of features, it will take roughly two and a half hours. That is about as slow as building the K-NN model, so tuning the parameters using K-fold cross-validation may be too slow.

## Feature Engineering

This year I also tried to create some features to help the models. I made them based on my intuition about what a human, who does not already know how to distinguish digits, would find useful for this task if they were trained by being shown examples of digits.

firstHorizontalPix and lastHorizontalPix capture the row number where the first or last non-background pixel appears in the image. Similarly, firstVerticalPix and lastVerticalPix capture the column number where the first or last non-background pixel appears. After plotting many random digits from the dataset, I got the intuition that some digits tend to be shorter than digit 1 in this dataset, and, of course, digit 7 tends to be fatter than digit 1. So I also counted the number of non-empty rows with numberHorizontalLines and the number of non-empty columns with numberVerticalLines.

densityUpperLeft, densityUpperRight, densityBottomLeft, and densityBottomRight are created to capture how dense the pixel values are in each quadrant of the image, and densityUpper, densityBottom, densityLeft, and densityRight are just partial sums of them. My intuition is that for digits like 8, 0, 1, and 3, the pixel density in the upper area is going to be close to that of the bottom part, but it will differ for digits like 7, 6, 2, 5, 4, and 9. The same kind of logic is behind the other sets of features I created.

symmetryHorizontal and symmetryVertical measure the degree of symmetry in the image.

horizontalDiffs and verticalDiffs describe how tall and how fat each digit is.
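
A sketch of a few of these features computed from a 28 by 28 image array; the feature names are from the post, but the exact definitions here are my assumptions:

```python
import numpy as np

def digit_features(img, background=0):
    """img: 28x28 array of pixel values with at least one non-background pixel."""
    ink = img > background
    rows = np.where(ink.any(axis=1))[0]   # rows containing any ink
    cols = np.where(ink.any(axis=0))[0]   # columns containing any ink

    return {
        "firstHorizontalPix": int(rows[0]), "lastHorizontalPix": int(rows[-1]),
        "firstVerticalPix": int(cols[0]),   "lastVerticalPix": int(cols[-1]),
        "numberHorizontalLines": int(ink.any(axis=1).sum()),
        "numberVerticalLines": int(ink.any(axis=0).sum()),
        "densityUpperLeft": float(img[:14, :14].sum()),
        "densityUpperRight": float(img[:14, 14:].sum()),
        "densityBottomLeft": float(img[14:, :14].sum()),
        "densityBottomRight": float(img[14:, 14:].sum()),
        "symmetryVertical": float(np.abs(img - np.fliplr(img)).sum()),
        "symmetryHorizontal": float(np.abs(img - np.flipud(img)).sum()),
        "horizontalDiffs": int(rows[-1] - rows[0]),   # how tall the digit is
        "verticalDiffs": int(cols[-1] - cols[0]),     # how fat the digit is
    }
```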

## Feature Importance

I got the feature importances from the Random Forest, and here is a plot of them.

It seems that the features I created are somewhat important. To see this more clearly, here is the list of the top 20 features.

## Cross-Validation with PCA

Even though we are down to only a little more than 100 features, tuning mtry and ntree (or even tree depth) with K-fold cross-validation, with PCA performed in each iteration, is still time-consuming. I coded it up, but never actually ran it.

Here is the code for choosing the mtry value for a fixed number of trees using such a method.
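
The original R code is not shown here; a scikit-learn sketch of the same procedure, putting PCA inside the cross-validated pipeline so it is re-fit on each set of K-1 folds, might look like this (max_features plays the role of mtry, and X_train, y_train are assumed to be loaded already):

```python
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("pca", PCA(n_components=100)),
    ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
])

grid = GridSearchCV(pipe, {"rf__max_features": [4, 8, 16, 32]}, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)
```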

## Choosing mtry with OOB Error estimate

What I actually did was look at the OOB error estimates.

When each decision tree in the Random Forest is built, the training data for that tree is a bootstrap sample of the original dataset; on average it contains about 2/3 of the unique observations from the original training set. So we can estimate the out-of-sample error rate with the remaining roughly 1/3 of out-of-bag data. This is a valid way of measuring the out-of-sample error rate, and it is much faster for a Random Forest than K-fold cross-validation.

I searched a grid of mtry and ntree values using the OOB method, and generated a plot.
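
In scikit-learn terms (the post used R's randomForest), the OOB search is a sketch like this, again assuming X_train and y_train are already loaded:

```python
from sklearn.ensemble import RandomForestClassifier

oob_error = {}
for mtry in [2, 4, 8, 16]:
    for ntree in [100, 250, 500, 1000]:
        rf = RandomForestClassifier(n_estimators=ntree, max_features=mtry,
                                    oob_score=True, n_jobs=-1, random_state=0)
        rf.fit(X_train, y_train)
        oob_error[(mtry, ntree)] = 1.0 - rf.oob_score_
```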

It seems that for higher mtry values, the ntree value required for the OOB error to start converging is lower. It looks to me like picking mtry = 8 and ntree = 1000 is a reasonable choice.

I am not sure whether the one-standard-error rule applies here or not, but after the calculation I found that for mtry = 8, the minimum number of trees needed so that the OOB error is no more than one standard error above the minimum value is 500.

So, after a much longer search, I arrived at the combination mtry = 8 and ntree = 500, which is almost exactly the same as the values I hand-picked last year...

The hyperparameter values are almost identical; however, I made some new features that I didn't have last year. Would these promising-looking features really help make the leaderboard happier?

Not really... the leaderboard score drops to 0.94329.

## Closing

It may seem that I did a lot of things in vain :< , but I think they were worth trying. No one can be lucky with every model they build.

# Support Vector Machine (SVM) Part I

1. Let $w$ be the normal vector of the separating hyperplane.
2. The classifier's decision for each data point is: $h(x)=\text{sign}(w^{T}x+b)$
3. Perfectly completing the classification task is then equivalent to requiring: $y_{n}(w^{T}x_{n}+b)>0$
4. The distance from each data point to the separating hyperplane is: $\frac{1}{||w||}y_{n}(w^{T}x_{n}+b)$

$\underset{b,w}{\max}\qquad \text{margin}(b,w)$
$\text{subject to}\qquad y_{n}(w^{T}x_{n}+b)>0$
$\text{where}\qquad \text{margin}(b,w) = \underset{n=1,\dots,N}{\min}\frac{1}{||w||}y_{n}(w^{T}x_{n}+b)$

$\underset{b,w}{\max}\qquad \frac{1}{||w||}$
$\text{subject to}\qquad \underset{n=1,\dots,N}{\min}\, y_{n}(w^{T}x_{n}+b)=1$

$\underset{b,w}{\min}\qquad \frac{1}{2}w^{T}w$
$\text{subject to}\qquad y_{n}(w^{T}x_{n}+b)\geq 1$

$x_1=(0,0),\ y_1=-1$
$x_2=(2,2),\ y_2=-1$
$x_3=(2,0),\ y_3=1$
$x_4=(3,0),\ y_4=1$

$\underset{u}{\min}\qquad \frac{1}{2}u^{T}Qu+p^{T}u$
$\text{subject to}\qquad a_{m}^{T}u\geq c_{m}$
$\qquad \text{for } m = 1,2,\dots,M$

$u = \begin{bmatrix} b \\ w \end{bmatrix}$
$Q = \begin{bmatrix} 0 & 0^{T}_{d} \\ 0_{d} & I_{d} \end{bmatrix}$
$p = 0_{d+1}$

$a^{T}_{n} = y_{n}[1\quad x^{T}_{n}]$
$c_{n} = 1$
$M=N$

$\underset{\alpha}{\min}\qquad \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_{n}\alpha_{m}y_{n}y_{m}z_{n}^{T}z_{m}-\sum_{n=1}^{N}\alpha_{n}$

$\text{subject to}\quad \sum_{n=1}^{N}y_{n}\alpha_{n} = 0;$

$\qquad \alpha_{n}\geq 0,\quad \text{for } n=1,2,3,\dots,N$

$\underset{\alpha}{\min}\qquad \frac{1}{2}\alpha^{T}Q_{D}\alpha-1^{T}\alpha$

$\text{subject to}\quad y^{T}\alpha = 0;$

$\qquad \alpha_{n}\geq 0\quad \text{for } n=1,2,3,\dots,N$

In the matrix $Q_D$: $q_{n,m}=y_{n}y_{m}z_{n}^{T}z_{m}$

Code example in CVX:
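
The CVX (MATLAB) snippet is not reproduced here; an equivalent sketch in Python with CVXPY, solving the primal hard-margin QP above on the four example points:

```python
import numpy as np
import cvxpy as cp

X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., 1., 1.])

w = cp.Variable(2)
b = cp.Variable()

# min 1/2 w^T w  subject to  y_n (w^T x_n + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```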

$w=\sum_{SV}\alpha_{n}y_{n}z_{n}$

$b=y_{n}-w^{T}z_{n},\quad \text{with any SV } (z_{n},y_{n})$

In Python, the SVM can be solved with scikit-learn's SVM module as follows, with the data and labels X, y stored as numpy.ndarray:
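
The original snippet is not shown; a minimal version might be (a large C approximates the hard margin):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # w and b
print(clf.support_vectors_)        # the support vectors
```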

# Learning From Data Lecture 8: The Bias-Variance Tradeoff

### Segment 1: Bias and Variance

$E_{in}$ is an approximation of $E_{out}$; it is the in-sample error of the hypothesis $h$ fitted to the training data.

1. How well the hypothesis set $H$ can approximate the target function $f$
Ignoring the sample: assuming $f$ and $H$ are known, analyze the gap between $f$ and the $h$ in $H$ closest to $f$.
2. How likely we are to find that $h$ within the hypothesis set
Assuming the $h$ in $H$ closest to $f$ is known, how likely is it, given the available data, that we select this $h$?

• $E_{out}(g^{(D)}) = \mathbb{E}_{x}[(g^{(D)}(x)-f(x))^{2}]$

• $\mathbb{E}_{D}[E_{out}(g^{(D)})] = \mathbb{E}_{D}[\mathbb{E}_{x}[(g^{(D)}(x)-f(x))^{2}]]$

• $\mathbb{E}_{D}[E_{out}(g^{(D)})] = \mathbb{E}_{x}[\mathbb{E}_{D}[(g^{(D)}(x)-f(x))^{2}]]$

$\mathbb{E}_{D}[(g^{(D)}(x)-f(x))^{2}]$

$\bar{g}(x)=\mathbb{E}_{D}[g^{(D)}(x)]$

In $g^{(D)}(x)$, $x$ is a given data point and $g^{(D)}$ is the final hypothesis selected for a particular training set $D$, so $g^{(D)}(x)$ is the value of that selected hypothesis at the given point $x$. $\bar{g}(x)=\mathbb{E}_{D}[g^{(D)}(x)]$ is the average of these values over training sets.

$\mathbb{E}_{D}[(g^{(D)}(x)-f(x))^{2}]=\mathbb{E}_{D}[(g^{(D)}(x)-\bar{g}(x)+\bar{g}(x)-f(x))^{2}]$

$=\mathbb{E}_{D}[(g^{(D)}(x)-\bar{g}(x))^{2}+(\bar{g}(x)-f(x))^{2}+2(g^{(D)}(x)-\bar{g}(x))(\bar{g}(x)-f(x))]$

$\bar{g}(x)-f(x)$ is a constant with respect to $D$, so the key to the cross term is $\mathbb{E}_{D}[g^{(D)}(x)-\bar{g}(x)]$.

• $\mathbb{E}_{D}[g^{(D)}(x)-\bar{g}(x)]=\mathbb{E}_{D}[g^{(D)}(x)]-\mathbb{E}_{D}[\bar{g}(x)]$, where $\mathbb{E}_{D}[g^{(D)}(x)]=\bar{g}(x)$ and, since $\bar{g}(x)$ is a constant, $\mathbb{E}_{D}[\bar{g}(x)]=\bar{g}(x)$; the two terms cancel to 0.

• $\mathbb{E}_{D}[(g^{(D)}(x)-f(x))^{2}]=\mathbb{E}_{D}[(g^{(D)}(x)-\bar{g}(x))^{2}]+(\bar{g}(x)-f(x))^{2}$

$\mathbb{E}_{D}[(g^{(D)}(x)-f(x))^{2}]$ tells us the gap between the target function $f$ and the final hypothesis we select from the hypothesis set $H$ given a training set $D$:

1. $\mathbb{E}_{D}[(g^{(D)}(x)-\bar{g}(x))^{2}]$ tells us the gap between the final hypothesis we select from $H$ and the best possible average hypothesis $\bar{g}$.
2. $(\bar{g}(x)-f(x))^{2}$ is the gap between this best possible average hypothesis $\bar{g}$ and the target function $f$.

1. The variance, $\text{var}(x)$, is $\mathbb{E}_{D}[(g^{(D)}(x)-\bar{g}(x))^{2}]$: the distance between the final hypothesis selected from the training set $D$ and the average hypothesis $\bar{g}$. It reveals the effect of the particular training set $D$.
2. The bias, $\text{bias}(x)$, is $(\bar{g}(x)-f(x))^{2}$: the gap, at a given data point, between the best the hypothesis set can do and the target function. It reveals the effect of the hypothesis set itself.

$\mathbb{E}_{D}[E_{out}(g^{(D)})] = \mathbb{E}_{x}[\mathbb{E}_{D}[(g^{(D)}(x)-f(x))^{2}]] = \mathbb{E}_{x}[\text{bias}(x)+\text{var}(x)]$

• Bias: $\mathbb{E}_{x}[\text{bias}(x)]=\mathbb{E}_{x}[(\bar{g}(x)-f(x))^{2}]$
• Variance: $\mathbb{E}_{x}[\text{var}(x)]=\mathbb{E}_{x}[\mathbb{E}_{D}[(g^{(D)}(x)-\bar{g}(x))^{2}]]$

$H_0$ is the set of constant hypotheses, $h(x)=b$.

$H_1$ is the set of linear functions, $h(x)=ax+b$.

$\bar{g}(x)$ is the output we would like to select as the final hypothesis, but we cannot necessarily select it; that depends on the given dataset $D$.


Compared with $H_1$, $H_0$ has smaller variance but larger bias; below are the specific values for both.
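
As a concrete illustration (my own sketch; the sinusoid target and two-point datasets are assumptions in the style of the lecture's experiment), the bias and variance of $H_1$ can be estimated by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)       # assumed target function
xs = np.linspace(-1, 1, 200)          # grid of x values for averaging over x

# fit H1 (a line through the two sample points) on many 2-point datasets D
slopes, intercepts = [], []
for _ in range(10000):
    x = rng.uniform(-1, 1, size=2)
    y = f(x)
    a = (y[1] - y[0]) / (x[1] - x[0])
    b = y[0] - a * x[0]
    slopes.append(a)
    intercepts.append(b)
slopes, intercepts = np.array(slopes), np.array(intercepts)

g_D = np.outer(slopes, xs) + intercepts[:, None]   # each row: one g^(D) on the grid
g_bar = g_D.mean(axis=0)                           # the average hypothesis

bias = np.mean((g_bar - f(xs)) ** 2)               # E_x[(g_bar(x) - f(x))^2]
var = np.mean(((g_D - g_bar) ** 2).mean(axis=0))   # E_x[E_D[(g^(D)(x) - g_bar(x))^2]]
print(bias, var)
```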

### Segment 2: Learning Curves

• The blue region represents the in-sample error.
• For a given $N$, draw a vertical line; the distance between its intersections with the two curves is proportional to the generalization error $\Omega$.

• The blue region represents the bias, i.e., the gap between $\bar{g}(x)$ and $f$ (the x-axis).
• The red region represents the variance.

• $\sigma^2$ is the best out-of-sample error that can be achieved.
• The expected in-sample error is $\sigma^2\left(1-\frac{d+1}{N}\right)$.
• The expected out-of-sample error is $\sigma^2\left(1+\frac{d+1}{N}\right)$.
• The expected generalization error is $2\sigma^2\left(\frac{d+1}{N}\right)$.

$d$ is the number of degrees of freedom of the hypothesis set.

http://work.caltech.edu/telecourse.html