Tag Archives: Machine Learning

Repost of my KDD Cup 2018 summary

I finished in 30th place at this year's KDD Cup. I still remember back in 2015, when I was very rusty with coding and attempted that year's KDD Cup on my potato laptop, a Lenovo U310. I did not know what I was doing; all I did was throw data into XGBoost, and my performance then was a joke. Over these past few years I have watched myself become more and more capable of coming up with ideas and implementing them. Below is a repost of my summary for KDD 2018.

Hooray, fellow KDD competitors! I entered this competition on day 1 and very quickly established a reasonable baseline. Due to some things on the personal side, I practically stopped improving my solution at the beginning of May. Even though my methods did not work very well compared to many top players in phase 2, I think my solution may be worth sharing because of its relative simplicity. I did not touch the meo data at all, and one of my models just calculates medians.

Alternative data source

For up-to-date hourly air quality data, as shared in the forum, I used this for London and this for Beijing instead of the API from the organizer.

Handling missing data

I filled missing values in the air quality data in three steps (a rough sketch in code follows the list):

  1. Fill missing values for a station-measure combo based on the values from other stations.
    To be specific, I trained 131 LightGBM regressors for this. If the PM2.5 reading at 2:00 on May 20th is missing for Beijing's aotizhongxin station, the regressor aotizhongxin_aq-PM2.5 predicts that value from the known PM2.5 readings at 2:00 on May 20th from the 34 other stations in Beijing.
    I used a threshold to decide whether to do this imputation at all: if more than the threshold number of other stations are also missing a reading, this step is skipped.
  2. Fill the remaining missing values by looking forward and backward in time for known values.
  3. Finally, replace any values that are still missing with the overall mean.
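Roughly, the imputation for one station-measure combo looks something like the sketch below; the DataFrame layout, function name and threshold value are just for illustration, not the actual code.

```python
# Sketch of the three-step imputation for a single measure (e.g. PM2.5), where
# `aq` is a pandas DataFrame indexed by hourly timestamps, one column per station.
import pandas as pd
import lightgbm as lgb

def impute(aq: pd.DataFrame, target: str, max_missing_others: int = 10) -> pd.Series:
    others = [c for c in aq.columns if c != target]

    # Step 1: predict the target station from the other stations at the same hour,
    # training on rows where the target is known.
    train = aq.dropna(subset=[target])
    model = lgb.LGBMRegressor(n_estimators=200)
    model.fit(train[others], train[target])

    filled = aq[target].copy()
    # Only impute hours where enough of the other stations have readings.
    rows = filled.isna() & (aq[others].isna().sum(axis=1) <= max_missing_others)
    if rows.any():
        filled.loc[rows] = model.predict(aq.loc[rows, others])

    # Step 2: fill what is left by looking forward and backward in time.
    filled = filled.ffill().bfill()

    # Step 3: anything still missing gets the overall mean.
    return filled.fillna(filled.mean())
```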

Approaches

1. median of medians

This is a simple model that worked reasonably well in this Kaggle competition.

To predict the PM2.5 reading at 2:00 on May 20th for aotizhongxin, look back over a window of some number of days of history and calculate the median of the 2:00 PM2.5 readings from aotizhongxin within that window. Repeat this median calculation for a bunch of different window sizes to obtain a bunch of medians. The median of those medians is used as the prediction.

Intuitively this is just an aggregated "yesterday once more". The more large windows you put in the collection, the better the model memorizes the long-term trend; the more small windows you add, the quicker the model responds to recent events.
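A minimal sketch of the idea (the window sizes below are placeholders, not the ones I actually used):

```python
# Median-of-medians for one station-measure series, indexed by hourly timestamps.
import pandas as pd

def median_of_medians(history: pd.Series, target_hour: int,
                      windows=(3, 7, 14, 28, 56)) -> float:
    # Keep only past readings taken at the same hour of day as the target.
    same_hour = history[history.index.hour == target_hour]
    # One median per window size (tail(w) = the last w days at that hour) ...
    medians = [same_hour.tail(w).median() for w in windows]
    # ... and the median of those medians is the prediction.
    return pd.Series(medians).median()
```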

2. Facebook's Prophet

This is practically even simpler than the median of medians. I treated the number of days of history I throw at it and the model parameters changepoint_prior_scale and n_changepoints as the main hyperparameters and tweaked them. I did a bit of work to parallelize the fitting across all the station-measure combos to speed things up; other than that, it is pretty much out of the box.

I tried adding a holiday indicator and tweaking other parameters of the model, but they all degraded its performance.
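Fitting one station-measure combo looks roughly like the sketch below; the hyperparameter values shown are placeholders, not the tuned ones. Fitting all the combos can then be parallelized, e.g. by mapping such a function over the combos with a process pool.

```python
# Prophet forecast of the next 48 hours for one station-measure series.
import pandas as pd
from prophet import Prophet  # older installs: from fbprophet import Prophet

def forecast_48h(history: pd.DataFrame) -> pd.Series:
    # `history` has columns 'ds' (hourly timestamps) and 'y' (the readings).
    model = Prophet(changepoint_prior_scale=0.05, n_changepoints=25)
    model.fit(history)
    future = model.make_future_dataframe(periods=48, freq="H")
    forecast = model.predict(future)
    return forecast["yhat"].tail(48)
```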

3. neural network

My neural network is a simple feed-forward network with a single shortcut; I shamelessly copied the structure from a senior colleague's Kaggle solution and tweaked the hidden layer sizes.
The model looks like this:
[Figure: diagram of the network architecture]

The input to the neural network is the concatenation of (1) raw history readings, (2) median summary values over different window sizes, and (3) indicator variables for the city and the type of measure.

The output layer in the network is a dense layer with 48 units, each corresponding to an hourly reading in the next 48 hours.

The model is trained directly with SMAPE as the loss function, using the Adam optimizer. I tried standardizing the inputs to zero mean and unit variance, but that causes a problem when combined with the SMAPE loss, so I tried switching to a clipped MAE loss, which produced results similar to raw inputs with the SMAPE loss.
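For reference, a SMAPE loss can be written along these lines in Keras/TensorFlow; I am not claiming this is the exact loss code I used, and the epsilon is there just to avoid dividing by zero.

```python
# Sketch of a SMAPE loss for a 48-unit output layer.
import tensorflow as tf

def smape_loss(y_true, y_pred):
    # 2 * |error| / (|y_true| + |y_pred|), averaged over the batch and the 48 hours.
    num = 2.0 * tf.abs(y_pred - y_true)
    den = tf.abs(y_true) + tf.abs(y_pred) + tf.keras.backend.epsilon()
    return tf.reduce_mean(num / den)

# model.compile(optimizer="adam", loss=smape_loss)
```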

The model can be trained on a CPU-only machine in a very short time.

I tried out some CNN and RNN models but could not get them to work better than this simple model, so I abandoned them.

Training and validation setup

This is pretty tricky, and I am still not quite sure if I have done it correctly or not.

For approach 1 and 2

I generated predictions for a few historical months and calculated daily SMAPE scores locally. I then sampled 25 days and computed their mean SMAPE score, repeated this sample-and-score procedure a large number of times, and took the overall mean as the local validation score, which I used to select parameters.

For neural network

I split the history data into (X, y) pairs based on a splitting day, then moved the splitting day back by one day to generate another (X, y) pair. I did this 60 times and vertically concatenated the results to form my training data.

I used a grouped CV split on the concatenated dataset for cross-validation, so that measures from one station do not end up in both the training and the validation set. During training, the batch size was chosen so that all the data in a batch are based on the same splitting day. I did this to try to prevent information leakage.
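The grouped split looks roughly like this with scikit-learn's GroupKFold, using the station id as the group; the arrays here are stand-ins, not my real data.

```python
# Grouped cross-validation so that rows from one station never appear on both
# sides of a split.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(100, 20)              # stand-in features
y = np.random.rand(100, 48)              # stand-in 48-hour targets
station_ids = np.repeat(np.arange(10), 10)

gkf = GroupKFold(n_splits=5)
for train_idx, valid_idx in gkf.split(X, y, groups=station_ids):
    X_tr, X_va = X[train_idx], X[valid_idx]
    y_tr, y_va = y[train_idx], y[valid_idx]
    # ...train the network on (X_tr, y_tr), validate on (X_va, y_va)...
```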

I got average SMAPE scores of 0.40-0.44 for Beijing and 0.30-0.34 for London in my local validation setting, which I think lines up reasonably well with how the scores averaged out through May.

Closing

Without using any weather information or integrating any sort of forecast, all my models failed miserably on events like the sudden peak on May 27th in Beijing.

A quick post on Decision Trees

Today is my one-year anniversary of studying at Bentley University in MA, US. This short post on this special day is devoted to decision trees, a simple but often misused method. During my short time here, and the limited time I have spent talking to marketing analysts, I have seen cases where they take rules directly from decision trees, throw them into decks, and present them to executives as "insights". I think they should be a bit more careful, since decision trees can be rather unstable: small changes in the dataset can result in completely different trees.

Below are some Python scripts I used to build classification trees on the iris data. Each time, I do some sort of random sampling and feed the sample to the algorithm to build a tree. You can look at the differences yourself.
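The original scripts are not embedded here, so the snippet below is a minimal reconstruction of the idea with scikit-learn, reusing the 70% training fraction and the two seeds (36 and 1106) that appear in the figure names.

```python
# Fit a decision tree on a different random 70% sample of the iris data each time
# and print the learned split rules.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
for seed in (36, 1106):
    X_train, _, y_train, _ = train_test_split(
        iris.data, iris.target, train_size=0.7, random_state=seed)
    tree = DecisionTreeClassifier(criterion="entropy", random_state=seed)
    tree.fit(X_train, y_train)
    print(f"--- seed {seed} ---")
    print(export_text(tree, feature_names=list(iris.feature_names)))
```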

I will just upload pictures of a few of these trees.

[Figures: two decision trees built with the entropy criterion on 70% training samples, seeds 36 and 1106]

The two trees above are built on different random samples of the iris dataset. From the first one you get the rule: if petal length is less than or equal to 2.45 cm, then the flower is a setosa. Whereas from the second tree you get the rule: if petal width is less than or equal to 0.8 cm, then the flower is a setosa. Take a quick look at the scatter plot and you will see that both are pretty good rules; if you only build one decision tree, you will miss one of them.

[Figure: scatter plot of the iris data, colored by species]

Below are two more trees built on different samples of the iris dataset. I personally would not try to interpret the splitting rules from the third level of the tree onward (taking the root node as the first level).



[Figures: two decision trees built with the entropy criterion on samples of 15/45/5 and 15/45/10 observations per class]

Anyway, the takeaway is this: try different algorithms, try different samples as input, build many trees and take what is common among them, rather than simply building one tree.

Kaggle Digit Recognizer Revisited (Using R)

About the competition

Digit Recognizer is a competition that has been hosted on Kaggle for years (almost three and a half years so far?). The data for this competition were taken from the famous MNIST dataset, which has been extensively studied in the machine learning community (here is more information on that).

I tried the competition about a year ago with a very naive approach (basically a wrong way of performing PCA, plus a Random Forest with hand-picked hyperparameter values). Recently, during the Thanksgiving break, I got some time and revisited the competition to see if I could improve on my past results.

About the data set

The training and testing datasets are both .csv files. They are not big, but using the fread() function from the data.table package can be a bit faster.

Look at the dimensions of the data set

The number of observations in the training set is n = 42000 and the dimension of the feature space  is p = 784.

Looking at these numbers, I thought the dimension of the feature space was somewhat high and that we might want to perform PCA on it.

Note that, in textbooks, high-dimensional problems refer to situations where p is greater than n, so technically that is not the case here.

Each observation represents a 28 by 28 pixel greyscale image, and each cell value is an integer between 0 and 255. Our task is to build a model that predicts which digit (0 to 9) each image represents.

The labels are pretty balanced in the training set.

Visualise the images in the dataset

We need to reshape the observations (each is a single row with 784 values) back into a 28 by 28 matrix in order to display the images.

We can do so by writing a simple function to perform that transformation.
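For illustration, the same idea looks like this in Python (the original post did it in R):

```python
# Turn one row of 784 pixel values into a 28 x 28 matrix that can be plotted.
import numpy as np

def row_to_image(row):
    return np.asarray(row, dtype=float).reshape(28, 28)
```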

[Figure: one training observation rendered as a 28 by 28 image]

Here is another observation:

[Figure: another training observation rendered as an image]

Try a K-NN model first

Before going straight to PCA, I tried a K-NN model on the dataset first.

The K-NN model simply looks at the K points nearest to the one we want to predict and, in some sense, takes an average of those K nearest neighbors' values as the prediction.

The model is often used to demonstrate the curse of dimensionality: when the dimensionality of the feature space is high, the feature space becomes very sparse and, as a result, the distances between any pair of data points become large. So for any given point, the region (which ideally should be a small local region) that tightly bounds its K nearest points becomes very large, which is a problem for the K-NN model.

OK, I did not think about all this complicated stuff last year when I did it. I built the K-NN model simply because it is the easiest one to code up, I was lazy, and I always want to see how a simple model performs.

It took two and a half hours (the runtime was re-measured this year on my newer laptop, as were all the runtimes in this post) to build the K-NN model on the full feature space with k = 50. The result is actually not bad: this model gives a leaderboard score of 0.94743. But that is a long time to build a single model; if we wanted to use K-fold cross-validation to choose the optimal k, it would take days if not weeks for even a handful of k values.

PCA the wrong way

Last year, when I performed PCA on the dataset, I actually merged the training and testing features into a single dataset, computed the PCA, and projected the merged data onto the first 100 principal component directions.

[Figure: variance explained by the principal components]

The first 100 principal components explain about 91.5% of the variance, and I projected the merged dataset down to 100 dimensions.

I now know this is actually cheating, because I was implicitly using information from the testing set in the model building process. I should have used only the training set to learn the principal component directions and then transformed both the training and the testing data using those directions.

But anyway, the result showed that we can use PCA to greatly reduce the dimension of the features without losing much information.

Oh, I did make a plot to visualise the whole dataset using the first two principal components.

[Figure: the dataset projected onto the first two principal components]

A Random Forest fitted to the leaderboard

When building the Random Forest model last year, I basically used trial and error to make the leaderboard happy... That is, I manually changed the values of mtry and ntree and watched how the Kaggle leaderboard score changed.

The score is 0.9510, which is not bad and an improvement over the K-NN model.

And that is how far I went last year (March 2014, to be specific).

The right way to perform PCA

The right way to perform PCA for this problem is to learn the principal components on the training set only, and then project both the training set and the testing set onto the first however-many principal component directions.

Things get more complicated when we do K-fold cross-validation: in each iteration we should perform PCA only on the K-1 training folds to learn the directions, transform all K folds onto those directions, build the model on the transformed K-1 folds, and only then get the estimated out-of-sample error for that run. The runtime for K-fold cross-validation is therefore even worse, since the PCA has to be redone in every iteration.

It may not be that bad if training a Random Forest is not that slow, but how can we know how long it will take? We can analyze the runtime.

Understand the runtime of Random Forest

This Thanksgiving break, I wanted to know how long it would take to build a model on the full training set with a large number of trees in a Random Forest.

I ran some experiments over a grid of values for the number of training observations and the number of trees in the Random Forest, and tried to understand the time complexity of building a Random Forest with the default mtry value.

I plotted two sets of plots

[Figures: training time versus number of trees, and training time versus number of training observations]

Based on these two plots, if we fix the number of training observations (mtry and tree depth are of course also fixed, since they are left at their default values), the training time is linear in the number of trees in the forest. The same linear relationship holds between training time and the number of training observations when the number of trees is fixed.

I guess the runtime for the Random Forest model is O(n*ntree) if mtry and depth of trees are fixed.

Actually, I later found a post on Quora which states that the running time is O(n*ntree*d*mtry).

So I built a regression model using the interaction term ntree*ntrain as the independent variable to predict the training time, by heavily extrapolating... (which is bad for regression analysis in general, but might be OK for runtime analysis purposes).

The model predicts that building a Random Forest with 5000 trees on the full features would take roughly two and a half hours, which is about as slow as building the K-NN model; tuning the parameters with K-fold cross-validation may simply be too slow.

Feature Engineering

This year I also tried to create some features to help the models. I based them on my intuition about what a person, shown examples of digits by someone else, would find useful for telling them apart.

firstHorizontalPix and lastHorizontalPix capture the row number where the first or last non-background pixel appears in the image. Similarly, firstVerticalPix and lastVerticalPix capture the column number where the first or last non-background pixel appears. After randomly plotting many digits in the dataset, I got the intuition that some digits tend to be shorter than digit 1 in this dataset, and, of course, digit 7 tends to be fatter than digit 1. So I also counted the number of non-empty rows with numberHorizontalLines and the number of non-empty columns with numberVerticalLines.

densityUpperLeft, densityUpperRight, densityBottomLeft and densityBottomRight capture how dense the pixel values are in each quadrant of the image, and densityUpper, densityBottom, densityLeft and densityRight are partial sums of them. My intuition is that for digits like 8, 0, 1 and 3, the pixel density in the upper area will be close to that in the bottom part, but it will differ for digits like 7, 6, 2, 5, 4 and 9. The same kind of logic lies behind the other feature sets I created.

symmetryHorizontal and symmetryVertical measure the degree of symmetry in the image.

horizontalDiffs and verticalDiffs describe how tall and how fat each digit is.
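Purely as an illustration, a few of these features could be computed along the lines below; this is a Python sketch of my reading of the descriptions, not the original R code, so the exact definitions are assumptions.

```python
# Hand-made features from a 28 x 28 digit image, treating 0 as background.
import numpy as np

def handmade_features(img: np.ndarray) -> dict:
    rows = np.where(img.sum(axis=1) > 0)[0]   # rows containing ink
    cols = np.where(img.sum(axis=0) > 0)[0]   # columns containing ink
    half = img.shape[0] // 2
    return {
        "firstHorizontalPix": rows.min(),
        "lastHorizontalPix": rows.max(),
        "firstVerticalPix": cols.min(),
        "lastVerticalPix": cols.max(),
        "numberHorizontalLines": len(rows),
        "numberVerticalLines": len(cols),
        "densityUpper": img[:half, :].sum(),
        "densityBottom": img[half:, :].sum(),
        "horizontalDiffs": rows.max() - rows.min(),   # how tall the digit is
        "verticalDiffs": cols.max() - cols.min(),     # how fat the digit is
        "symmetryHorizontal": np.abs(img - img[::-1, :]).sum(),
        "symmetryVertical": np.abs(img - img[:, ::-1]).sum(),
    }
```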

Feature Importance

I got the feature importance using Random Forest and here is a plot of that.

[Figure: Random Forest feature importance]

It seems the features I created are somewhat important. To see this more clearly, here is the list of the top 20 features.

Cross-Validation with PCA

Even though we are down to a little more than 100 features, tuning mtry and ntree (or even the tree depth) with K-fold cross-validation, with PCA redone in each iteration, is time-consuming. I coded it up but never actually ran it.

Here is the code for choosing the mtry value for a fixed number of trees using such a method.
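The original R code is not reproduced here; as a stand-in, this is a sketch of the same idea with scikit-learn, where the PCA is refit inside each training fold via a Pipeline and max_features plays the role of mtry (the grid values are illustrative).

```python
# Choose the mtry analogue (max_features) for a fixed number of trees, with PCA
# learned only on the training folds of each CV split.
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("pca", PCA(n_components=100)),
    ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
])
search = GridSearchCV(pipe, {"rf__max_features": [2, 4, 8, 16, 32]},
                      cv=5, n_jobs=-1)
# search.fit(X_train, y_train)  # expensive: PCA + forest are refit in every fold
```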

Choosing mtry with OOB Error estimate

What I actually did was look at the OOB error estimates.

When each decision tree in the Random Forest is built, its training data is a bootstrap sample of the original dataset, which on average contains about 2/3 of the unique observations from the original training set. So we can estimate the out-of-sample error rate with the roughly 1/3 of observations that are out-of-bag. This is a valid way of measuring the out-of-sample error rate, and for Random Forests it is much faster than K-fold cross-validation.
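To see where the 2/3 figure comes from: the probability that a given observation appears in a bootstrap sample of size N is

$$P(\text{observation } i \text{ is in-bag}) = 1-\left(1-\frac{1}{N}\right)^{N} \;\xrightarrow{\,N\to\infty\,}\; 1-e^{-1}\approx 0.632,$$

so roughly 63% (about 2/3) of the observations are in-bag for each tree and the remaining ~37% (about 1/3) are out-of-bag.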

I searched a grid of mtry and ntree values using the OOB method, and generated a plot.

[Figure: OOB error across the grid of mtry and ntree values]

It seems that for higher mtry values, the ntree value at which the OOB error starts to converge is lower. Picking mtry = 8 and ntree = 1000 looks like a reasonable choice to me.

I am not sure whether the one-standard-error rule applies here, but after the calculation I found that for mtry = 8 the minimum number of trees needed for the OOB error to be within one standard error of the minimum is 500.

So, after a much longer search, I arrived at the combination mtry = 8 and ntree = 500, which is almost exactly what I hand-picked last year...

The hyperparameter values are almost identical; however, I made some new features that I did not have last year. Would these promising-looking features really help make the leaderboard happier?

Not really... the leaderboard score drops to 0.94329.

Closing

It may seem that I did a lot of things in vain :< , but I think they were worth trying. No one can be lucky with every model they build.

Support Vector Machine, Part I

Support vector machines are used for classification. Compared with the perceptron (PLA), an SVM not only tries to find a hyperplane that perfectly separates the data, it also wants this hyperplane to be as "fat" as possible, i.e. the distance between the hyperplane and the nearest data points should be as large as possible.

[Figure: a "fat" separating hyperplane]

The intuition is that the farther the separating hyperplane is from the nearest data points, the more tolerant it is of noise, and therefore the more robust the classifier.

1. Let $w$ be the normal vector of the separating hyperplane.
2. The classifier assigns a class to each data point by $h(x) = \text{sign}(w^{T}x + b)$.
3. Perfectly classifying the data is equivalent to requiring $y_{n}(w^{T}x_{n}+b) > 0$ for all $n$.
4. The distance from each data point to the hyperplane is $\frac{1}{\|w\|}\,y_{n}(w^{T}x_{n}+b)$.

The SVM problem is defined as follows:
$$\max_{b,w}\quad \text{margin}(b,w)$$
$$\text{subject to}\quad y_{n}(w^{T}x_{n}+b) > 0,\quad n=1,\dots,N$$
$$\text{where}\quad \text{margin}(b,w) = \min_{n=1,\dots,N}\frac{1}{\|w\|}\,y_{n}(w^{T}x_{n}+b)$$
Since rescaling $b$ and $w$ by the same factor does not affect the constraints, we may set $\min_{n=1,\dots,N} y_{n}(w^{T}x_{n}+b) = 1$,
which simplifies the problem above to:

$$\max_{b,w}\quad \frac{1}{\|w\|}$$
$$\text{subject to}\quad \min_{n=1,\dots,N} y_{n}(w^{T}x_{n}+b) = 1$$

The constraint can be relaxed further by another scaling argument, and turning the maximization into a minimization gives the standard form of the SVM:

$$\min_{b,w}\quad \frac{1}{2}w^{T}w$$
$$\text{subject to}\quad y_{n}(w^{T}x_{n}+b)\geq 1,\quad n=1,\dots,N$$

This is a quadratic programming problem with $d+1$ variables and $N$ constraints.

Here is an example. In MATLAB, using cvx, solve an SVM for four data points:
$x_1=(0,0),\; y_1=-1$
$x_2=(2,2),\; y_2=-1$
$x_3=(2,0),\; y_3=+1$
$x_4=(3,0),\; y_4=+1$
We now solve for the SVM that satisfies the constraints; the code is as follows:
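The cvx script itself is not reproduced here; as a stand-in, this is a sketch of the same primal problem in Python with cvxpy (my substitution, not the original MATLAB code).

```python
# Hard-margin SVM primal for the four points above.
import cvxpy as cp
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w = cp.Variable(2)
b = cp.Variable()
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print(w.value, b.value)  # should come out close to w = [1, -1], b = -1
```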

Solving gives $b = -1$, $w = [1; -1]$.

The standard form of the SVM has a quadratic objective and linear constraints, so it is a quadratic program and can be handed to any QP solver.

The equivalent QP form is:
$$\min_{u}\quad \frac{1}{2}u^{T}Qu + p^{T}u$$
$$\text{subject to}\quad a_{m}^{T}u \geq c_{m},\quad \text{for } m = 1,2,\dots,M$$

where, in the objective,
$$u = \begin{bmatrix} b \\ w \end{bmatrix},\qquad Q = \begin{bmatrix} 0 & 0_{d}^{T} \\ 0_{d} & I_{d} \end{bmatrix},\qquad p = 0_{d+1}$$
and, in the constraints,
$$a_{n}^{T} = y_{n}\,[\,1 \quad x_{n}^{T}\,],\qquad c_{n} = 1,\qquad M = N$$

Below we again use cvx to solve the QP form of the example above, this time also applying a nonlinear transformation to $x$ first:

Solving gives $u = \begin{bmatrix} -9 \\ 2 \\ 0 \end{bmatrix}$, that is, $b = -9$, $w = [2; 0]$.

 

I do not yet fully understand Lagrange multipliers and the KKT conditions, but it is said that converting the standard SVM form above into its dual problem can speed up the solution. The dual form of the SVM is:

$$\min_{\alpha}\quad \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_{n}\alpha_{m}y_{n}y_{m}z_{n}^{T}z_{m} - \sum_{n=1}^{N}\alpha_{n}$$

$$\text{subject to}\quad \sum_{n=1}^{N}y_{n}\alpha_{n} = 0$$

$$\alpha_{n}\geq 0,\quad \text{for } n=1,2,3,\dots,N$$

The dual is still a quadratic program, with $N$ variables and $N+1$ constraints, where the $\alpha_{n}$ are the Lagrange multipliers.

It is solved as:

$$\min_{\alpha}\quad \frac{1}{2}\alpha^{T}Q_{D}\alpha - 1^{T}\alpha$$

$$\text{subject to}\quad y^{T}\alpha = 0$$

$$\alpha_{n}\geq 0,\quad \text{for } n=1,2,3,\dots,N$$

where the entries of $Q_{D}$ are $q_{n,m} = y_{n}y_{m}z_{n}^{T}z_{m}$.

Example code in CVX:

Solving gives $\alpha$ = [0.0000; 0.7037; 0.7037; 0.8889; 0.2593; 0.2593; 0.0000], which means the 2nd through 6th data points are support vectors. The corresponding $b$ and $w$ are obtained as follows:

$$w = \sum_{SV}\alpha_{n}y_{n}z_{n}$$

$$b = y_{n} - w^{T}z_{n},\quad \text{for any support vector } (z_{n}, y_{n})$$
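As an illustration (again swapping MATLAB cvx for Python's cvxpy), here is a sketch of the dual QP using the four points from the primal example, i.e. $z_n = x_n$ with no transformation; it recovers the same $w$ and $b$.

```python
# Hard-margin SVM dual; 0.5*||Zy^T a||^2 equals 0.5 * sum_nm a_n a_m y_n y_m z_n^T z_m.
import cvxpy as cp
import numpy as np

Z = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
Zy = Z * y[:, None]                       # rows are y_n * z_n

alpha = cp.Variable(len(y))
objective = cp.Minimize(0.5 * cp.sum_squares(Zy.T @ alpha) - cp.sum(alpha))
cp.Problem(objective, [y @ alpha == 0, alpha >= 0]).solve()

sv = alpha.value > 1e-5                   # support vectors have alpha_n > 0
w = Zy.T @ alpha.value                    # w = sum over SVs of alpha_n y_n z_n
b = y[sv][0] - w @ Z[sv][0]               # b = y_n - w^T z_n for any SV
print(w, b)                               # roughly w = [1, -1], b = -1
```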

In Python, an SVM can be solved with scikit-learn's SVM module as follows, with X and y storing the data and labels as numpy.ndarray:
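The original snippet is not shown here; a minimal sketch of what it might look like, approximating the hard margin with a linear kernel and a large C:

```python
# Linear SVM on the four example points with scikit-learn.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C ~ hard margin
clf.fit(X, y)
print(clf.coef_, clf.intercept_)    # roughly w = [1, -1], b = -1
print(clf.support_)                 # indices of the support vectors
```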

Everything introduced in this part is the "hard-margin" SVM; Part II will cover the "soft-margin" SVM.

Learning From Data, Lecture 8: The Bias-Variance Tradeoff

Segment 1: Bias and Variance

Quantifying the tradeoff:

From VC theory we know $E_{out} \leq E_{in} + \Omega$.

$E_{in}$ is an approximation of $E_{out}$; it is the in-sample error of the hypothesis $h$ fit to the training data.

The generalization error term $\Omega$ is what tells us how performance carries over from in-sample to out-of-sample.

Bias-variance analysis is another approach; it tries to decompose the error into two parts, namely:

  1. How well the hypothesis set H can approximate the target function f.
    Ignoring the sample, assume f and H are known, and measure the gap between f and the hypothesis h in H that is closest to f.
  2. How likely we are to find that h within the hypothesis set.
    Assuming we know the h in H that is closest to f, how likely are we, given the data at hand, to actually select it?

In the discussion below we use a real-valued target function and squared error as the error measure.

Start from $E_{out}$:

For the final hypothesis $g$ selected using a given training set $D$, the out-of-sample error is:

  • $E_{out}(g^{(D)}) = \mathbb{E}_{x}\big[(g^{(D)}(x) - f(x))^{2}\big]$

Clearly, the value of the out-of-sample error depends on the particular training set $D$.

Now suppose we only fix the size of the training set at $N$ and allow many such training sets to be generated; different training sets then yield different final hypotheses and different out-of-sample errors.

To obtain a general out-of-sample error that does not depend on any particular training set $D$, take the expectation over $D$ on both sides:

  • $\mathbb{E}_{D}\big[E_{out}(g^{(D)})\big] = \mathbb{E}_{D}\Big[\mathbb{E}_{x}\big[(g^{(D)}(x)-f(x))^{2}\big]\Big]$

By the law of iterated expectations:

  • $\mathbb{E}_{D}\big[E_{out}(g^{(D)})\big] = \mathbb{E}_{x}\Big[\mathbb{E}_{D}\big[(g^{(D)}(x)-f(x))^{2}\big]\Big]$

We now focus on the inner term on the right-hand side, the one that depends only on the training set $D$:

$\mathbb{E}_{D}\big[(g^{(D)}(x)-f(x))^{2}\big]$

The average hypothesis:

Definition of the average hypothesis:

For a particular hypothesis set H, the average hypothesis $\bar{g}(x)$ is defined as:

$\bar{g}(x) = \mathbb{E}_{D}\big[g^{(D)}(x)\big]$

Here $x$ is a given data point, $g^{(D)}(\cdot)$ is the final hypothesis selected for a particular training set $D$, $g^{(D)}(x)$ is that hypothesis's value at the given point $x$, and $\mathbb{E}_{D}\big[g^{(D)}(x)\big]$ is the average of these values over training sets.

If the number of training sets $D$ is very large, this average may come very close to $f$.


Now expand $\mathbb{E}_{D}\big[(g^{(D)}(x)-f(x))^{2}\big]$:

$\mathbb{E}_{D}\big[(g^{(D)}(x)-f(x))^{2}\big] = \mathbb{E}_{D}\big[(g^{(D)}(x)-\bar{g}(x)+\bar{g}(x)-f(x))^{2}\big]$

$= \mathbb{E}_{D}\big[(g^{(D)}(x)-\bar{g}(x))^{2} + (\bar{g}(x)-f(x))^{2} + 2\,(g^{(D)}(x)-\bar{g}(x))(\bar{g}(x)-f(x))\big]$

Since the expectation is taken over $D$:

$\bar{g}(x)-f(x)$ is a constant, so the cross term hinges on $\mathbb{E}_{D}\big[g^{(D)}(x)-\bar{g}(x)\big]$.

Because the expectation of a difference equals the difference of the expectations:

  • $\mathbb{E}_{D}\big[g^{(D)}(x)-\bar{g}(x)\big] = \mathbb{E}_{D}\big[g^{(D)}(x)\big] - \mathbb{E}_{D}\big[\bar{g}(x)\big]$, where $\mathbb{E}_{D}\big[g^{(D)}(x)\big] = \bar{g}(x)$ and, since $\bar{g}(x)$ is a constant, $\mathbb{E}_{D}\big[\bar{g}(x)\big] = \bar{g}(x)$; the two terms cancel, so the cross term is 0.

Therefore:

  • $\mathbb{E}_{D}\big[(g^{(D)}(x)-f(x))^{2}\big] = \mathbb{E}_{D}\big[(g^{(D)}(x)-\bar{g}(x))^{2}\big] + (\bar{g}(x)-f(x))^{2}$

To interpret this formula:

$\mathbb{E}_{D}\big[(g^{(D)}(x)-f(x))^{2}\big]$ tells us, over training sets $D$, how far the final hypothesis $g$ that we pick from the hypothesis set H is from the target function $f$.

We decompose it into two parts:

  1. $\mathbb{E}_{D}\big[(g^{(D)}(x)-\bar{g}(x))^{2}\big]$ tells us how far the final hypothesis $g$ picked from H is from the best-possible average hypothesis $\bar{g}$.
  2. $(\bar{g}(x)-f(x))^{2}$ is the gap between this best-possible average hypothesis $\bar{g}$ and the target function $f$.

Variance and bias correspond to these two parts:

  1. Variance, $\mathrm{var}(x) = \mathbb{E}_{D}\big[(g^{(D)}(x)-\bar{g}(x))^{2}\big]$: the distance between the final hypothesis $g$ selected based on the training set $D$ and the average hypothesis $\bar{g}$; it reflects the influence of the particular training set $D$.
  2. Bias, $\mathrm{bias}(x) = (\bar{g}(x)-f(x))^{2}$: the gap, at a given point, between the best the hypothesis set can do and the target function; it reflects the influence of the hypothesis set itself.

This simplifies the out-of-sample error expression we started with:

$\mathbb{E}_{D}\big[E_{out}(g^{(D)})\big] = \mathbb{E}_{x}\Big[\mathbb{E}_{D}\big[(g^{(D)}(x)-f(x))^{2}\big]\Big] = \mathbb{E}_{x}\big[\mathrm{bias}(x)+\mathrm{var}(x)\big]$

The bias-variance tradeoff:

  • Bias: $\mathbb{E}_{x}[\mathrm{bias}(x)] = \mathbb{E}_{x}\big[(\bar{g}(x)-f(x))^{2}\big]$
  • Variance: $\mathbb{E}_{x}[\mathrm{var}(x)] = \mathbb{E}_{x}\Big[\mathbb{E}_{D}\big[(g^{(D)}(x)-\bar{g}(x))^{2}\big]\Big]$

Two figures below illustrate the tradeoff between bias and variance.

[Figure: a low-complexity hypothesis set H relative to the target f]

The figure above represents a hypothesis set H of low complexity: the best hypothesis in the set may be far from the target function, but the variance is likely to be small.

[Figure: a high-complexity hypothesis set H relative to the target f]

The figure above represents a hypothesis set H of high complexity: the best hypothesis in the set may be close to the target function (the target may even be contained in the set), but the variance is likely to be large.

What this tells us:

As the complexity of the hypothesis set H increases, the bias decreases and the variance increases.

An example:

Suppose the target function is $f:[-1,+1]\to\mathbb{R}$, $f(x)=\sin(\pi x)$.

Suppose our training set contains only two points, $N=2$: $D=\{(x_{1},y_{1}),(x_{2},y_{2})\}$.

Suppose we have two hypothesis sets:

H0: constants, $h(x)=b$

H1: linear functions, $h(x)=ax+b$

For H0, choosing $g(x)=\frac{y_{1}+y_{2}}{2}$ minimizes $E_{in}$.

[Figure: the constant fit for a two-point sample]

For H1, choosing $g(x)=\frac{y_{1}-y_{2}}{x_{1}-x_{2}}\,x+\frac{x_{1}y_{2}-x_{2}y_{1}}{x_{1}-x_{2}}$ (the line through the two points) minimizes $E_{in}$.

[Figure: the line fit for a two-point sample]

Now we want to compare the out-of-sample error $E_{out}$ of the two hypothesis sets,

which amounts to looking at the area between the fitted line and the curve of the target function:

[Figure: the area between a fitted line and the sine target]

Next, consider what the average hypothesis is for each hypothesis set.

For H0:

In the figure below, the left panel shows the distribution of the $g(\cdot)$ fitted by repeatedly drawing 2-point training sets $D$; the green line in the right panel is the average of these hypotheses, i.e. $\bar{g}(x)$.

$\bar{g}(x)$ is the output we would like to end up with as the final hypothesis, but we cannot necessarily select it; that depends on the given data set $D$.

The shaded region in the right panel covers plus or minus one standard deviation, so:

the difference between the blue curve and the green line represents the bias, while the shaded region represents the variance.

[Figure: repeated fits and the average hypothesis for H0, with a one-standard-deviation band]

For H1:

In the figure below, the left panel shows the distribution of the $g(\cdot)$ fitted by repeatedly drawing 2-point training sets $D$; the red line in the right panel is the average of these hypotheses, i.e. $\bar{g}(x)$.

$\bar{g}(x)$ is the output we would like to end up with as the final hypothesis, but we cannot necessarily select it; that depends on the given data set $D$.

The shaded region in the right panel covers plus or minus one standard deviation, so:

the difference between the blue curve and the red line represents the bias, while the shaded region represents the variance.

[Figure: repeated fits and the average hypothesis for H1, with a one-standard-deviation band]

Comparing the two hypothesis sets, we can see that:

compared with H1, H0 has smaller variance but larger bias. The concrete values for both are shown below.

[Figure: the numerical bias and variance values for H0 and H1]

Therefore, given a training set $D$ with only two data points, the winner is H0.

The lesson for us:

In a machine learning problem, we are matching the model complexity to the training data we have, not to the complexity of the target function.
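To make the example concrete, here is a small simulation sketch (my own, not from the lecture) that estimates the bias and variance of H0 and H1 by repeatedly drawing two-point data sets:

```python
# Estimate bias and variance for H0 (constants) and H1 (lines) with f(x) = sin(pi*x), N = 2.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(np.pi * x)

n_datasets = 10000
x_grid = np.linspace(-1, 1, 201)
g0 = np.empty((n_datasets, x_grid.size))   # fitted constants
g1 = np.empty((n_datasets, x_grid.size))   # fitted lines

for i in range(n_datasets):
    x1, x2 = rng.uniform(-1, 1, size=2)
    y1, y2 = f(x1), f(x2)
    g0[i] = (y1 + y2) / 2                          # best constant fit
    a = (y1 - y2) / (x1 - x2)                      # line through both points
    b = (x1 * y2 - x2 * y1) / (x1 - x2)
    g1[i] = a * x_grid + b

for name, g in (("H0", g0), ("H1", g1)):
    g_bar = g.mean(axis=0)                         # the average hypothesis
    bias = np.mean((g_bar - f(x_grid)) ** 2)       # E_x[(g_bar - f)^2]
    var = np.mean(g.var(axis=0))                   # E_x[E_D[(g - g_bar)^2]]
    print(f"{name}: bias ~ {bias:.2f}, var ~ {var:.2f}")
```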


Segment 2: Learning Curves

A learning curve plots the in-sample error and the out-of-sample error on the same chart, as a function of the number of training examples.

The figure below shows the learning curves of a simple model and of a complex model.

[Figure: learning curves of a simple model and a complex model]

For the simple model, the gap between the in-sample and out-of-sample errors is small; as the training set grows, the error it eventually converges to is not that low, but the curves converge quickly.

For the complex model, the gap between the in-sample and out-of-sample errors is large; as the training set grows, the error it eventually reaches is lower, but the curves converge more slowly.

Reading the complex model's learning curve: with few samples, the complex model essentially "memorizes" the sample, so its error on out-of-sample data is large; only when the sample is large enough is this "prejudice" broken, allowing the model to generalize.

Bias-variance analysis and VC analysis tell us similar things.

[Figure: the learning curve decomposed by VC analysis (left) and by bias-variance analysis (right)]

The left panel above is the VC-theory-based analysis:

  • the blue region represents the in-sample error
  • for a given N, draw a vertical line; the distance between its intersections with the two curves is proportional to the generalization error Ω

The right panel is the bias-variance analysis:

  • the blue region represents the bias, i.e. the gap between $\bar{g}(x)$ and $f$
  • the red region represents the variance

Both analyses account for the in-sample and out-of-sample errors; the difference is:

in the VC-based analysis, both the in-sample and the out-of-sample error change with the sample size, whereas

in the bias-variance analysis, the bias is a constant determined by the hypothesis set, while the variance is determined jointly by the hypothesis set and the sample size.

It is fairly easy to see why the learning curve has this structure under the bias-variance analysis; the example below explains its structure under the VC-based analysis.

Finally, an example with linear regression:

Suppose the target we are trying to learn is a noisy linear function: $y = w^{T}x + \text{noise}$.

The training set is $D = \{(x_{1},y_{1}),\dots,(x_{N},y_{N})\}$.

Using the normal-equation solution of linear regression, the final hypothesis $g$ has $w = (X^{T}X)^{-1}X^{T}y$.

The in-sample error $E_{in}$ is computed from the residual vector $Xw - y$.

For the same data $X$, generate $y'$ with fresh noise and use the residual $Xw - y'$ to estimate the out-of-sample error $E_{out}$.

[Figure: expected learning curves for linear regression]

The resulting learning curves are shown above.

In the figure:

  • $\sigma^{2}$ is the best out-of-sample error achievable
  • the expected in-sample error is $\sigma^{2}\left(1-\frac{d+1}{N}\right)$
  • the expected out-of-sample error is $\sigma^{2}\left(1+\frac{d+1}{N}\right)$
  • the expected generalization error is $2\sigma^{2}\,\frac{d+1}{N}$

Here $d+1$ is the number of free parameters (degrees of freedom) of the hypothesis set.

The expression for the expected generalization error is also the source of the interpretation of the learning curves in the previous segment.
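As a sanity check on the expressions listed above, here is a small simulation sketch (my own, not from the lecture): it fits $w$ by least squares for several values of $N$ and compares the average $E_{in}$ and $E_{out}$ with $\sigma^{2}\left(1 \mp \frac{d+1}{N}\right)$; the values of $\sigma$, $d$ and the grid of $N$ are arbitrary choices.

```python
# Simulated learning curve for linear regression with Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
d, sigma, trials = 5, 1.0, 2000
w_true = rng.standard_normal(d + 1)

for N in (10, 20, 40, 80):
    e_in, e_out = [], []
    for _ in range(trials):
        X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
        y = X @ w_true + sigma * rng.standard_normal(N)
        w = np.linalg.lstsq(X, y, rcond=None)[0]              # least-squares fit
        y_new = X @ w_true + sigma * rng.standard_normal(N)   # same X, fresh noise
        e_in.append(np.mean((X @ w - y) ** 2))
        e_out.append(np.mean((X @ w - y_new) ** 2))
    print(N,
          round(np.mean(e_in), 3), round(sigma**2 * (1 - (d + 1) / N), 3),
          round(np.mean(e_out), 3), round(sigma**2 * (1 + (d + 1) / N), 3))
```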

Course page:

http://work.caltech.edu/telecourse.html