Category Archives: Data Mining

Kaggle Digit Recognizer Revisited (Using Convolutional NN with Keras)

Almost a year ago, I revisited the Kaggle version of the handwritten digit recognition problem; the link to that post is here. At that time, my go-to language was R, since most of the people around me used R as well. This summer, I switched back to Python as my primary language for almost everything, because it is just so efficient.

So here is a convolutional neural network using Keras to tackle this problem again: in fewer than 100 lines of code you can build a convolutional neural network and reach 99% accuracy on the Kaggle leaderboard.

A quick note about training time: it took close to 9 minutes to train on my laptop with a GeForce GTX 970M chip. You can increase the number of epochs and run it yourself; that should lead to better results.
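The network itself looks roughly like the sketch below. This is a minimal reconstruction using the Keras Sequential API (here via tensorflow.keras); the layer sizes, file names and number of epochs are illustrative, not the exact original script.

```python
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from tensorflow.keras.utils import to_categorical

# Kaggle's train.csv: first column is the label, the remaining 784 are pixel values.
train = pd.read_csv("train.csv")
X = train.drop("label", axis=1).values.reshape(-1, 28, 28, 1).astype("float32") / 255.0
y = to_categorical(train["label"].values, num_classes=10)

model = Sequential([
    Input(shape=(28, 28, 1)),
    Conv2D(32, (3, 3), activation="relu"),
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# More epochs generally helps; 10 is just a starting point.
model.fit(X, y, batch_size=128, epochs=10, validation_split=0.1)

# Predict on test.csv and write a Kaggle submission file.
test = pd.read_csv("test.csv").values.reshape(-1, 28, 28, 1).astype("float32") / 255.0
preds = model.predict(test).argmax(axis=1)
pd.DataFrame({"ImageId": np.arange(1, len(preds) + 1), "Label": preds}).to_csv(
    "submission.csv", index=False)
```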

 

A Logistic Regression Benchmark for Red Hat Customer Business Value Prediction Problem

Red Hat put out a competition on Kaggle asking people to build models to predict customer potential. It is a simple binary classification problem, and the metric Red Hat chose to rank the models is the AUC score.

I am rather late to this competition, with only 7 days to go. I sketched a fairly simple logistic regression model, and it ranks somewhere in the middle of the roughly 2,200 teams. I am kind of surprised to see that a simple logistic regression can beat half of the participants.

My model uses all the features, and I found that the regularization parameter C (the inverse of the penalty strength in scikit-learn) works best at a value of 10.

Below is my code:
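In outline, the approach looks like the sketch below. The file and column names are what I recall from the Kaggle Red Hat data, and the preprocessing (one-hot encoding every column with a DictVectorizer) is one reasonable choice rather than necessarily the exact original.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Merge the activity records with the people table on people_id.
people = pd.read_csv("people.csv")
acts = pd.read_csv("act_train.csv")
data = acts.merge(people, on="people_id", how="left", suffixes=("_act", "_ppl"))

y = data["outcome"].values
# Treat every remaining column as categorical and one-hot encode it.
records = (data.drop(["outcome", "people_id", "activity_id"], axis=1)
               .astype(str)
               .to_dict(orient="records"))
X = DictVectorizer(sparse=True).fit_transform(records)

# C is the inverse regularization strength in scikit-learn; C=10 scored best for me.
clf = LogisticRegression(C=10, max_iter=1000)
clf.fit(X, y)
```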

A quick post on Decision Trees

Today is my one-year anniversary of studying at Bentley University in MA, US, and this short post on this special day is devoted to decision trees, a simple but often misused method. In my short time here, and in my limited conversations with marketing analysts, I have seen cases where they take the rules directly from decision trees, throw them into decks, and present them to executives as "insights". I think they should be a bit more careful, since decision trees can be rather unstable: small changes in the dataset can result in completely different trees.

Below are some Python scripts I used to build classification trees on the iris data. Each time, I do some sort of random sampling and feed the sample to the algorithm to build a tree. You can look at the differences yourself.
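In outline, the experiment looks like the sketch below; the 70% train fraction and the seeds 36 and 1106 match the figures that follow, while the rest is a plausible reconstruction rather than the original script.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

for seed in (36, 1106):
    # Take a different 70% random sample each time and fit an entropy-based tree.
    X_train, _, y_train, _ = train_test_split(
        iris.data, iris.target, train_size=0.7, random_state=seed)
    tree = DecisionTreeClassifier(criterion="entropy", random_state=seed)
    tree.fit(X_train, y_train)
    # Print the learned rules; with different samples the top split can change.
    print(f"--- seed {seed} ---")
    print(export_text(tree, feature_names=iris.feature_names))
```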

I will just upload pictures of a few of these trees.

[Figures: decision trees built on 70% random training samples of the iris data, seeds 36 and 1106]

The two trees above are built on different random samples of the iris dataset. From the first one you can get the rule: if petal length is less than or equal to 2.45 cm, then the flower is a setosa. Whereas from the second tree you get the rule: if petal width is less than or equal to 0.8 cm, then the flower is a setosa. Take a quick look at the scatter plot and you will see that both are pretty good rules; if you only build one decision tree, you will miss one of them.

[Figure: colored scatter plot of the iris data]

Below are two more trees built on different samples of the iris dataset. I personally would not try to interpret the splitting rules from the third level of the tree onward (taking the root node as the first level).



[Figures: decision trees built on samples with 15, 45 and 5 flowers per class, and with 15, 45 and 10 flowers per class]

Anyway, the takeaway is: try different algorithms, try different samples as input, build many trees and take what is common among them, rather than simply building one tree.

K-Means Clustering

Choose the number of clusters k in advance.

  1. Initialize k centroids
  2. In each iteration:
    Assign every data point to its "nearest" centroid
    Update each centroid's position to the "average position" of the points assigned to it
  3. Repeat step 2 until convergence (an iteration in which no data point is reassigned)

Still using the example data from the earlier hierarchical clustering post:
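A minimal sketch of the procedure above, run on those six points (a reconstruction, not the original script; it assumes, for simplicity, that no cluster ever becomes empty):

```python
import numpy as np

def kmeans(points, k, seed=0):
    """Plain K-Means: alternate assignment and centroid updates until stable."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    while True:
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Converged: no point changed cluster in this iteration.
        if labels is not None and np.array_equal(new_labels, labels):
            return labels, centroids
        labels = new_labels
        # Move each centroid to the mean position of its assigned points.
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

pts = np.array([(1, 2), (2, 1), (0, 0), (5, 3), (5, 0), (4, 1)], dtype=float)
labels, centroids = kmeans(pts, k=2)
print(labels)     # points 0, 1, 2 fall in one cluster and 3, 4, 5 in the other
print(centroids)  # roughly (1, 1) and (4.67, 1.33)
```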

We get 2 clusters: cluster 1 contains data points 0, 1 and 2, and cluster 2 contains data points 3, 4 and 5. Look at the visualization to see whether this clustering makes intuitive sense.

The red points are the cluster centroids.

[Figure: the six example points clustered, with the two centroids shown in red]

Clustering the example data with scikit-learn's KMeans:
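A minimal version might look like this (n_clusters=2, everything else left at or near the defaults):

```python
import numpy as np
from sklearn.cluster import KMeans

pts = np.array([(1, 2), (2, 1), (0, 0), (5, 3), (5, 0), (4, 1)], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
print(km.labels_)           # one cluster for points 0, 1, 2 and one for 3, 4, 5
print(km.cluster_centers_)  # roughly (1, 1) and (4.67, 1.33)
```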

The result is the same.

How to choose k?

  • Try different values of k and look at the distances from each data point to its assigned centroid (see the sketch after this list).
    If k is too small, there will be many large distances
    If k is large, the distances are usually very small
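A quick sketch of this diagnostic on the example points, using scikit-learn's inertia_ (the sum of squared distances from points to their assigned centroids):

```python
import numpy as np
from sklearn.cluster import KMeans

pts = np.array([(1, 2), (2, 1), (0, 0), (5, 3), (5, 0), (4, 1)], dtype=float)
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts)
    # inertia_ drops sharply until k matches the real structure, then flattens out.
    print(k, round(km.inertia_, 2))
```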

How to choose the initial k centroids?

  1. Sampling
    Take a sample of the data points, run hierarchical clustering on it, and use that to find a value for k
    Pick one point from each of the k resulting clusters as the initial centroids
  2. Pick one point at random as the first centroid
    From the remaining points, pick the one farthest from the centroids already chosen as the next centroid
    Repeat until k centroids have been chosen (see the sketch after this list)
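A sketch of the second strategy (sometimes called farthest-point or maximin initialization):

```python
import numpy as np

def farthest_point_init(points, k, seed=0):
    """Pick one point at random, then repeatedly add the point farthest
    from all centroids chosen so far, until k centroids are selected."""
    rng = np.random.default_rng(seed)
    centroids = [points[rng.integers(len(points))]]
    while len(centroids) < k:
        # Distance from every point to its nearest already-chosen centroid.
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[dists.argmax()])
    return np.array(centroids)

pts = np.array([(1, 2), (2, 1), (0, 0), (5, 3), (5, 0), (4, 1)], dtype=float)
print(farthest_point_init(pts, k=2))  # the two centroids end up in different groups
```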

Hierarchical Clustering

Bottom-up (agglomerative) hierarchical clustering:

  1. Initially, every data point is its own cluster
  2. Repeatedly merge the two closest clusters

We need to decide:

  1. How to measure the distance between clusters
  2. How to represent a cluster after a merge
  3. When to stop merging

Common choices:

  • Represent a cluster by its centroid, the point that is "closest" to the cluster's members (smallest average distance, smallest maximum distance, smallest sum of squared distances, etc.)
  • Use the distance between two clusters' centroids as the distance between the clusters

Example:

Six data points: (1,2), (2,1), (0,0), (5,3), (5,0), (4,1)

The output after each merge is the set of data points contained in the newly formed cluster.
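A rough sketch of that procedure, merging by centroid distance as described above (a reconstruction, not the original script):

```python
import numpy as np
from itertools import combinations

points = np.array([(1, 2), (2, 1), (0, 0), (5, 3), (5, 0), (4, 1)], dtype=float)
clusters = [[i] for i in range(len(points))]  # start with one cluster per point

while len(clusters) > 1:
    # Find the pair of clusters whose centroids are closest.
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda p: np.linalg.norm(points[clusters[p[0]]].mean(axis=0)
                                            - points[clusters[p[1]]].mean(axis=0)))
    merged = clusters[i] + clusters[j]
    print("merged cluster:", merged)  # the members of the newly formed cluster
    clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
```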

scipy has hierarchical clustering built in; its result on the example data agrees with the above.
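For example, using scipy.cluster.hierarchy with centroid linkage (a sketch; the plotting part assumes matplotlib is available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

points = np.array([(1, 2), (2, 1), (0, 0), (5, 3), (5, 0), (4, 1)], dtype=float)
Z = linkage(points, method="centroid")  # centroid linkage mirrors the merging above

# Cutting the tree into two flat clusters separates {0, 1, 2} from {3, 4, 5}.
print(fcluster(Z, t=2, criterion="maxclust"))

dendrogram(Z)  # visualize the merge order
plt.show()
```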

[Figure: hierarchical clustering result on the example data]

Stopping criteria for the clustering:

  1. Fix the number of clusters k in advance, and stop once k clusters have formed
  2. Stop when a newly formed cluster's "cohesion" falls below a threshold

Measures of cohesion:

  1. Cluster diameter = the maximum distance between any two points in the cluster (sketched below)
  2. The distance between points in the cluster and the centroid exceeds a preset maximum
  3. The density of points in the cluster, compared against a preset value
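A small sketch of the first measure, the cluster diameter:

```python
import numpy as np
from itertools import combinations

def diameter(cluster_points):
    """Largest pairwise distance between points in one cluster."""
    if len(cluster_points) < 2:
        return 0.0
    return max(np.linalg.norm(a - b) for a, b in combinations(cluster_points, 2))

# Example: the diameter of the cluster {(1,2), (2,1), (0,0)} is sqrt(5) ≈ 2.24.
print(diameter(np.array([(1, 2), (2, 1), (0, 0)], dtype=float)))
```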