My Solution to The Analytics Edge Kaggle Competition

The Kaggle competition for MIT's edX course The Analytics Edge is now over.

Here is my solution to the competition. It ranked 244th on the final leaderboard, which is by no means great, but still within the top 10%.

My approach was to minimize the effort spent on hand-crafted feature engineering, instead using a couple of automated methods to extract features from the text in the dataset. For the modeling process, I used the Xgboost package, with cross-validation to pick the number of iterations and avoid overfitting to the training set.

0 Housekeeping

Set up the working directory

Load Libraries

Function Definitions

Function for dummy encoding
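The post's own helper code isn't shown; below is a minimal Python/pandas sketch of what a dummy-encoding function might look like. The example frame, column names, and values are assumptions for illustration only.

```python
import pandas as pd

def dummy_encode(df, columns):
    """Replace each listed categorical column with 0/1 indicator columns."""
    return pd.get_dummies(df, columns=columns, prefix=columns)

# Toy frame mimicking two fields from the competition dataset
df = pd.DataFrame({"NewsDesk": ["Business", "Culture", "Business"],
                   "WordCount": [120, 530, 80]})
encoded = dummy_encode(df, ["NewsDesk"])
```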

1 Data Preparation

Loading Data

Imputing missing categorical data
After inspecting the correlations among these variables, I used a really simple imputation approach.

Remove training entries whose NewsDesk value does not appear in the testing data
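A sketch of this filtering step in Python/pandas (the data here is a toy stand-in; the original code is not shown):

```python
import pandas as pd

# Toy stand-ins for the training and testing frames
train = pd.DataFrame({"NewsDesk": ["Business", "Sports", "Culture"],
                      "Popular": [1, 0, 0]})
test = pd.DataFrame({"NewsDesk": ["Business", "Culture"]})

# Keep only training rows whose NewsDesk value also occurs in the test set
train = train[train["NewsDesk"].isin(test["NewsDesk"].unique())]
```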

Change "" to "Other". This has no effect on modeling; I just don't like seeing empty strings.

Log-transform "WordCount"
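In Python this transform is one line with NumPy; the values below are toy numbers, not the real data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"WordCount": [0, 99, 999]})  # toy values
# log1p = log(1 + x), so zero-word entries map to 0 instead of -inf
df["LogWordCount"] = np.log1p(df["WordCount"])
```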

2 Feature Extraction

QR = does the headline contain a question mark?
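A binary "question mark" feature can be extracted with a one-line regex check; the headlines below are invented examples:

```python
import pandas as pd

headlines = pd.Series(["Is the Market Overvalued?", "Daily Report"])  # toy headlines
# QR-style feature: 1 if the headline contains a question mark, else 0
qr = headlines.str.contains(r"\?", regex=True).astype(int)
```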

Extract hour and day of week
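A Python sketch of extracting these two time features from a publication timestamp (the timestamp format is an assumption about the competition's CSV):

```python
import pandas as pd

# Toy "PubDate" strings; the real column format is assumed to parse the same way
pub = pd.Series(["2014-09-01 22:00:09", "2014-09-02 06:30:00"])
ts = pd.to_datetime(pub)
hour = ts.dt.hour          # 0-23
weekday = ts.dt.dayofweek  # Monday=0 ... Sunday=6
```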

Combine all headlines and abstracts to form a corpus

Corpus processing

Build the document ~ TF-IDF matrix and the document ~ TF matrix
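The original presumably built these matrices with R's text-mining tools; an equivalent sketch in Python with scikit-learn, using an invented toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["first daily news story", "second daily news story"]  # toy corpus
tf = CountVectorizer().fit_transform(corpus)       # document ~ TF (raw counts)
tfidf = TfidfVectorizer().fit_transform(corpus)    # document ~ TF-IDF weights
```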

Use the frequent-terms matrix as features

Clustering
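One common way to turn a term matrix into a single categorical feature is k-means on the TF-IDF vectors; a hedged sketch with an invented four-document corpus (the original's clustering method and parameters are not shown):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["stocks economy trade", "stocks economy markets",
          "movies culture arts", "arts culture theater"]  # toy docs
X = TfidfVectorizer().fit_transform(corpus)
# Cluster the documents; the cluster id becomes one categorical feature
labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
```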

PCA (Principal Component Analysis)
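For sparse term matrices, PCA is typically done via truncated SVD (also known as latent semantic analysis); a minimal sketch, with a toy corpus standing in for the real one:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["daily news story", "daily news report", "sports game score"]  # toy docs
X = TfidfVectorizer().fit_transform(corpus)
# TruncatedSVD plays the role of PCA for sparse matrices; the projected
# components can be appended to the feature set
components = TruncatedSVD(n_components=2, random_state=1).fit_transform(X)
```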

LDA (Latent Dirichlet Allocation)
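A sketch of topic extraction with LDA in scikit-learn (the topic count and corpus here are assumptions); the per-document topic proportions become additional features:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["economy stocks trade", "stocks trade markets",
          "movies theater arts", "arts movies culture"]  # toy docs
tf = CountVectorizer().fit_transform(corpus)  # LDA expects raw term counts
lda = LatentDirichletAllocation(n_components=2, random_state=1)
doc_topics = lda.fit_transform(tf)  # rows are per-document topic proportions
```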

Dummy Encoding

3 Model Fitting

Use cross-validation to pick the number of rounds for Xgboost


