
The Analytics Edge: Lecture Code in Python, Week 4, Supreme Court

VIDEO 4

Read in the data

 

CART Model

Proposed formula: Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst

Notice that all predictors except Unconst are categorical.

So, even before splitting the dataset, we need to do some extra work.

   Docket   Term  Circuit  Issue             Petitioner  Respondent  LowerCourt  Unconst  Reverse
0  93-1408  1994  2nd      EconomicActivity  BUSINESS    BUSINESS    liberal     0        1
1  93-1577  1994  9th      EconomicActivity  BUSINESS    BUSINESS    liberal     0        1
2  93-1612  1994  5th      EconomicActivity  BUSINESS    BUSINESS    liberal     0        1
3  94-623   1994  1st      EconomicActivity  BUSINESS    BUSINESS    conser      0        1
4  94-1175  1995  7th      JudicialPower     BUSINESS    BUSINESS    conser      0        1

Extra work for Python

Encoding categorical predictors using one-hot encoding

First: Create a dictionary of the categorical values for each row

Second: Transform the dictionaries into a binary one-hot encoded array, one row per observation

Third: Construct a separate dataframe from the one-hot encoded data and name the columns

Finally: Construct the transformed dataset
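The four steps above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's DictVectorizer is the encoder (the post's own code is not shown here); the dataframe is rebuilt by hand from the five sample rows printed above, and every variable name is hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# The five sample rows shown above, rebuilt by hand for illustration only.
data = pd.DataFrame({
    "Circuit": ["2nd", "9th", "5th", "1st", "7th"],
    "Issue": ["EconomicActivity"] * 4 + ["JudicialPower"],
    "Petitioner": ["BUSINESS"] * 5,
    "Respondent": ["BUSINESS"] * 5,
    "LowerCourt": ["liberal", "liberal", "liberal", "conser", "conser"],
    "Unconst": [0, 0, 0, 0, 0],
    "Reverse": [1, 1, 1, 1, 1],
})
categorical = ["Circuit", "Issue", "Petitioner", "Respondent", "LowerCourt"]

# First: one dictionary of categorical values per row.
row_dicts = data[categorical].to_dict(orient="records")

# Second: one-hot encode the dictionaries into a dense 0/1 array.
vectorizer = DictVectorizer(sparse=False)
encoded = vectorizer.fit_transform(row_dicts)

# Third: wrap the array in a dataframe with named columns ("Circuit=2nd", ...).
encoded_df = pd.DataFrame(encoded, columns=vectorizer.feature_names_, index=data.index)

# Finally: join the encoded columns with the numeric predictor and the target.
transformed = pd.concat([encoded_df, data[["Unconst", "Reverse"]]], axis=1)
print(transformed.columns.tolist())
```

Each categorical column becomes one 0/1 column per distinct value, so the number of columns grows with the number of levels seen in the data.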

Split the data

CART model

The tree algorithm implemented by scikit-learn is CART. Set the minimum number of data points in each leaf node to 25 (min_samples_leaf = 25).
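A minimal sketch of the split and the fit might look like this. The data here are synthetic (make_classification) and stand in for the encoded Supreme Court predictors; the 70/30 split and the random seeds are assumptions, not values taken from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded predictors (X) and the Reverse target (y).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# min_samples_leaf=25: no leaf may contain fewer than 25 training rows.
cart = DecisionTreeClassifier(min_samples_leaf=25, random_state=0)
cart.fit(X_train, y_train)

# Check the constraint on the fitted tree: leaves have children_left == -1.
tree = cart.tree_
leaf_sizes = tree.n_node_samples[tree.children_left == -1]
print("smallest leaf:", leaf_sizes.min())
```

A larger min_samples_leaf gives a smaller, more conservative tree; a smaller value lets the tree follow the training data more closely.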

Plot the tree. For this part, Graphviz must be installed on your machine and its executables added to the PATH environment variable.
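One way to get from a fitted tree to a picture is sketched below, assuming scikit-learn's export_graphviz is used to produce Graphviz DOT source. The model is fitted on synthetic data, and the feature names x0..x5 and the file names in the comments are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in data; replace with the encoded training set.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
cart = DecisionTreeClassifier(min_samples_leaf=25, random_state=0).fit(X, y)

# out_file=None makes export_graphviz return the DOT source as a string.
dot_source = export_graphviz(
    cart,
    out_file=None,
    feature_names=[f"x{i}" for i in range(X.shape[1])],
    class_names=["0", "1"],
    filled=True,
)
print(dot_source[:60])

# To render it, save the string to a file and call the Graphviz CLI, e.g.:
#   with open("tree.dot", "w") as handle:
#       handle.write(dot_source)
#   dot -Tpng tree.dot -o tree.png
```

The dot command in the comment is the part that needs Graphviz on the PATH; producing the DOT string itself only needs scikit-learn.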

Look at the tree


Make predictions
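A hedged sketch of this step, assuming the usual pattern of predicting class labels on the test set and summarising them in a confusion matrix. The data are synthetic and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and a tree fitted as in the CART step.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
cart = DecisionTreeClassifier(min_samples_leaf=25, random_state=0).fit(X_train, y_train)

# Hard 0/1 class predictions for the held-out rows.
predictions = cart.predict(X_test)

# Rows are actual classes, columns are predicted classes.
matrix = confusion_matrix(y_test, predictions)
accuracy = accuracy_score(y_test, predictions)
print(matrix)
print(f"accuracy = {accuracy:.3f}")
```

The accuracy is the sum of the diagonal of the confusion matrix divided by the number of test rows.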

ROC curve

We need real-valued prediction scores (predicted probabilities rather than class labels) to draw the ROC curve.
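The idea can be sketched like this, assuming predict_proba supplies the scores and roc_curve computes the curve. The data are synthetic, so the resulting AUC says nothing about the Supreme Court model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and a tree fitted as in the CART step.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
cart = DecisionTreeClassifier(min_samples_leaf=25, random_state=0).fit(X_train, y_train)

# predict() returns hard labels; predict_proba() returns class probabilities.
# Column 1 holds the estimated probability of the positive class.
scores = cart.predict_proba(X_test)[:, 1]

# One (false positive rate, true positive rate) point per score threshold.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"AUC = {auc:.3f}")
```

Plotting fpr against tpr (for example with matplotlib's plt.plot(fpr, tpr)) gives the curve itself. A tree produces only as many distinct scores as it has leaves, so the curve has few corner points.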


VIDEO 5 - Random Forests

Make predictions
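For reference, a random forest can be fitted and used for prediction along these lines. This is only a sketch on synthetic data; n_estimators=200 and min_samples_leaf=25 are assumed values chosen for illustration, not settings confirmed by this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the encoded training and test sets.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Assumed settings: 200 trees, at least 25 training rows per leaf.
forest = RandomForestClassifier(
    n_estimators=200, min_samples_leaf=25, random_state=0
)
forest.fit(X_train, y_train)

# Each tree votes; predict() returns the class favoured by the ensemble.
predictions = forest.predict(X_test)
print(f"accuracy = {accuracy_score(y_test, predictions):.3f}")
```

Fixing random_state makes the run repeatable; without it, each fit draws different bootstrap samples and the accuracy varies slightly from run to run.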

VIDEO 6

There is no complexity parameter cp for the CART model in scikit-learn. The value min_samples_leaf = 25 came from the lecturer; are there other options besides that? I will use cross-validation to choose a min_samples_leaf for training the model.
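A minimal sketch of that search, assuming GridSearchCV with 10-fold cross-validation and accuracy as the score. The candidate grid (5 to 50 in steps of 5) and the synthetic data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hypothetical candidate values for min_samples_leaf.
param_grid = {"min_samples_leaf": list(range(5, 55, 5))}

# Fit one tree per candidate and fold, then keep the best mean accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=10,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, search.best_estimator_ is a tree refitted on all of the training data with the selected min_samples_leaf.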

Cross-validation selected a different value.

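Using a Classification Tree Model to Predict Smartphone Users' Activities

The data come from the "Human Activity Recognition Using Smartphones" dataset collected by Jorge L. et al. The original data are available at:

http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

The analysis actually uses the version preprocessed by Jeff Leek:

https://spark-public.s3.amazonaws.com/dataanalysis/samsungData.rda

The data were recorded from 30 subjects who wore a phone strapped to the waist while performing six different activities: sitting, standing, lying down, walking upstairs, walking on level ground, and walking downstairs. The readings of the phone's various sensors were logged throughout.

Each row of the data frame corresponds to one observation

Columns 1 to 561 of the data frame each hold a reading from the phone's built-in sensors

Column 562 identifies the subject

Column 563 gives the subject's activity

Remove the symbols from the variable names: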
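The code for this step is not shown here. As a rough Python sketch of the idea, assuming "symbols" means every character that is not a letter or a digit, the names could be cleaned as below. The three raw names are illustrative examples of the dataset's naming style, and clean_name is a hypothetical helper.

```python
import re


def clean_name(name: str) -> str:
    """Drop every character that is not a letter or a digit."""
    return re.sub(r"[^0-9A-Za-z]", "", name)


# Illustrative column names containing hyphens, parentheses and commas.
raw_names = ["tBodyAcc-mean()-X", "tBodyAcc-std()-Y", "angle(X,gravityMean)"]
cleaned = [clean_name(name) for name in raw_names]
print(cleaned)
```

Stripping symbols can make two different raw names collapse into the same cleaned name, so it is worth checking that the cleaned names are still unique before using them as column labels.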

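Convert the activity variable to a factor variable:

As the assignment requires, split the dataset into a training set and a test set:

Building a classification tree is a recursive process: find the variable and threshold that "best" split the dataset into two groups, then recursively split each group in the same way, until every group contains only a single class.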
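The recursion described above can be sketched in plain Python. This is a teaching illustration under stated assumptions, not the library routine used in the post: "best" is taken to mean the lowest weighted Gini impurity, a group stops splitting when no split improves on its impurity, and the toy rows and labels are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) with the lowest weighted impurity, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:  # keep only strict improvements
                best, best_score = (feature, threshold), score
    return best


def build_tree(rows, labels):
    """Split recursively until a group is pure or no split improves it."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    feature, threshold = split
    left = [(row, lab) for row, lab in zip(rows, labels) if row[feature] <= threshold]
    right = [(row, lab) for row, lab in zip(rows, labels) if row[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([row for row, _ in left], [lab for _, lab in left]),
        "right": build_tree([row for row, _ in right], [lab for _, lab in right]),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Made-up toy data: two numeric readings per observation, three activities.
rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0], [5.0, 9.0], [6.0, 8.0]]
labels = ["sit", "sit", "stand", "stand", "walk", "walk"]
tree = build_tree(rows, labels)
print(tree)
```

Real implementations add stopping rules such as a minimum leaf size, because splitting until every group is pure tends to overfit the training set.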

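Fit a classification tree model to the training set:

Summary of the fitted classification tree. The model selected 10 of the 561 variables for classification:

Visualization of the classification tree:

Figure: classification tree

Apply the classification tree to the test set to predict the activities, and measure the accuracy:

Applied to the test set, the model classifies 80.32% of observations correctly.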