Category Archives: Python

Build TensorFlow from source with GPU support after the Meltdown kernel patch

My Google Cloud compute instance with Nvidia-docker, which I used to train deep learning models, suddenly stopped working a couple of days ago. The reason seems to be related to the recent Ubuntu kernel update that was intended to fix the Meltdown issue. The solution I found is to install a different kernel and build TensorFlow from source. As a reminder for myself, here are the steps:

  1. Get a newer kernel:
  2. Install the dependencies for building TensorFlow, and install CUDA if it is not already installed. Remember to reboot after installing CUDA.
  3. You might need to do the following:
  4. Check your GPU with:
  5. Install cuDNN: go to https://developer.nvidia.com/cudnn and download the files, scp them to the machine and install:
  6. Get libcupti and Bazel:
  7. Get the TensorFlow source, build it with GPU support, and finally install it:
  8. You are good to go.
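
As a quick sanity check after the steps above, here is a small Python sketch that reports whether the `nvidia-smi` tool is on the PATH, which GPUs it lists, and whether a `tensorflow` package can be found. It is not one of the commands from the steps; the function name `gpu_environment_report` is made up for illustration, and it assumes the Nvidia driver ships `nvidia-smi` on your machine.

```python
import importlib.util
import shutil
import subprocess


def gpu_environment_report():
    """Collect a few facts about the GPU setup without importing TensorFlow."""
    report = {
        # True if the Nvidia driver utility is reachable from this shell.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        # True if Python can locate a tensorflow package (it is not imported here).
        "tensorflow_installed": importlib.util.find_spec("tensorflow") is not None,
        "gpu_names": [],
    }
    if report["nvidia_smi_found"]:
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
                capture_output=True,
                text=True,
                timeout=30,
                check=False,
            )
        except (OSError, subprocess.TimeoutExpired):
            return report
        if result.returncode == 0:
            # One GPU name per output line.
            report["gpu_names"] = [
                line.strip() for line in result.stdout.splitlines() if line.strip()
            ]
    return report


print(gpu_environment_report())
```

If `nvidia_smi_found` is true but `gpu_names` is empty, the driver is probably not loaded for the running kernel, which is the kind of breakage this post is about.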

Kaggle Digit Recognizer Revisited (Using Convolutional NN with Keras)

Almost a year ago, I revisited the Kaggle version of the Hand Written Digit Recognition problem; the link to that post is here. At that time, my go-to language was R, since the majority of friends around me used R as well. This summer, I eventually switched back to Python as my primary language for almost everything, because it is just so efficient.

So, here is a convolutional neural network using Keras to tackle this problem again. In fewer than 100 lines of code you can build a convolutional neural network and reach 99% accuracy on the Kaggle leaderboard.

A quick note about training time: it took close to 9 minutes to train on my laptop with a GeForce GTX 970M chip. You can increase the number of epochs and run it yourself; that should lead to better results.

 

A Logistic Regression Benchmark for Red Hat Customer Business Value Prediction Problem

Red Hat put out a competition on Kaggle asking people to build models to predict customer potential. It is a simple binary classification problem, and the metric Red Hat uses to rank the models is the AUC score.

I am joining this competition rather late, with only 7 days to go. I sketched a rather simple logistic regression model, and it ranks somewhere in the middle among 2,200 teams in total. I was somewhat surprised to see that a simple logistic regression can beat half of the participants.

My model uses all the features, and I found that the penalty parameter C works best at a value of 10.

Below is my code:
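
A minimal sketch of this kind of benchmark, not the competition script itself: it assumes scikit-learn, uses synthetic data from `make_classification` as a stand-in for the Red Hat files, and the helper name `fit_benchmark` is made up for illustration. Note that in scikit-learn `C` is the inverse of the regularization strength, so a larger `C` means a weaker penalty.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def fit_benchmark(X, y, C=10.0, seed=0):
    """Fit a logistic regression on a train split and return it with its validation AUC."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    # AUC is computed from the predicted probability of the positive class.
    scores = model.predict_proba(X_valid)[:, 1]
    return model, roc_auc_score(y_valid, scores)


# Synthetic stand-in data (assumption): 2000 rows, 20 numeric features.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model, auc = fit_benchmark(X, y)
print(f"validation AUC: {auc:.3f}")
```

With the real data, the categorical features would need to be encoded first (for example one-hot encoded) before they can be passed to the model.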

Quick function to drop duplicated columns in Pandas DataFrame

Pandas has a nice function that checks for and drops duplicated rows in a data frame, but it does not work for dropping duplicated columns directly. A quick workaround is to transpose the data frame first, drop the duplicated rows, and then transpose again. However, as the data frame gets larger, this method not only takes a long time but eventually breaks. I tried it on a data frame of size 100,000 by 41 and got a runtime error.

Below is a little function I wrote to find and drop duplicated columns of a Pandas data frame.
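
One way such a function could look, as a sketch rather than the original listing: it avoids the transpose by comparing columns pairwise with `Series.equals`, and only within groups that share a dtype. The names `find_duplicate_columns` and `drop_duplicate_columns` are illustrative, and the sketch assumes the column names are unique.

```python
from collections import defaultdict

import pandas as pd


def find_duplicate_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose values repeat an earlier column."""
    # Columns with different dtypes are never treated as duplicates here,
    # so group the names by dtype before comparing.
    by_dtype = defaultdict(list)
    for name in df.columns:
        by_dtype[str(df[name].dtype)].append(name)

    duplicates = []
    for names in by_dtype.values():
        for i, first in enumerate(names):
            if first in duplicates:
                continue
            for other in names[i + 1:]:
                # Series.equals treats NaNs in the same position as equal.
                if other not in duplicates and df[first].equals(df[other]):
                    duplicates.append(other)
    return duplicates


def drop_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df keeping only the first of each set of identical columns."""
    return df.drop(columns=find_duplicate_columns(df))


example = pd.DataFrame(
    {"a": [1, 2, 3], "b": [1, 2, 3], "c": [1.0, 2.0, 3.0], "d": ["x", "y", "z"], "e": ["x", "y", "z"]}
)
print(find_duplicate_columns(example))
print(list(drop_duplicate_columns(example).columns))
```

In the example, `c` holds the same numbers as `a` but as floats, so it is kept; whether that should count as a duplicate is a design choice.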

A quick post on Decision Trees

Today is my one-year anniversary of studying at Bentley University in MA, US. This short post on this special day is devoted to Decision Trees, a simple but often misused method. During my short time here, and in the limited time I have spent talking to marketing analysts, I saw some cases where they just take the rules directly from decision trees, throw them into decks, and say that they will show them to executives as "insights". I think they should be a bit more careful, since a decision tree can be rather unstable, i.e. small changes in the dataset can result in completely different trees.

Below are some Python scripts which I used to build classification trees on the iris data. Each time, I do some sort of random sampling and feed the sample to the algorithm to build a tree. You can look at the difference yourself.
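
A minimal sketch of the idea, assuming scikit-learn: draw a random 70 percent sample of the iris data with a given seed, fit an entropy-based tree, and print the rule at the root. The seeds and the 70 percent fraction are taken from the figure names in this post; the sampling via `train_test_split` and the helper name `root_split` are assumptions, so the exact rules printed may differ from the pictures.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def root_split(seed, train_fraction=0.7):
    """Fit a tree on a random sample of iris and return its root feature and threshold."""
    iris = load_iris()
    X_train, _, y_train, _ = train_test_split(
        iris.data, iris.target, train_size=train_fraction, random_state=seed
    )
    tree = DecisionTreeClassifier(criterion="entropy", random_state=seed)
    tree.fit(X_train, y_train)
    # Node 0 is the root; tree_.feature and tree_.threshold describe its split.
    feature_index = tree.tree_.feature[0]
    threshold = tree.tree_.threshold[0]
    return iris.feature_names[feature_index], float(threshold)


for seed in (36, 1106):
    name, threshold = root_split(seed)
    print(f"seed {seed}: root rule is {name} <= {threshold:.2f}")
```

Running it over more seeds is an easy way to see how often the root rule changes from one sample to the next.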

I will just upload pictures of a few of these trees.

[Figures: iris_dt_entropy_70_percent_train_seed_36, iris_dt_entropy_70_percent_train_seed_1106]

The two trees above are built on different random samples of the iris dataset. From the first one you get the rule: if petal length is less than or equal to 2.45 cm, then the flower is a setosa. From the second tree you get the rule: if petal width is less than or equal to 0.8 cm, then the flower is a setosa. Take a quick look at the scatter plot and you will see that both are pretty good rules, and if you build only one decision tree, you will miss one of them.

[Figure: colored_scatter_plot_iris]

Below are two more trees built on different samples of the iris dataset. I personally would not try to interpret the splitting rules from the third level of the tree onward (taking the root node as the first level).



[Figures: iris_dt_entropy_num_per_class_15_45_5, iris_dt_entropy_num_per_class_15_45_10]

Anyway, the takeaway is this: try different algorithms, try different samples as input, build many trees and take what is common to them, rather than simply building one tree.