Category Archives: Python

Pass more than one arguments to Pool.map

I am recently tackling this year's kdd cup competition. I was trying to speed up my code to fit Prophet model on multiple time series from a pandas dataframe using python's multiprocessing module. Below is an example how the map function of Pool class from multiprocessing module works:

Basically, Pool(8) creates a process pool object with 8 processes. p.map(square, range(16)) chop the iterable into 8 pieces and assign them to the 8 processes. For each process in the pool, it applies the function square on each element(in this case 2 in total) in the smaller iterable assigned to it. The results were collected into the results object.

One possible way for me to use this mechanism to fit my models is to prepare the data into smaller dataframes with columns ['ds', 'y'], collect these smaller dataframes into a list and map a Prophet wrapper function to this list using a pool of processes. The wrapper function would look like this:

This would work, but it would require duplicated computation on future time_stamps and number of changepoints which are the same for all my time series. Ideally, I want to be able to calculate them once and share the value. One obvious way to do so, is to change my run list from a collection of dataframes to a list of tuples like (df, n_points, holidays, future). Or another method I ended up using is to use partial.

 

Python Matrix Transpose oneliner two ways

I was asked to write some python code to transpose a matrix (represented as a list of lists) during a recent interview. I quickly came up with a oneliner using list comprehension like this:

It works, but I was asked to make it shorter and I wasn't able to think of a way to do so in front of the whiteboard. So I have been thinking about this little exercise since then, and finally came up with one that is pretty elegant.

The way it works is that: *M is taken in as the *args in zip() thus unpacked as zip([0,1,0,0], [1,1,1,0], [0,1,0,0], [1,1,0,0]) .

 

Build Tensorflow from source with GPU support after the meltdown kernel patch

So my google could compute instance with Nvidia-docker which I used to train deep learning models was suddenly not working a couple of days ago, and the reason seems to related to the recent Ubuntu kernel update that was intended to solve the meltdown issue. I found the solution is to install a different kernel and also build from source. As a reminder for myself, here are the steps:

  1. Get a newer kenel:
  2. Install dependencies for build tensorflow, also install cuda if not already. Remeber to reboot after install cuda.
  3.  Might need to do the following
  4. Check your GPU with
  5. Install cuDNN, go to https://developer.nvidia.com/cudnn and download the files, scp them to the machine and install :
  6. Get libcupti and Brazel
  7. Get Tensorflow source and build with gpu support and finally install
  8. You are good to go

Kaggle Digit Recognizer Revisited (Using Convolutional NN with Keras)

Almost a year ago, I revisited the Kaggle version of the Hand Written Digit Recognition problem, the link to that post is here. At that time, my go to language is R, since the majority of friends around me use R as well. This summer, I evidently switched back to use python as my primary language to do almost everything, because it is just so efficient.

So, here is a convolutional neural network using Keras to tackle this problem again, in less than 100 lines of code you can get a convolutional neural network and obtain 99% accuracy on the Kaggle leaderboard.

A quick note about training time, it took close to 9 minutes to be trained on my laptop with GeForce GTX 970M chip. You can increase the number of epochs and run it by yourself, it should be able to lead to better results.

 

A Logistic Regression Benchmark for Red Hat Customer Business Value Prediction Problem

Red Hat put out a competition on Kaggle asking people to build models to predict customer potential. It is a simple binary classification problem and the metric to this problem that Red Hat wanted to determine which model rank best is the AUC score.

I am sort of late in participating in this competition, and there are only 7 days to go. I sketched a rather simple logistic regression model, and it ranks somewhere in the middle among 2,200 teams in total. Kind of surprised to see that a simple logistic regression can beat half of the participants.

My model uses all the features and I find out the penalty strength parameter C should take on value 10.

Below is my code: