Recently, I have been working on the speech command recognition competition on Kaggle (https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/) by Google, and I got $500 in Google Cloud Platform credits. Here are rough instructions on how I set up my VM to run deep learning experiments on GCP.
- First, request a quota increase so you can use a GPU. I requested one NVIDIA Tesla K80 in us-east1.
- Create a VM instance in the location you requested quota for, customize it to attach the GPU, and configure SSH access. I used Ubuntu 16.04 as the OS.
- Log in to the VM and install the CUDA driver:
    curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
    sudo dpkg -i ./cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
    sudo apt-get update
    sudo apt-get install cuda -y
- Install Docker Community Edition:
    sudo apt-get -y install apt-transport-https ca-certificates curl software-properties-common
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
    sudo apt-get -y update
    sudo apt-get -y install docker-ce
- Install nvidia-docker:
    wget https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
    sudo dpkg -i nvidia-docker*.deb
- Fire up bash in the container and you are pretty much good to go. Note that /notebooks is the default landing directory in the container, so mount onto it the directory on your GCP VM that you want to share. That way the container can read your training data and write results to your VM disk.
    sudo nvidia-docker run -it -v /home/<gcpusername>/<folder_want_to_share_with_container>/:/notebooks tensorflow/tensorflow:latest-gpu bash
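Before starting a long training run, it may be worth confirming that both the host and the container can see the GPU. The sketch below is not part of my original setup: `count_gpus` is a hypothetical helper name, and it assumes the usual `nvidia-smi -L` output with one `GPU <index>: ...` line per device.

```shell
# Hypothetical helper: count the GPUs listed on stdin.
# Expects the output of `nvidia-smi -L`, which prints one line per device,
# for example "GPU 0: Tesla K80 (UUID: ...)".
count_gpus() {
  awk '/^GPU [0-9]+:/ { n++ } END { print n + 0 }'
}
```

On the VM, `nvidia-smi -L | count_gpus` checks the host driver, and `sudo nvidia-docker run --rm tensorflow/tensorflow:latest-gpu nvidia-smi -L | count_gpus` should do the same from inside the container, assuming the image exposes `nvidia-smi` through nvidia-docker. I would expect both to print 1 for a single K80. If only the second one prints 0, the problem is more likely in the nvidia-docker setup than in the driver.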
That's pretty much it. On a side note, I found it strange that my code actually ran slower on my GCP VM using Docker than on my home PC with just a 1070 card. I suspect the CPU: my GCP VM has an older Haswell CPU (I tried to provision a Skylake one, but the GCP portal kept telling me there were not enough resources to create one for me...), and the data augmentation in my data generator runs on the CPU. If that is the case, the more powerful Tesla K80 sits idle waiting for batches to come in, and it is totally the fault of my crappy code...
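One way to test the idle-GPU suspicion is to sample GPU utilization while training runs. This is a minimal sketch that I did not run at the time: `mean_gpu_util` is a hypothetical helper, and it assumes one integer percentage per line on stdin, which is what `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -l 1` prints for a single GPU.

```shell
# Hypothetical helper: average the GPU utilization samples read from stdin.
# Expects one integer percentage per line, as printed by
#   nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -l 1
mean_gpu_util() {
  awk '
    NF { sum += $1; n++ }   # skip blank lines, accumulate the rest
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "%.1f\n", sum / n
    }'
}
```

With training running in the container, open a second SSH session on the VM and run `timeout 60 nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -l 1 | mean_gpu_util`. A low average, together with busy CPU cores in `top`, would support the theory that the data generator is the bottleneck rather than the GPU.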
It has been a really long time since I last posted anything here, but I am thinking about posting more often.
In case you are curious about the pricing, here is a screenshot of my current billing page. My VM instance has 8 CPU cores, 30 GB RAM, a 128 GB SSD disk and of course an NVIDIA Tesla K80.
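I created the VM through the web console, but an instance with roughly these specs could probably also be created from the command line. The sketch below is an assumption on my part, not what I actually ran: `create_gpu_vm` is a hypothetical wrapper, the instance name and zone are placeholders you supply, `n1-standard-8` is my guess at the matching machine type (8 vCPUs, 30 GB RAM), and the image family and GPU type may no longer be offered.

```shell
# Hypothetical wrapper around `gcloud compute instances create`.
# Usage: create_gpu_vm <instance-name> <zone>
# Set GCLOUD=echo to print the command instead of creating anything.
create_gpu_vm() {
  local name="$1"
  local zone="$2"
  # GPU instances cannot live-migrate, hence --maintenance-policy TERMINATE.
  "${GCLOUD:-gcloud}" compute instances create "$name" \
    --zone "$zone" \
    --machine-type n1-standard-8 \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --maintenance-policy TERMINATE \
    --image-family ubuntu-1604-lts \
    --image-project ubuntu-os-cloud \
    --boot-disk-size 128GB \
    --boot-disk-type pd-ssd
}
```

For a dry run, set `GCLOUD=echo` and call `create_gpu_vm my-vm <zone>` to review the full command first. The zone has to be one where your K80 quota applies.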