Load Google Drive CSV into Pandas DataFrame for Google Colaboratory

February 9, 2018

Researching platforms for running common machine learning algorithms on advertising keyword data, I found Google Colaboratory. It’s essentially a machine learning environment with accessible Jupyter notebooks…but without all the headaches of setting up your own dev environment. Mainly for educational and research purposes, I wanted to try it out with a small data set.

Of course, it’s never quite that easy. Right out of the gate, I needed to figure out how to load a train.csv and test.csv stored in Google Drive into the Jupyter notebook, and more specifically into a Pandas DataFrame. There are instructions out there, but they are intermingled with other things I didn’t care about.

Here is a quick rundown of the Jupyter code to load existing CSV files stored in Google Drive into Google Colab.

import tensorflow as tf
tf.test.gpu_device_name()

This loads up TensorFlow and displays which GPU is being used. If it returns nothing, go to Runtime -> Change Runtime Type, change Hardware accelerator to GPU, and hit Save. Then re-run the code above. The result should be:

'/device:GPU:0'

If so, you are ready to move on.
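For a fuller picture of the hardware, TensorFlow’s device_lib module can enumerate every device the runtime sees. It’s an internal module rather than a stable public API, so treat this as an optional extra check:

# Optional: list every device TensorFlow can see, CPU and GPU alike.
# device_lib is internal to TensorFlow, so this is a convenience,
# not a guaranteed-stable API.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
  print(device.name, device.device_type)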

!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

The above code installs PyDrive, which is used to access Google Drive, and kicks off the process of authorizing the notebook running in the Google Colaboratory environment to touch your files. When it runs, you will be presented with a link to click, which asks you to verify that Google Colab can access Google Drive and gives you a unique key. Enter the key back in the notebook.
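If you want to confirm the client is wired up before touching files, PyDrive exposes a GetAbout() call that returns account metadata as a dict. Printing the account name makes a simple smoke test; the exact fields come from the Drive v2 about resource, so adjust if your response differs:

# Optional smoke test: confirm the drive client is authorized.
# GetAbout() returns the Drive v2 'about' resource as a dict;
# 'name' is the current user's name in that resource.
about = drive.GetAbout()
print(about['name'])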

file_list = drive.ListFile({'q': "'<FOLDER ID>' in parents and trashed=false"}).GetList()
for file1 in file_list:
  print('title: %s, id: %s' % (file1['title'], file1['id']))

This code assumes your CSV files are in a folder. It prints out the files in that folder and their unique identifiers, which are used below. Replace <FOLDER ID> with the long string of numbers and letters in the URL of the folder in Google Drive. If the files are located at the top level of Google Drive, use ‘root’ instead. The output should look like:

title: train.csv, id: <TRAIN_FILE_ID>
title: test.csv, id: <TEST_FILE_ID>

Note the ids for train.csv and test.csv; they are needed in the next step.
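Rather than copying the ids by hand, you can build a small lookup from title to id over the same file_list. This is just a convenience on top of the output above; train.csv and test.csv are the example filenames from this post:

# Map each file's title to its id so files can be looked up by name.
ids_by_title = {f['title']: f['id'] for f in file_list}
train_id = ids_by_title['train.csv']
test_id = ids_by_title['test.csv']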

train_downloaded = drive.CreateFile({'id': '<TRAIN_FILE_ID>'})
train_downloaded.GetContentFile('train.csv')
test_downloaded = drive.CreateFile({'id': '<TEST_FILE_ID>'})
test_downloaded.GetContentFile('test.csv')

Now the files get pulled into Google Colab. GetContentFile saves each file to the local environment under the name you pass in.
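If the folder holds more than a couple of files, the same two calls can run in a loop over the file_list from earlier. The .csv filter here is plain string matching, not part of the PyDrive API:

# Download every CSV in the folder, saving each under its Drive title.
for f in file_list:
  if f['title'].endswith('.csv'):
    downloaded = drive.CreateFile({'id': f['id']})
    downloaded.GetContentFile(f['title'])
    print('Saved %s' % f['title'])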

import pandas as pd
import numpy as np
df_train = pd.read_csv('train.csv')
df_train

Now comes the easy part. Since the files have been saved to the local environment, load a saved file by its filename into a DataFrame and display it to verify.
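If the data set is at all large, printing the whole DataFrame is noisy. A quicker sanity check is the standard pandas shape attribute and head() call:

# Quick verification: row/column counts and the first five rows.
print(df_train.shape)
df_train.head()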

df_test = pd.read_csv('test.csv')
df_test

Repeat as needed.
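If you end up doing this often, the download-and-load steps collapse naturally into one helper. This is a sketch under the setup above; load_drive_csv is my own name, not part of PyDrive or pandas:

# Download a CSV from Google Drive by id and load it into a DataFrame.
# (Hypothetical helper; assumes the authorized `drive` client and the
# pandas import from above.)
def load_drive_csv(drive, file_id, filename):
  f = drive.CreateFile({'id': file_id})
  f.GetContentFile(filename)  # save locally under filename
  return pd.read_csv(filename)

df_train = load_drive_csv(drive, '<TRAIN_FILE_ID>', 'train.csv')
df_test = load_drive_csv(drive, '<TEST_FILE_ID>', 'test.csv')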