
Synchronize Google Drive to your hard disk using Python

This tutorial demonstrates how you can easily synchronize your GDrive (or particular folders of it) onto your hard drive. Initially this was implemented because one of our projects required a build system for software that depended on large media files stored in a GDrive folder. Our requirement was simple: automatically pull the latest files from the server before building the software. And because we are dealing with large amounts of data, it would be highly inefficient to download all files every single time. Instead we wanted a solution that only updates files if they have changed.

Here we only cover the downstream link, meaning the script will pull changes from GDrive to your hard disk but not the other way round. The implementation uses the Google Drive API, so in order to get started we first need to activate the API for your GDrive.

Setting up the Authentication

In order to do so, please follow the description in the following article:

TODO

This will leave you with an API key in JSON format. All you need to do is rename this file to client_secrets.json and put it into your current working directory. Authentication with GDrive is made very simple by the pydrive package. Downloading items can be done with the apiclient module, which ships as part of the google-api-python-client package. We use pip to install everything we need:

$ pip install pydrive google-api-python-client colorama

Let’s start by importing the libraries that we need:

import os
import sys
import errno
import hashlib
from pathlib import Path
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from apiclient import http

Let’s also add some color to the output:

from colorama import init, Fore, Style
init()

Using pydrive, the authentication with GDrive can be done in only three lines of code:

gauth = GoogleAuth()
gauth.LocalWebserverAuth() # client_secrets.json is automatically picked up from the same directory
drive = GoogleDrive(gauth)

If you run this script, a browser window should pop up, asking you to sign in to your Google account and authorize the script.
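
As a side note: this flow runs on every start of the script. If you do not want to re-authorize each time, pydrive can also cache the credentials in a file. A minimal sketch (the filename credentials.json is our own choice):

gauth = GoogleAuth()
gauth.LoadCredentialsFile("credentials.json")
if gauth.credentials is None:
    gauth.LocalWebserverAuth() #first run: sign in via the browser
elif gauth.access_token_expired:
    gauth.Refresh() #refresh the expired token
else:
    gauth.Authorize() #re-use the cached credentials
gauth.SaveCredentialsFile("credentials.json")
drive = GoogleDrive(gauth)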

A bit of background before we start

To understand how we can clone GDrive content into a local directory, we need to understand a little bit about how GDrive works. GDrive does not have a filesystem like the one on your computer. Instead, every file or folder has a unique identifier, the file or folder ID, which uniquely identifies that item on your drive.

Files only reference their parents, which means you can query files based on a folder ID, for example, but not based on a full path. Each GDrive has a root folder that we can query; from there we have to step through the hierarchy level by level.

For example, say we want to get the file ID of the following file:

/Project Sync/GDrive_Sync.py

Now, in order to get the file ID of GDrive_Sync.py, we first have to query the root to get the ID of the folder Project Sync, and then query that folder to get the file ID we are looking for.
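
With pydrive, that first lookup could look roughly like this (a minimal sketch; Project Sync is just our example folder):

#query all non-trashed items directly under the drive root
fileList = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()

#pick out the folder we are after by its title
for file in fileList:
    if file['title'] == 'Project Sync':
        folderID = file['id'] #use this ID to query the folder's children next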

Cloning the hierarchy

We will be (re-)creating a lot of folders, so let's start by defining a function that creates a directory on our local drive. This simply wraps the error handling so we don't have to repeat it:

#create a local folder and ignore the error if it already exists
def createFolder(path):
    try:
        os.mkdir(path)
    except OSError as e:
        if e.errno == errno.EEXIST:
            pass #If the folder exists that's fine
        else:
            raise

Now, you might not want to clone the whole GDrive but only a particular folder. This means we first have to get the folder ID of the folder we want to clone. To keep the script platform independent, it is highly advisable to use pathlib for all paths:

#We clone the drive into the folder DriveSync, relative to the script
ROOTDIR = (Path(__file__).parent / "DriveSync").absolute()
createFolder(ROOTDIR)

#we define the GDrive folder to clone
REMOTEDIR = "Project Sync/Clone Me"
remoteHierarchy = REMOTEDIR.split("/")

#query the content of the root of the GDrive
fileList = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()

#get the fileID of the target folder
for folder in remoteHierarchy:
    found = False

    #find the target folder in the current folder
    for file in fileList:
        if folder == file['title']:

            found = True
            fileID = file['id']

            #get the content of the folder for the next iteration
            fileList = drive.ListFile({'q': "'" + fileID + "' in parents and trashed=false"}).GetList()

            #re-create the folder structure locally
            ROOTDIR = ROOTDIR / folder
            createFolder(str(ROOTDIR))
            break

    if not found:
        print("could not locate folder: " + str(folder))
        sys.exit(0)

This piece of code recreates the remote folder structure up to the target path we specified in the beginning. In this example it fetches the folder ID of the GDrive folder Project Sync/Clone Me, and the local target folder ./DriveSync/Project Sync/Clone Me/ is created.

Now, before we can step through the hierarchy and download files, we need to implement a function that downloads a single file:

#Download the file with the given ID and store it under the specified filename
def downloadFile(fileID, filename):

    #get the request for the remote file
    request = drive.auth.service.files().get_media(fileId=fileID)

    try:
        #open the local target file
        with open(filename, "wb") as fd:

            #create a downloader instance and fetch the file chunk by chunk
            downloader = http.MediaIoBaseDownload(fd, request)
            done = False

            #loop until the download has finished
            while not done:
                status, done = downloader.next_chunk()
                print("    Download %d%%." % int(status.progress() * 100))
            return True
    except Exception as e:
        print(Fore.YELLOW + "    Error while downloading file: " + str(e) + Style.RESET_ALL)
        return False

We don't want files that already exist and are up to date to be downloaded again. Therefore we need to determine whether a file has changed. The GDrive API gives us an MD5 checksum for each file, so we can compute the same checksum locally and compare the two:

#Compare the checksum for a given file, return True if it is the same
def compareChecksum(md5, filename):
    #calculate the checksum of the local file - if it exists
    try:
        with open(filename, 'rb') as f:
            checksum = hashlib.md5(f.read()).hexdigest()
    except OSError:
        return False

    #compare local and remote checksum
    if md5 == checksum:
        return True
    else:
        print(Fore.YELLOW + "    File has changed ... updating" + Style.RESET_ALL)
        return False
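
A side note on large files: the code above reads the whole file into memory just to hash it. If that becomes a problem, a chunked variant (our own optional replacement, not part of the original script) keeps memory usage constant:

#hash the file in 1 MiB blocks instead of reading it at once
def md5OfFile(filename, blocksize=1 << 20):
    h = hashlib.md5()
    with open(filename, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.hexdigest()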

Let’s combine both of these functions:

#download a file only if there is no matching file already
def processFile(file, filename):
    success = False
    #first we compare the checksum
    if not compareChecksum(file["md5Checksum"], filename):
        #give it three attempts to download the file
        for i in range(3):
            #download the file based on its ID
            if downloadFile(file['id'], filename):

                #compare the checksum again
                if compareChecksum(file["md5Checksum"], filename):
                    success = True
                    print(Fore.GREEN + "    file updated" + Style.RESET_ALL)
                    break

    else:
        print(Fore.GREEN + "    file up to date" + Style.RESET_ALL)
        success = True
    return success

Now let's create two queues to memorize our position while we step through the hierarchy:

#We need two queues to memorize the current path - remote and local
remote_queue = [fileID]
local_queue = [ROOTDIR]

#process all items in the remote queue
while len(remote_queue) != 0:

    #get the folder id
    current_folder_id = remote_queue.pop(0)
    file_list = drive.ListFile({'q': "'{}' in parents and trashed=false".format(current_folder_id)}).GetList()

    current_parent = local_queue.pop(0)
    print(str(current_parent), current_folder_id)

    #iterate through all the files in the remote directory
    for file1 in file_list:

        #generate the filename for the local filesystem
        filename = current_parent / file1['title']

        if file1['mimeType'] == 'application/vnd.google-apps.folder':
            #if the file is a folder, add another folder to process
            remote_queue.append(file1['id'])
            local_queue.append(filename)
            createFolder(filename)
        else:
            #if it is a file, process it (compare checksum, download if necessary)
            print("processing file: " + file1['title'])
            if not processFile(file1, filename):
                print(Fore.RED + "    Unable to Download File ... skipping" + Style.RESET_ALL)

And that is all we have to do. This script will automatically clone an entire GDrive folder structure onto your local disk. If a file already exists locally, the checksums of the remote and local file are compared; if they differ, the file is updated, otherwise nothing is done.

Possible improvements

As you would expect from a purely sequential method like this one, there is room for performance improvements. Downloading and synchronizing one file at a time is not the fastest approach; a multithreaded solution might be many times faster. You would, however, have to make sure that you do not exceed the rate limits of the GDrive API, otherwise Google might simply reject your requests.
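
A rough sketch of how that could look (our own untested idea, not part of the script above): collect the (file, filename) pairs during the tree walk instead of processing them immediately, then hand them to a thread pool. Note that the underlying HTTP client is not thread-safe, so a robust version would create a separate Drive service per worker thread:

from concurrent.futures import ThreadPoolExecutor

#process the collected download jobs with a pool of worker threads
def processAll(jobs, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: processFile(*job), jobs))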

Featured Image by AltumCode on Unsplash
