
Frequent HTTP 500 Internal Errors with Google Drive drive.files.get API

We have a service that depends heavily on Google Drive (it uses the Python SDK). The service walks through Google Drive collections and files.

Checking the production logs, we found many HTTP 500 Internal Server Errors when calling Google Drive's drive.files.get API endpoint. The error rate is about 0.5%. In the worst case I found, there were 9 HTTP 500 failures in a row within one hour.

The exceptions look like this:

  File "/home/xxxxxx/xxxxxxx/storage.py", line 1185, in get_file
    gdrive_file = self.client.files().get(fileId='0Bxn2GmQxR4zHYlNvaUlFNjl6MkE', fields='id,title,modifiedDate,createdDate,fileSize,mimeType,downloadUrl,labels').execute()
  File "/usr/lib/python2.7/dist-packages/apiclient/http.py", line 389, in execute
    raise HttpError(resp, content, self.uri)
HttpError: <HttpError 500 when requesting https://www.googleapis.com/drive/v2/files/0Bxn2GmQxR4zHYlNvaUlFNjl6MkE?fields=id%2Ctitle%2CmodifiedDate%2CcreatedDate%2CfileSize%2CmimeType%2CdownloadUrl%2Clabels&alt=json returned "Internal Error">

Our service is hosted on Amazon Web Services, in the US West 2 data center.

Has anyone had a similar issue? Any help is appreciated.

evanchin asked Sep 18 '12 05:09



1 Answer

Because Google's infrastructure is complex, large-scale, and distributed, a 0% error rate is close to impossible. Servers or hard disks die mid-request, internal calls between servers time out unexpectedly, datacenters suffer outages or load spikes, there are attempted DoS attacks, misbehaving applications, and so on. Any of these can raise the rate of 500 errors. As a general good practice, you should implement an exponential backoff and retry strategy on your end whenever you work with Web APIs; it is almost mandatory if you want to offer a reliable service. It can also smooth out temporary network glitches and similar problems on your side.
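To make the backoff-and-retry advice concrete, here is a minimal sketch in Python. The helper names (`call_with_backoff`, `is_server_error`) and the defaults (5 attempts, 1 second base delay, 32 second cap) are illustrative assumptions, not part of the Drive SDK. The sketch also assumes the raised exception exposes `resp.status`, as the `HttpError` in the traceback above does.

```python
import random
import time


def call_with_backoff(request_fn, is_retryable, max_attempts=5,
                      base_delay=1.0, max_delay=32.0, sleep=time.sleep):
    """Call request_fn(), retrying with exponential backoff plus jitter.

    All names and default values here are hypothetical; tune them for
    your own service.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception as exc:
            # Give up on non-retryable errors or once attempts run out.
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at
            # max_delay, plus up to one second of random jitter.
            delay = min(base_delay * (2 ** attempt), max_delay)
            sleep(delay + random.uniform(0, 1))


def is_server_error(exc):
    """Treat HTTP 5xx as transient.

    Assumes the exception carries the HTTP response as exc.resp with a
    numeric status attribute.
    """
    status = getattr(getattr(exc, "resp", None), "status", None)
    return status is not None and 500 <= int(status) < 600
```

In the question's code, this could wrap the failing call, for example `call_with_backoff(lambda: self.client.files().get(fileId=file_id, fields=fields).execute(), is_server_error)`, where `file_id` and `fields` stand in for the real arguments. Client errors such as 404 are re-raised immediately, since retrying them would not help. Later releases of the Google API Python client also accept a `num_retries` argument to `execute()`; whether the version shown in the traceback supports it is not something the question confirms.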

Now, 0.5% is a bit high. I believe the global error rate is lower on average, but I'm going to bring this up with the Drive team so they can investigate and try to reduce it (sometimes it's just a matter of increasing a timeout to one of our server dependencies). We're always making passes at reducing the error rate, but sometimes we have to spend time building new features, especially when the products are rather new :)

Nicolas Garnier answered Nov 06 '22 16:11
