Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using Drive API / DriveApp to convert from PDFs to Google Documents

This problem has been successfully resolved. I am editing my post to document my experience for posterity and future reference.

The Task

I have 117 PDF files (average size ~238 KB) uploaded to Google Drive. I want to convert them all to Google Docs and keep them in a different Drive folder.

The Problem

I attempted to convert the files using Drive.Files.insert. However, under most circumstances, only 5 files could be converted this way before the function expires prematurely with this error

Limit Exceeded: DriveApp. (line #, file "Code")

where the line referenced above is when the insert function is called. After calling this function for the first time, subsequent calls typically failed immediately with no additional google doc created.

Approach

I used 3 main ways to achieve my goal. One was using the Drive.Files.insert, as mentioned above. The other two involved using Drive.Files.copy and sending a batch of HTTP requests. These last two methods were suggested by Tanaike, and I recommend reading his answer below for more information. The insert and copy functions are from Google Drive REST v2 API, while batching multiple HTTP requests is from Drive REST v3.

With Drive.Files.insert, I experienced issues dealing with execution limitations (explained in the Problem section above). One solution was to run the functions multiple times. And for that, I needed a way to keep track of which files were converted. I had two options for this: using a spreadsheet and a continuation token. Therefore, I had 4 different methods to test: the two mentioned in this paragraph, batching HTTP requests, and calling Drive.Files.copy.

Because team drives behave differently from regular drives, I felt it necessary to try each of those methods twice, one in which the folder containing the PDFs is a regular non-Team Drive folder and one in which that folder is under a Team Drive. In total, this means I had 8 different methods to test.

These are the exact functions I used. Each of these was used twice, with the only variations being the ID of the source and destination folders (for reasons stated above):

Method A: Using Drive.Files.insert and a spreadsheet

function toDocs() {
  var sheet = SpreadsheetApp.openById(/* spreadsheet id*/).getSheets()[0];
  var range = sheet.getRange("A2:E118");
  var table = range.getValues();
  var len = table.length;
  var resources = {
    title: null,
    mimeType: MimeType.GOOGLE_DOCS,
    parents: [{id: /* destination folder id */}]
  };
  var count = 0;
  var files = DriveApp.getFolderById(/* source folder id */).getFiles();
  while (files.hasNext()) {
    var blob = files.next().getBlob();
    var blobName = blob.getName();
    for (var i=0; i<len; i++) {
      if (table[i][0] === blobName.slice(5, 18)) {
        if (table[i][4])
          break;
        resources.title = blobName;
        Drive.Files.insert(resources, blob);  // Limit Exceeded: DriveApp. (line 51, file "Code")
        table[i][4] = "yes";
      }
    }

    if (++count === 10) {
      range.setValues(table);
      Logger.log("time's up");
    }
  }
}

Method B: Using Drive.Files.insert and a continuation token

function toDocs() {
  var folder = DriveApp.getFolderById(/* source folder id */);
  var sprop = PropertiesService.getScriptProperties();
  var contToken = sprop.getProperty("contToken");
  var files = contToken ? DriveApp.continueFileIterator(contToken) : folder.getFiles();
  var options = {
    ocr: true
  };
  var resource = {
    title: null,
    mimeType: null,
    parents: [{id: /* destination folder id */}]
  };

  while (files.hasNext()) {
    var blob = files.next().getBlob();
    resource.title = blob.getName();
    resource.mimeType = blob.getContentType();
    Drive.Files.insert(resource, blob, options);  // Limit Exceeded: DriveApp. (line 113, file "Code")
    sprop.setProperty("contToken", files.getContinuationToken());
  }
}

Method C: Using Drive.Files.copy

Credit for this function goes to Tanaike -- see his answer below for more details.

function toDocs() {
  var sourceFolderId = /* source folder id */;
  var destinationFolderId = /* destination folder id */;
  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  while (files.hasNext()) {
    var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
    Logger.log(res) 
  }
}

Method D: Sending batches of HTTP requests

Credit for this function goes to Tanaike -- see his answer below for more details.

function toDocs() {
  var sourceFolderId = /* source folder id */;
  var destinationFolderId = /* destination folder id */;

  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  var rBody = [];
  while (files.hasNext()) {
    rBody.push({
      method: "POST",
      endpoint: "https://www.googleapis.com/drive/v3/files/" + files.next().getId() + "/copy",
      requestBody: {
        mimeType: "application/vnd.google-apps.document",
        parents: [destinationFolderId]
      }
    });
  }
  var cycle = 20; // Number of API calls at 1 batch request.
  for (var i = 0; i < Math.ceil(rBody.length / cycle); i++) {
    var offset = i * cycle;
    var body = rBody.slice(offset, offset + cycle);
    var boundary = "xxxxxxxxxx";
    var contentId = 0;
    var data = "--" + boundary + "\r\n";
    body.forEach(function(e){
      data += "Content-Type: application/http\r\n";
      data += "Content-ID: " + ++contentId + "\r\n\r\n";
      data += e.method + " " + e.endpoint + "\r\n";
      data += e.requestBody ? "Content-Type: application/json; charset=utf-8\r\n\r\n" : "\r\n";
      data += e.requestBody ? JSON.stringify(e.requestBody) + "\r\n" : "";
      data += "--" + boundary + "\r\n";
    });
    var options = {
      method: "post",
      contentType: "multipart/mixed; boundary=" + boundary,
      payload: Utilities.newBlob(data).getBytes(),
      headers: {'Authorization': 'Bearer ' + ScriptApp.getOAuthToken()},
      muteHttpExceptions: true,
    };
    var res = UrlFetchApp.fetch("https://www.googleapis.com/batch", options).getContentText();
//    Logger.log(res); // If you use this, please remove the comment.
  }
}

What Worked and What Didn't

  • None of the functions using Drive.Files.insert worked. Every function using insert for conversion failed with this error

    Limit Exceeded: DriveApp. (line #, file "Code")

    (line number replaced with generic symbol). No further details or description of the error could be found. A notable variation was one in which I used a spreadsheet and the PDFs were in a team drive folder; while all other methods failed instantly without converting a single file, this one converted 5 before failing. However, when considering why this variation did better than the others, I think it was more of a fluke than any reason related to the use of particular resources (spreadsheet, team drive, etc.)

  • Using Drive.Files.copy and batch HTTP requests worked only when the source folder was a personal (non-Team Drive) folder.

  • Attempting to use the copy function while reading from a Team Drive folder fails with this error:

    File not found: 1RAGxe9a_-euRpWm3ePrbaGaX5brpmGXu (line #, file "Code")

    (line number replaced with generic symbol). The line being referenced is

    var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
    
  • Using batch HTTP requests while reading from a Team Drive folder does nothing -- no doc files are created and no errors are thrown. Function silently terminates without having accomplished anything.

Conclusion

If you wish to convert a large number of PDFs to google docs or text files, then use Drive.Files.copy or send batches of HTTP requests and make sure that the PDFs are stored in a personal drive rather than a Team Drive.


Special thanks to @tehhowch for taking such an avid interest in my question and for repeatedly coming back to provide feedback, and to @Tanaike for providing code along with explanations that successfully solved my problem (with a caveat, read above for details).

like image 355
Manuel Avatar asked Mar 16 '18 04:03

Manuel


People also ask

How do I convert a PDF to Google Docs from Google Drive?

Right-click on the PDF file in the file list and select Open With > Google Docs. The PDF file is converted to a Google Doc containing editable text. Notice that the Google Docs version of the file still has the . pdf extension on it, so the files have the same name in the list.

Can you automatically convert uploaded files to Google Docs format?

You can choose to automatically convert any document that you upload to Google Drive. To do that, click the gear icon on the top right-hand side of the screen and choose Settings. Place a check mark next to Convert uploaded files to Google Docs editor format and then click Done.

How do I convert a PDF to a Google Doc and keep formatting?

Navigate to the PDF you want to convert and click Open. You will see that the converted Doc file will have the same formatting, spacing, indentations, etc. Now, click on Enable Editing at the top of the window. Go to File and select Save AS and save the file.


2 Answers

You want to convert from PDF files in the folder to Google Documents. PDF files are in a folder of team drive. You want to import converted them to a folder of your Google Drive. If my understanding is correct, how about this method?

For the conversion from PDF to Google Document, it can convert using not only Drive.Files.insert(), but also Drive.Files.copy(). The advantage of use of Drive.Files.copy() is

  • Although Drive.Files.insert() has the size limitation of 5 MB, Drive.Files.copy() can use over the size of 5 MB.
  • In my envoronment, the process speed was faster than Drive.Files.insert().

For this method, I would like to propose the following 2 patterns.

Pattern 1 : Using Drive API v2

In this case, Drive API v2 of Advanced Google Services is used for converting files.

function myFunction() {
  var sourceFolderId = "/* source folder id */";
  var destinationFolderId = "/* dest folder id */";
  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  while (files.hasNext()) {
    var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
//    Logger.log(res) // If you use this, please remove the comment.
  }
}

Pattern 2 : Using Drive API v3

In this case, Drive API v3 is used for converting files. And here, I used the batch requests for this situation. Because the batch requests can use 100 API calls by one API call. By this, the issue of API quota can be removed.

function myFunction() {
  var sourceFolderId = "/* source folder id */";
  var destinationFolderId = "/* dest folder id */";

  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  var rBody = [];
  while (files.hasNext()) {
    rBody.push({
      method: "POST",
      endpoint: "https://www.googleapis.com/drive/v3/files/" + files.next().getId() + "/copy",
      requestBody: {
        mimeType: "application/vnd.google-apps.document",
        parents: [destinationFolderId]
      }
    });
  }
  var cycle = 100; // Number of API calls at 1 batch request.
  for (var i = 0; i < Math.ceil(rBody.length / cycle); i++) {
    var offset = i * cycle;
    var body = rBody.slice(offset, offset + cycle);
    var boundary = "xxxxxxxxxx";
    var contentId = 0;
    var data = "--" + boundary + "\r\n";
    body.forEach(function(e){
      data += "Content-Type: application/http\r\n";
      data += "Content-ID: " + ++contentId + "\r\n\r\n";
      data += e.method + " " + e.endpoint + "\r\n";
      data += e.requestBody ? "Content-Type: application/json; charset=utf-8\r\n\r\n" : "\r\n";
      data += e.requestBody ? JSON.stringify(e.requestBody) + "\r\n" : "";
      data += "--" + boundary + "\r\n";
    });
    var options = {
      method: "post",
      contentType: "multipart/mixed; boundary=" + boundary,
      payload: Utilities.newBlob(data).getBytes(),
      headers: {'Authorization': 'Bearer ' + ScriptApp.getOAuthToken()},
      muteHttpExceptions: true,
    };
    var res = UrlFetchApp.fetch("https://www.googleapis.com/batch", options).getContentText();
//    Logger.log(res); // If you use this, please remove the comment.
  }
}

Note :

  • If the number of API calls at 1 batch request is large (the current value is 100), please modify var cycle = 100.
  • If Drive API v3 cannot be used for team drive, please tell me. I can convert it for Drive API v2.
  • If the team drive is the reason of issue for your situation, can you try this after it copied PDF files to your Google Drive?

Reference :

  • Batching Requests

If these are not useful for you, I'm sorry.

like image 106
Tanaike Avatar answered Sep 27 '22 22:09

Tanaike


You can first of all fetch and store id of all files in a google sheet. Then you can proceed with processing each file normally by using it's id. Then after you have processed them mark that file as processed. And before processing a file check if that file is already processed.

If there are several files then you can also store the row number till where you have processed, next time continue after that.

Then at last create a trigger to execute your function every 10 minutes or so.

By this you can overcome execution time limit for single execution. API request quota and all will not be by-passed by this method.

like image 25
Umair Mohammad Avatar answered Sep 27 '22 23:09

Umair Mohammad