Using Drive API / DriveApp to convert from PDFs to Google Documents

The Task

I have 117 PDF files (average size ~238 KB) uploaded to Google Drive. I want to convert them all to Google Docs and keep them in a different Drive folder.

The Problem

I attempted to convert the files using Drive.Files.insert. However, under most circumstances, only 5 files could be converted this way before the function expires prematurely with this error

Limit Exceeded: DriveApp. (line #, file "Code")

where the line referenced above is when the insert function is called. After calling this function for the first time, subsequent calls typically failed immediately with no additional google doc created.

Approach

I used 3 main ways to achieve my goal. One was using the Drive.Files.insert, as mentioned above. The other two involved using Drive.Files.copy and sending a batch of HTTP requests. These last two methods were suggested by Tanaike, and I recommend reading his answer below for more information. The insert and copy functions are from Google Drive REST v2 API, while batching multiple HTTP requests is from Drive REST v3.

With Drive.Files.insert, I experienced issues dealing with execution limitations (explained in the Problem section above). One solution was to run the functions multiple times. And for that, I needed a way to keep track of which files were converted. I had two options for this: using a spreadsheet and a continuation token. Therefore, I had 4 different methods to test: the two mentioned in this paragraph, batching HTTP requests, and calling Drive.Files.copy.

Because team drives behave differently from regular drives, I felt it necessary to try each of those methods twice, one in which the folder containing the PDFs is a regular non-Team Drive folder and one in which that folder is under a Team Drive. In total, this means I had 8 different methods to test.

These are the exact functions I used. Each of these was used twice, with the only variations being the ID of the source and destination folders (for reasons stated above):

Method A: Using Drive.Files.insert and a spreadsheet

function toDocs() {
  var sheet = SpreadsheetApp.openById(/* spreadsheet id*/).getSheets()[0];
  var range = sheet.getRange("A2:E118");
  var table = range.getValues();
  var len = table.length;
  var resources = {
    title: null,
    mimeType: MimeType.GOOGLE_DOCS,
    parents: [{id: /* destination folder id */}]
  };
  var count = 0;
  var files = DriveApp.getFolderById(/* source folder id */).getFiles();
  while (files.hasNext()) {
    var blob = files.next().getBlob();
    var blobName = blob.getName();
    for (var i=0; i<len; i++) {
      if (table[i][0] === blobName.slice(5, 18)) {
        if (table[i][4])
          break;
        resources.title = blobName;
        Drive.Files.insert(resources, blob);  // Limit Exceeded: DriveApp. (line 51, file "Code")
        table[i][4] = "yes";
      }
    }

    if (++count === 10) {
      range.setValues(table);
      Logger.log("time's up");
    }
  }
}

Method B: Using Drive.Files.insert and a continuation token

function toDocs() {
  var folder = DriveApp.getFolderById(/* source folder id */);
  var sprop = PropertiesService.getScriptProperties();
  var contToken = sprop.getProperty("contToken");
  var files = contToken ? DriveApp.continueFileIterator(contToken) : folder.getFiles();
  var options = {
    ocr: true
  };
  var resource = {
    title: null,
    mimeType: null,
    parents: [{id: /* destination folder id */}]
  };

  while (files.hasNext()) {
    var blob = files.next().getBlob();
    resource.title = blob.getName();
    resource.mimeType = blob.getContentType();
    Drive.Files.insert(resource, blob, options);  // Limit Exceeded: DriveApp. (line 113, file "Code")
    sprop.setProperty("contToken", files.getContinuationToken());
  }
}

Method C: Using Drive.Files.copy

Credit for this function goes to Tanaike -- see his answer below for more details.

function toDocs() {
  var sourceFolderId = /* source folder id */;
  var destinationFolderId = /* destination folder id */;
  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  while (files.hasNext()) {
    var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
    Logger.log(res) 
  }
}

Method D: Sending batches of HTTP requests

Credit for this function goes to Tanaike -- see his answer below for more details.

function toDocs() {
  var sourceFolderId = /* source folder id */;
  var destinationFolderId = /* destination folder id */;

  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  var rBody = [];
  while (files.hasNext()) {
    rBody.push({
      method: "POST",
      endpoint: "https://www.googleapis.com/drive/v3/files/" + files.next().getId() + "/copy",
      requestBody: {
        mimeType: "application/vnd.google-apps.document",
        parents: [destinationFolderId]
      }
    });
  }
  var cycle = 20; // Number of API calls at 1 batch request.
  for (var i = 0; i < Math.ceil(rBody.length / cycle); i++) {
    var offset = i * cycle;
    var body = rBody.slice(offset, offset + cycle);
    var boundary = "xxxxxxxxxx";
    var contentId = 0;
    var data = "--" + boundary + "\r\n";
    body.forEach(function(e){
      data += "Content-Type: application/http\r\n";
      data += "Content-ID: " + ++contentId + "\r\n\r\n";
      data += e.method + " " + e.endpoint + "\r\n";
      data += e.requestBody ? "Content-Type: application/json; charset=utf-8\r\n\r\n" : "\r\n";
      data += e.requestBody ? JSON.stringify(e.requestBody) + "\r\n" : "";
      data += "--" + boundary + "\r\n";
    });
    var options = {
      method: "post",
      contentType: "multipart/mixed; boundary=" + boundary,
      payload: Utilities.newBlob(data).getBytes(),
      headers: {'Authorization': 'Bearer ' + ScriptApp.getOAuthToken()},
      muteHttpExceptions: true,
    };
    var res = UrlFetchApp.fetch("https://www.googleapis.com/batch", options).getContentText();
//    Logger.log(res); // If you use this, please remove the comment.
  }
}

What Worked and What Didn't

None of the functions using Drive.Files.insert worked. Every function using insert for conversion failed with this error

Limit Exceeded: DriveApp. (line #, file "Code")

(line number replaced with generic symbol). No further details or description of the error could be found. A notable variation was one in which I used a spreadsheet and the PDFs were in a team drive folder; while all other methods failed instantly without converting a single file, this one converted 5 before failing. However, when considering why this variation did better than the others, I think it was more of a fluke than any reason related to the use of particular resources (spreadsheet, team drive, etc.)
Using Drive.Files.copy and batch HTTP requests worked only when the source folder was a personal (non-Team Drive) folder.
Attempting to use the copy function while reading from a Team Drive folder fails with this error:

File not found: 1RAGxe9a_-euRpWm3ePrbaGaX5brpmGXu (line #, file "Code")

(line number replaced with generic symbol). The line being referenced is
```
var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
```
Using batch HTTP requests while reading from a Team Drive folder does nothing -- no doc files are created and no errors are thrown. Function silently terminates without having accomplished anything.

Conclusion

If you wish to convert a large number of PDFs to google docs or text files, then use Drive.Files.copy or send batches of HTTP requests and make sure that the PDFs are stored in a personal drive rather than a Team Drive.

Special thanks to @tehhowch for taking such an avid interest in my question and for repeatedly coming back to provide feedback, and to @Tanaike for providing code along with explanations that successfully solved my problem (with a caveat, read above for details).

355

asked Mar 16 '18 04:03

Manuel

2 Answers

You want to convert from PDF files in the folder to Google Documents. PDF files are in a folder of team drive. You want to import converted them to a folder of your Google Drive. If my understanding is correct, how about this method?

For the conversion from PDF to Google Document, it can convert using not only Drive.Files.insert(), but also Drive.Files.copy(). The advantage of use of Drive.Files.copy() is

Although Drive.Files.insert() has the size limitation of 5 MB, Drive.Files.copy() can use over the size of 5 MB.
In my envoronment, the process speed was faster than Drive.Files.insert().

For this method, I would like to propose the following 2 patterns.

Pattern 1 : Using Drive API v2

In this case, Drive API v2 of Advanced Google Services is used for converting files.

function myFunction() {
  var sourceFolderId = "/* source folder id */";
  var destinationFolderId = "/* dest folder id */";
  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  while (files.hasNext()) {
    var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
//    Logger.log(res) // If you use this, please remove the comment.
  }
}

Pattern 2 : Using Drive API v3

In this case, Drive API v3 is used for converting files. And here, I used the batch requests for this situation. Because the batch requests can use 100 API calls by one API call. By this, the issue of API quota can be removed.

function myFunction() {
  var sourceFolderId = "/* source folder id */";
  var destinationFolderId = "/* dest folder id */";

  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  var rBody = [];
  while (files.hasNext()) {
    rBody.push({
      method: "POST",
      endpoint: "https://www.googleapis.com/drive/v3/files/" + files.next().getId() + "/copy",
      requestBody: {
        mimeType: "application/vnd.google-apps.document",
        parents: [destinationFolderId]
      }
    });
  }
  var cycle = 100; // Number of API calls at 1 batch request.
  for (var i = 0; i < Math.ceil(rBody.length / cycle); i++) {
    var offset = i * cycle;
    var body = rBody.slice(offset, offset + cycle);
    var boundary = "xxxxxxxxxx";
    var contentId = 0;
    var data = "--" + boundary + "\r\n";
    body.forEach(function(e){
      data += "Content-Type: application/http\r\n";
      data += "Content-ID: " + ++contentId + "\r\n\r\n";
      data += e.method + " " + e.endpoint + "\r\n";
      data += e.requestBody ? "Content-Type: application/json; charset=utf-8\r\n\r\n" : "\r\n";
      data += e.requestBody ? JSON.stringify(e.requestBody) + "\r\n" : "";
      data += "--" + boundary + "\r\n";
    });
    var options = {
      method: "post",
      contentType: "multipart/mixed; boundary=" + boundary,
      payload: Utilities.newBlob(data).getBytes(),
      headers: {'Authorization': 'Bearer ' + ScriptApp.getOAuthToken()},
      muteHttpExceptions: true,
    };
    var res = UrlFetchApp.fetch("https://www.googleapis.com/batch", options).getContentText();
//    Logger.log(res); // If you use this, please remove the comment.
  }
}

Note :

If the number of API calls at 1 batch request is large (the current value is 100), please modify var cycle = 100.
If Drive API v3 cannot be used for team drive, please tell me. I can convert it for Drive API v2.
If the team drive is the reason of issue for your situation, can you try this after it copied PDF files to your Google Drive?

Reference :

Batching Requests

If these are not useful for you, I'm sorry.

106

answered Sep 27 '22 22:09

Tanaike

You can first of all fetch and store id of all files in a google sheet. Then you can proceed with processing each file normally by using it's id. Then after you have processed them mark that file as processed. And before processing a file check if that file is already processed.

If there are several files then you can also store the row number till where you have processed, next time continue after that.

Then at last create a trigger to execute your function every 10 minutes or so.

By this you can overcome execution time limit for single execution. API request quota and all will not be by-passed by this method.

answered Sep 27 '22 23:09

Umair Mohammad

Related questions
                            
                                Continuous looping scroll (no jQuery)
                            
                                Selenium verify that div has a scrollbar
                            
                                Printing an error object in javascript
                            
                                How do I sign into Google using Puppeteer?
                            
                                Testing native event emitters and natives modules in React Native
                            
                                Javascript Pull/Push Single/Multiple
                            
                                Unhandled Rejection Type Error from API fetch
                            
                                Video play() api in javascript is not working in safari until I change the preference of autoplay to 'Allow all Autoplay'
                            
                                AngularJS Google Maps Click Event not firing in ios devices ipad and iphone
                            
                                Vuejs -- Computed property not being defined on the instance
                            
                                How do I make a React component editable?
                            
                                What is the location of .babelrc file when using create-react-app?
                            
                                How to export Router instance in Vue.js 2
                            
                                Angular forms custom validator null return not removing error from form
                            
                                React + D3 force layout -- Circles not draggable anymore a few seconds after render
                            
                                R/Shiny selectInput widget size
                            
                                Language special characters in vue.js template badly encoded
                            
                                Webpack 4 Code splitting generating separate vendor files for each route
                            
                                ASP.NET Core - Call controller method in javascript without Ajax
                            
                                Regex for a fill-in-the-blank type string

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With