I searched the web and Stack Overflow but didn't find a solution. Here is what I'm trying to do: I receive certain attachments by email that I would like to have as plain text for further processing. My script looks like this:
function MyFunction() {
  var threads = GmailApp.search ('label:templabel');
  var messages = GmailApp.getMessagesForThreads(threads);
  for (i = 0; i < messages.length; ++i)
  {
    j = messages[i].length;
    var messageBody = messages[i][0].getBody();
    var messageSubject = messages [i][0].getSubject();
    var attach = messages [i][0].getAttachments();
    var attachcontent = attach.getContentAsString();
    GmailApp.sendEmail("mail", messageSubject, "", {htmlBody: attachcontent});
  }
}
Unfortunately this doesn't work. Does anybody here have an idea how I can do this? Is it even possible?
Thank you very much in advance.
Best, Phil
Edit: Updated for DriveApp, since DocsList is deprecated.
I suggest breaking this down into two problems. The first is how to get a pdf attachment from an email, the second is how to convert that pdf to text.
As you've found out, getContentAsString() does not magically change a pdf attachment to plain text or html. We need to do something a little more complicated.
First, we'll get the attachment as a Blob, a utility class used by several Services to exchange data.
var blob = attachments[0].getAs(MimeType.PDF);
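Since getAttachments() returns an array, and not every attachment is necessarily a PDF, it may be worth picking out the PDFs before converting. The following is only a sketch: findPdfAttachments is a hypothetical helper, not part of GmailApp, and it assumes nothing more than that each attachment exposes getContentType() and getName(), as a GmailAttachment does. The string 'application/pdf' is the value of MimeType.PDF.

```javascript
// Hypothetical helper (not part of GmailApp): keep only attachments that look like PDFs.
// Assumes each item exposes getContentType() and getName(), as GmailAttachment does.
function findPdfAttachments(attachments) {
  return (attachments || []).filter(function (att) {
    var type = String(att.getContentType() || '').toLowerCase();
    var name = String(att.getName() || '').toLowerCase();
    // Some mailers send PDFs with a generic content type, so fall back to the file name.
    return type === 'application/pdf' || /\.pdf$/.test(name);
  });
}
```

In myFunction(), something like var pdfs = findPdfAttachments(message.getAttachments()); followed by a check that pdfs.length > 0 would skip messages that carry no PDF at all.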
So with the second problem separated out, and maintaining the assumption that we're interested in only the first attachment of the first message of each thread labeled templabel, here is how myFunction() looks:
/**
 * Get messages labeled 'templabel', and send myself the text contents of
 * pdf attachments in new emails.
 */
function myFunction() {
  var threads = GmailApp.search('label:templabel');
  var threadsMessages = GmailApp.getMessagesForThreads(threads);
  for (var thread = 0; thread < threadsMessages.length; ++thread) {
    // Only the first message of each thread is examined
    var message = threadsMessages[thread][0];
    var messageBody = message.getBody();
    var messageSubject = message.getSubject();
    var attachments = message.getAttachments();
    // Skip messages without attachments, otherwise attachments[0] is undefined
    if (attachments.length === 0) continue;
    var blob = attachments[0].getAs(MimeType.PDF);
    var filetext = pdfToText( blob, {keepTextfile: false} );
    GmailApp.sendEmail(Session.getActiveUser().getEmail(), messageSubject, filetext);
  }
}
We're relying on a helper function, pdfToText(), to convert our pdf blob into text, which we'll then send to ourselves as a plain text email. This helper function has a variety of options; by setting keepTextfile: false, we've elected to just have it return the text content of the PDF file to us, and leave no residual files in our Drive.
This utility is available as a gist. Several examples are provided there.
A previous answer indicated that it was possible to use the Drive API's insert method to perform OCR, but it didn't provide code details. With the introduction of Advanced Google Services, the Drive API is easily accessible from Google Apps Script. You do need to switch on and enable the Drive API from the editor, under Resources > Advanced Google Services.
pdfToText() uses the Drive service to generate a Google Doc from the content of the PDF file. Unfortunately, this contains the "pictures" of each page in the document - not much we can do about that. It then uses the regular Document service (DocumentApp) to extract the document body as plain text.
/**
 * See gist: https://gist.github.com/mogsdad/e6795e438615d252584f
 *
 * Convert pdf file (blob) to a text file on Drive, using built-in OCR.
 * By default, the text file will be placed in the root folder, with the same
 * name as source pdf (but extension 'txt'). Options:
 *   keepPdf      (boolean, default false) Keep a copy of the original PDF file.
 *   keepGdoc     (boolean, default false) Keep a copy of the OCR Google Doc file.
 *   keepTextfile (boolean, default true)  Keep a copy of the text file.
 *   path         (string, default blank)  Folder path to store file(s) in.
 *   ocrLanguage  (ISO 639-1 code)         Default 'en'.
 *   textResult   (boolean, default false) If true and keepTextfile true, return
 *                                         string of text content. If keepTextfile
 *                                         is false, text content is returned without
 *                                         regard to this option. Otherwise, return
 *                                         id of textfile.
 *
 * @param {blob}   pdfFile Blob containing pdf file
 * @param {object} options (Optional) Object specifying handling details
 *
 * @returns {string} id of text file (default) or text content
 */
function pdfToText ( pdfFile, options ) {
  // Ensure Advanced Drive Service is enabled
  try {
    Drive.Files.list();
  }
  catch (e) {
    throw new Error( "To use pdfToText(), first enable 'Drive API' in Resources > Advanced Google Services." );
  }

  // Set default options
  options = options || {};
  options.keepTextfile = options.hasOwnProperty("keepTextfile") ? options.keepTextfile : true;

  // Prepare resource object for file creation
  var parents = [];
  if (options.path) {
    // The Drive API expects parent references of the form {id: ...},
    // so pass the folder's id rather than the DriveApp Folder object.
    var folder = getDriveFolderFromPath (options.path);
    if (folder) {
      parents.push( {id: folder.getId()} );
    }
  }
  var pdfName = pdfFile.getName();
  var resource = {
    title: pdfName,
    mimeType: pdfFile.getContentType(),
    parents: parents
  };

  // Save PDF to Drive, if requested
  if (options.keepPdf) {
    var file = Drive.Files.insert(resource, pdfFile);
  }

  // Save PDF as GDOC
  resource.title = pdfName.replace(/pdf$/, 'gdoc');
  var insertOpts = {
    ocr: true,
    ocrLanguage: options.ocrLanguage || 'en'
  };
  var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);

  // Get text from GDOC
  var gdocDoc = DocumentApp.openById(gdocFile.id);
  var text = gdocDoc.getBody().getText();

  // We're done using the Gdoc. Unless requested to keepGdoc, delete it.
  if (!options.keepGdoc) {
    Drive.Files.remove(gdocFile.id);
  }

  // Save text file, if requested
  if (options.keepTextfile) {
    resource.title = pdfName.replace(/pdf$/, 'txt');
    resource.mimeType = MimeType.PLAIN_TEXT;
    var textBlob = Utilities.newBlob(text, MimeType.PLAIN_TEXT, resource.title);
    var textFile = Drive.Files.insert(resource, textBlob);
  }

  // Return result of conversion
  if (!options.keepTextfile || options.textResult) {
    return text;
  }
  else {
    return textFile.id;
  }
}
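What pdfToText() hands back depends on the interplay of keepTextfile and textResult. The sketch below restates just that decision as a standalone function so the combinations are easy to check; describeResult is a hypothetical name used for illustration and is not part of the gist.

```javascript
// Hypothetical illustration: which kind of value pdfToText() returns for given options.
function describeResult(options) {
  options = options || {};
  // keepTextfile defaults to true when the caller does not supply it.
  var keepTextfile = options.hasOwnProperty('keepTextfile') ? options.keepTextfile : true;
  // Text is returned when no text file is kept, or when textResult asks for it.
  if (!keepTextfile || options.textResult) {
    return 'text content';
  }
  return 'id of text file';
}
```

So pdfToText(blob) with no options yields a file id, while {keepTextfile: false} or {textResult: true} yields the extracted text.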
The conversion to DriveApp is helped with this utility from Bruce McPherson:
// From: http://ramblings.mcpher.com/Home/excelquirks/gooscript/driveapppathfolder
function getDriveFolderFromPath (path) {
  return (path || "/").split("/").reduce ( function(prev,current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    else {
      return current ? null : prev;
    }
  },DriveApp.getRootFolder());
}
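To see how the reduce() walks a path one segment at a time, here is the same logic run against a mock folder tree outside Apps Script. Both names are hypothetical stand-ins: mockFolder imitates only the getFoldersByName() / hasNext() / next() parts of the DriveApp folder interface, and walkPath takes the root as a parameter instead of calling DriveApp.getRootFolder().

```javascript
// Hypothetical stand-in for a DriveApp Folder: supports getFoldersByName() only.
function mockFolder(name, children) {
  children = children || [];
  return {
    name: name,
    getFoldersByName: function (wanted) {
      var matches = children.filter(function (c) { return c.name === wanted; });
      var i = 0;
      // Minimal iterator with the same hasNext()/next() shape as a folder iterator.
      return {
        hasNext: function () { return i < matches.length; },
        next: function () { return matches[i++]; }
      };
    }
  };
}

// Same reduce logic as getDriveFolderFromPath(), with the root passed in.
function walkPath(root, path) {
  return (path || '/').split('/').reduce(function (prev, current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    // Empty segments (leading or trailing slashes) leave the folder unchanged;
    // once a segment is missing, null is carried through the remaining segments.
    return current ? null : prev;
  }, root);
}

var root = mockFolder('root', [mockFolder('reports', [mockFolder('2015')])]);
```

With this tree, walkPath(root, '/reports/2015') returns the nested folder, while a path containing a missing segment returns null, so callers of getDriveFolderFromPath() may want to check for null before using the result.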