Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PDF to Text extractor in nodejs without OS dependencies

Is there a way to extract text from PDFs in nodejs without any OS dependencies (like pdf2text, or xpdf on windows)? I wasn't able to find any 'native' pdf packages in nodejs. They always are a wrapper/util on top of an existing OS command. Thanks

like image 581
bartium Avatar asked Jun 09 '15 13:06

bartium


2 Answers

Have you checked PDF2Json? It is built on top of PDF.js. Though it is not providing the text output as a single line but I believe you may just reconstruct the final text based on the generated Json output:

'Texts': an array of text blocks with position, actual text and styling informations: 'x' and 'y': relative coordinates for positioning 'clr': a color index in color dictionary, same 'clr' field as in 'Fill' object. If a color can be found in color dictionary, 'oc' field will be added to the field as 'original color" value. 'A': text alignment, including: left center right 'R': an array of text run, each text run object has two main fields: 'T': actual text 'S': style index from style dictionary. More info about 'Style Dictionary' can be found at 'Dictionary Reference' section

like image 130
Eugene Avatar answered Oct 02 '22 02:10

Eugene


Thought I'd chime in here for anyone who came across this question in the future. I had this problem and spent hours over literally all the PDF libraries on NPM. My requirements were that I needed to run it on AWS Lambda so could not depend on OS dependencies.

The code below is adapted from another stackoverflow answer (which I cannot currently find). The only difference being that we import the ES5 version which works with Node >= 12. If you just import pdfjs-dist there will be an error of "Readable Stream is not defined". Hope it helps!

import * as pdfjslib from 'pdfjs-dist/es5/build/pdf.js';

export default class Pdf {
  public static async getPageText(pdf: any, pageNo: number) {
    const page = await pdf.getPage(pageNo);
    const tokenizedText = await page.getTextContent();
    const pageText = tokenizedText.items.map((token: any) => token.str).join('');
    return pageText;
  }

  public static async getPDFText(source: any): Promise<string> {
    const pdf = await pdfjslib.getDocument(source).promise;
    const maxPages = pdf.numPages;
    const pageTextPromises = [];
    for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
      pageTextPromises.push(Pdf.getPageText(pdf, pageNo));
    }
    const pageTexts = await Promise.all(pageTextPromises);
    return pageTexts.join(' ');
  }
}

Usage

const fileBuffer = fs.readFile('sample.pdf');
const pdfText = await Pdf.getPDFText(fileBuffer);
like image 44
Josh Avatar answered Oct 02 '22 02:10

Josh