Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to correctly extract text from a pdf using pdf.js

I'm new to ES6 and Promise. I'm trying pdf.js to extract texts from all pages of a pdf file into a string array. And when extraction is done, I want to parse the array somehow. Say pdf file(passed via typedarray correctly) has 4 pages and my code is:

let str = [];
PDFJS.getDocument(typedarray).then(function(pdf) {
  for(let i = 1; i <= pdf.numPages; i++) {
    pdf.getPage(i).then(function(page) {
      page.getTextContent().then(function(textContent) {
        for(let j = 0; j < textContent.items.length; j++) {
          str.push(textContent.items[j].str);
        }
        parse(str);
      });
    });
  }
});

It manages to work, but, of course, the problem is my parse function is called 4 times. I just want to call parse only after all 4-pages-extraction is done.

like image 874
Sangbok Lee Avatar asked Nov 16 '16 15:11

Sangbok Lee


People also ask

How do I extract specific text from a PDF?

Once you've opened the file, click on the "Edit" tab, and then click on the "edit" icon. Now you can right-click on the text and select "Copy" to extract the text you need.

Is there a way to extract data from a PDF?

You can extract data from PDF files directly into Excel. First, you'll need to import your PDF file. Once you import the file, use the extract data button to begin the extraction process. You should see several instruction windows that will help you extract the selected data.


2 Answers

Similar to https://stackoverflow.com/a/40494019/1765767 -- collect page promises using Promise.all and don't forget to chain then's:

function gettext(pdfUrl){
  var pdf = pdfjsLib.getDocument(pdfUrl);
  return pdf.then(function(pdf) { // get all pages text
    var maxPages = pdf.pdfInfo.numPages;
    var countPromises = []; // collecting all page promises
    for (var j = 1; j <= maxPages; j++) {
      var page = pdf.getPage(j);

      var txt = "";
      countPromises.push(page.then(function(page) { // add page promise
        var textContent = page.getTextContent();
        return textContent.then(function(text){ // return content promise
          return text.items.map(function (s) { return s.str; }).join(''); // value page text 
        });
      }));
    }
    // Wait for all pages and join text
    return Promise.all(countPromises).then(function (texts) {
      return texts.join('');
    });
  });
}

// waiting on gettext to finish completion, or error
gettext("https://cdn.mozilla.net/pdfjs/tracemonkey.pdf").then(function (text) {
  alert('parse ' + text);
}, 
function (reason) {
  console.error(reason);
});
<script src="https://npmcdn.com/pdfjs-dist/build/pdf.js"></script>
like image 115
async5 Avatar answered Sep 19 '22 13:09

async5


A bit more cleaner version of @async5 and updated according to the latest version of "pdfjs-dist": "^2.0.943"

import PDFJS from "pdfjs-dist";
import PDFJSWorker from "pdfjs-dist/build/pdf.worker.js"; // add this to fit 2.3.0

PDFJS.disableTextLayer = true;
PDFJS.disableWorker = true; // not availaible anymore since 2.3.0 (see imports)

const getPageText = async (pdf: Pdf, pageNo: number) => {
  const page = await pdf.getPage(pageNo);
  const tokenizedText = await page.getTextContent();
  const pageText = tokenizedText.items.map(token => token.str).join("");
  return pageText;
};

/* see example of a PDFSource below */
export const getPDFText = async (source: PDFSource): Promise<string> => {
  Object.assign(window, {pdfjsWorker: PDFJSWorker}); // added to fit 2.3.0
  const pdf: Pdf = await PDFJS.getDocument(source).promise;
  const maxPages = pdf.numPages;
  const pageTextPromises = [];
  for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
    pageTextPromises.push(getPageText(pdf, pageNo));
  }
  const pageTexts = await Promise.all(pageTextPromises);
  return pageTexts.join(" ");
};

This is the corresponding typescript declaration file that I have used if anyone needs it.

declare module "pdfjs-dist";

type TokenText = {
  str: string;
};

type PageText = {
  items: TokenText[];
};

type PdfPage = {
  getTextContent: () => Promise<PageText>;
};

type Pdf = {
  numPages: number;
  getPage: (pageNo: number) => Promise<PdfPage>;
};

type PDFSource = Buffer | string;

declare module 'pdfjs-dist/build/pdf.worker.js'; // needed in 2.3.0

Example of how to get a PDFSource from a File with Buffer (from node types) :

file.arrayBuffer().then((ab: ArrayBuffer) => {
  const pdfSource: PDFSource = Buffer.from(ab);
});
like image 31
Atul Avatar answered Sep 19 '22 13:09

Atul