How to extract text from PDF in JavaScript

2 Answers

This is an old question, but pdf.js has developed a lot over the years, so I would like to give a new answer: the extraction can be done locally, without involving any server or external service. Recent versions of pdf.js have a function, page.getTextContent(), that gives you the text content of a page. I've used it successfully with the code below.

Each step returns a promise, so you chain the steps with .then( function(){...} ) to proceed to the next one.

1) PDFJS.getDocument( data ).then( function(pdf) {

2) pdf.getPage(i).then( function(page){

3) page.getTextContent().then( function(textContent){
What you finally get is an array of text blocks, textContent.bidiTexts[], each with a str property. You concatenate them to get the text of one page. The text blocks' coordinates are used to judge whether a newline or a space needs to be inserted. (This may not be totally robust, but in my tests it seemed OK.)
The input parameter data needs to be either a URL or an ArrayBuffer. I used the readAsArrayBuffer(file) method of the FileReader API to get the data.
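To make the FileReader step concrete, here is a minimal sketch that wraps readAsArrayBuffer in a promise so it fits the same .then() style as the pdf.js calls. The helper name readFileAsArrayBuffer and the input element id in the usage comment are my own placeholders, not part of pdf.js.

```javascript
// Sketch: read a File (e.g. from <input type="file">) as an ArrayBuffer.
// `readFileAsArrayBuffer` is a hypothetical helper name, not a pdf.js API.
function readFileAsArrayBuffer(file) {
  return new Promise(function (resolve, reject) {
    var reader = new FileReader();
    // reader.result holds the ArrayBuffer once loading has finished.
    reader.onload = function () { resolve(reader.result); };
    reader.onerror = function () { reject(reader.error); };
    reader.readAsArrayBuffer(file);
  });
}

// Possible usage in a browser (the element id "pdf-input" is an assumption):
// var file = document.getElementById('pdf-input').files[0];
// readFileAsArrayBuffer(file).then(function (data) {
//   /* pass `data` to getDocument */
// });
```

The resolved value can then be handed to getDocument as the data parameter described above.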

Hope this helps.

Note: According to other users, the library has since been updated and the code below no longer works as written. According to a comment by async5, you need to replace textContent.bidiTexts with textContent.items.
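For the textContent.items variant mentioned in the note, a rough sketch of the same coordinate idea might look like this. It assumes each item has a str property and a six-element transform array whose entries [4] and [5] are the x and y position, which is how the versions I have seen expose them; the function name itemsToText and the default tolerance of 2 are my own choices.

```javascript
// Sketch: join pdf.js text items into the text of one page.
// Assumption: item.transform[5] is the y position of the item.
function itemsToText(items, tolerance) {
  var tol = typeof tolerance === 'number' ? tolerance : 2; // assumed default
  var text = '';
  var last = null;
  items.forEach(function (item) {
    if (last !== null) {
      var sameLine = Math.abs(item.transform[5] - last.transform[5]) <= tol;
      if (!sameLine) {
        // The y position changed: treat it as a new line.
        text += '\n';
      } else if (!/\s$/.test(last.str) && !/^\s/.test(item.str)) {
        // Same line and no whitespace at the join: separate with a space.
        text += ' ';
      }
    }
    text += item.str;
    last = item;
  });
  return text;
}
```

Like the original heuristic, this is not totally robust (multi-column layouts and rotated text will confuse it), but it keeps the line structure for simple pages.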

    function Pdf2TextClass() {
        var self = this;
        this.complete = 0;

        /**
         * @param data ArrayBuffer of the pdf file content (or a URL string)
         * @param callbackPageDone To inform the progress each time
         *        a page is finished. The callback function's input parameters are:
         *        1) number of pages done;
         *        2) total number of pages in file.
         * @param callbackAllDone The input parameter of callback function is
         *        the result of extracted text from pdf file.
         */
        this.pdfToText = function (data, callbackPageDone, callbackAllDone) {
            console.assert(data instanceof ArrayBuffer || typeof data == 'string');
            PDFJS.getDocument(data).then(function (pdf) {
                var total = pdf.numPages;
                callbackPageDone(0, total);
                var layers = {};
                // `var i` keeps the loop counter from leaking into the global scope.
                for (var i = 1; i <= total; i++) {
                    pdf.getPage(i).then(function (page) {
                        var n = page.pageNumber;
                        page.getTextContent().then(function (textContent) {
                            if (null != textContent.bidiTexts) {
                                var page_text = "";
                                var last_block = null;
                                for (var k = 0; k < textContent.bidiTexts.length; k++) {
                                    var block = textContent.bidiTexts[k];
                                    if (last_block != null && last_block.str[last_block.str.length - 1] != ' ') {
                                        if (block.x < last_block.x)
                                            page_text += "\r\n";
                                        else if (last_block.y != block.y && (last_block.str.match(/^(\s?[a-zA-Z])$|^(.+\s[a-zA-Z])$/) == null))
                                            page_text += ' ';
                                    }
                                    page_text += block.str;
                                    last_block = block;
                                }
                                console.log("page " + n + " finished."); //" content: \n" + page_text);
                                layers[n] = page_text + "\n\n";
                            }
                            ++self.complete;
                            callbackPageDone(self.complete, total);
                            if (self.complete == total) {
                                window.setTimeout(function () {
                                    var full_text = "";
                                    var num_pages = Object.keys(layers).length;
                                    for (var j = 1; j <= num_pages; j++)
                                        full_text += layers[j];
                                    callbackAllDone(full_text);
                                }, 1000);
                            }
                        }); // end of page.getTextContent().then
                    }); // end of page.then
                } // end of for
            });
        }; // end of pdfToText()
    } // end of class

137

answered Sep 20 '22 00:09

gm2008

I couldn't get gm2008's example to work (the internal data structure of pdf.js has apparently changed), so I wrote my own fully promise-based solution that doesn't use any DOM elements, query selectors, or canvas, using the updated pdf.js from the example at Mozilla.

It takes a file path as input, since I'm using it with node-webkit. You need to make sure you have the cmaps downloaded and that PDFJS.cMapUrl points to them, and you need pdf.js and pdf.worker.js to get this working.

    /**
     * Extract text from PDFs with PDF.js
     * Uses the demo pdf.js from https://mozilla.github.io/pdf.js/getting_started/
     */
    this.pdfToText = function(data) {

        PDFJS.workerSrc = 'js/vendor/pdf.worker.js';
        PDFJS.cMapUrl = 'js/vendor/pdfjs/cmaps/';
        PDFJS.cMapPacked = true;

        return PDFJS.getDocument(data).then(function(pdf) {
            var pages = [];
            for (var i = 0; i < pdf.numPages; i++) {
                pages.push(i);
            }
            return Promise.all(pages.map(function(pageNumber) {
                return pdf.getPage(pageNumber + 1).then(function(page) {
                    return page.getTextContent().then(function(textContent) {
                        return textContent.items.map(function(item) {
                            return item.str;
                        }).join(' ');
                    });
                });
            })).then(function(pages) {
                return pages.join("\r\n");
            });
        });
    };

Usage:

    self.pdfToText(files[0].path).then(function(result) {
        console.log("PDF done!", result);
    });
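If you are on a newer pdf.js build, the same idea may need small changes. As far as I know, the library is exposed there as pdfjsLib rather than PDFJS, and getDocument returns a loading task whose .promise property resolves to the document. Here is a sketch under those assumptions; the function name pdfToTextModern and passing the library in as the lib parameter are my own choices, made so the function does not depend on a particular global.

```javascript
// Sketch for newer pdf.js builds.
// Assumptions: `lib` is the pdf.js namespace object (commonly `pdfjsLib`)
// and lib.getDocument(source) returns an object with a `.promise` property.
function pdfToTextModern(lib, source) {
  return lib.getDocument(source).promise.then(function (pdf) {
    var pagePromises = [];
    // pdf.js page numbers start at 1.
    for (var n = 1; n <= pdf.numPages; n++) {
      pagePromises.push(
        pdf.getPage(n).then(function (page) {
          return page.getTextContent();
        }).then(function (textContent) {
          return textContent.items.map(function (item) {
            return item.str;
          }).join(' ');
        })
      );
    }
    // Promise.all keeps the results in page order.
    return Promise.all(pagePromises);
  }).then(function (pages) {
    return pages.join('\r\n');
  });
}

// Possible usage (the path is a placeholder):
// pdfToTextModern(pdfjsLib, 'some/file.pdf').then(function (text) {
//   console.log(text);
// });
```

In those builds the worker path is usually set through pdfjsLib.GlobalWorkerOptions.workerSrc instead of PDFJS.workerSrc, but check the version you actually ship.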

answered Sep 20 '22 00:09

SchizoDuckie

Tags:

javascript

text

pdf

nacho4d
