
Can I read PDF or Word Docs with Node.js?

I can't find any packages to do this. I know PHP has a ton of libraries for PDFs (like http://www.fpdf.org/), but is there anything for Node?

Shamoon asked Jan 27 '12 18:01


People also ask

How do I read a Word document in node JS?

You can use the Aspose.Words Cloud SDK for Node.js to extract text from DOC/DOCX, OpenOffice, and PDF files. It's a paid API, but the free plan provides 150 free API calls per month.

How do I read a DOCX file in node?

File access goes through Node's built-in fs module, which needs no installation. For PDFs, install a PDF reader package. For XLS and XLSX workbooks, install xlsx. For DOCX files, which are ZIP archives, node-stream-zip can open the container.


4 Answers

textract is a great lib that supports PDF, DOC, DOCX, and other formats.
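
For anyone who wants to see roughly what that looks like, here is a minimal sketch. It assumes textract is installed (npm install textract) and that it exposes a callback-style fromFileWithPath(filePath, callback); extractText is a hypothetical helper name, and the module is passed in as an argument so the wrapper can be tried with any object of the same shape.

```javascript
// Wrap a callback-style extractor in a Promise.
// `extractor` is assumed to be the textract module (require("textract")),
// or any object with a compatible fromFileWithPath(filePath, callback).
function extractText(extractor, filePath) {
  return new Promise((resolve, reject) => {
    extractor.fromFileWithPath(filePath, (error, text) => {
      if (error) {
        reject(error);
      } else {
        resolve(text);
      }
    });
  });
}

// Usage (the file name is hypothetical):
// const textract = require("textract");
// extractText(textract, "report.docx").then(console.log, console.error);
```

Note that textract relies on external command-line tools for some formats, so check its README for what needs to be installed on your system.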

James_1x0 answered Oct 20 '22 17:10


Looks like there are a few for PDF, but I didn't find any for Word.

CPU-bound processing like that isn't really Node's strong point anyway (i.e., you get no additional benefit from using Node over any other language). A pragmatic approach would be to find a good tool and invoke it from Node.

I have heard good things around the office about docsplit: http://documentcloud.github.com/docsplit/

While it's not Node, you could easily invoke it from Node with http://nodejs.org/docs/latest/api/all.html#child_process.exec
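
A rough sketch of calling an external tool from Node follows. It uses child_process.execFile rather than exec, since execFile takes the arguments as an array and does not go through a shell, so file names need no escaping. runTool is a hypothetical helper name, and the docsplit command line in the usage comment is an assumption about its CLI, so check the docsplit docs for the exact flags.

```javascript
const { execFile } = require("child_process");

// Run an external program and resolve with whatever it wrote to stdout.
// Rejects if the program cannot be started or exits with a non-zero code.
function runTool(command, args) {
  return new Promise((resolve, reject) => {
    execFile(command, args, (error, stdout, stderr) => {
      if (error) {
        error.stderr = stderr; // keep the tool's error output for debugging
        reject(error);
      } else {
        resolve(stdout);
      }
    });
  });
}

// Usage (assumes docsplit is installed and on PATH; file and
// directory names are hypothetical):
// runTool("docsplit", ["text", "report.pdf", "--output", "out"])
//   .then(() => console.log("text written to ./out"))
//   .catch((err) => console.error(err.stderr || err.message));
```

The conversion itself runs in a separate process, so Node's event loop stays free while the tool does the CPU-heavy work.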

timoxley answered Oct 20 '22 16:10


You can easily convert one format into the other, or, for example, use a .doc template to generate a .pdf file, but you will probably want to use an existing web service for this task.

This can be done using a service such as Livedocx, for example.

To use this service from Node, see node-livedocx. (Disclaimer: I am the author of this Node module.)

Tim answered Oct 20 '22 16:10


I would suggest looking into unoconv for your initial conversion. It uses LibreOffice or OpenOffice for the actual conversion, which adds some overhead.

I'd set up a few workers with all the necessities installed, and use a request/response queue for handling the conversion (you may want to look into kue or zmq).
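
To illustrate the queue idea without any dependencies, here is a minimal in-process sketch that caps how many conversions run at once. It is a stand-in for a real queue such as kue or zmq, not a replacement: jobs live in memory only and are lost if the process dies. createQueue and the job shape (a function returning a value or a Promise) are hypothetical names chosen for this example.

```javascript
// A tiny in-process job queue that runs at most `concurrency` jobs at once.
// Each job is a function returning a value or a Promise.
function createQueue(concurrency) {
  const pending = [];
  let active = 0;

  // Start queued jobs until the concurrency limit is reached.
  function next() {
    while (active < concurrency && pending.length > 0) {
      const { job, resolve, reject } = pending.shift();
      active++;
      Promise.resolve()
        .then(job)
        .then(resolve, reject)
        .finally(() => {
          active--;
          next();
        });
    }
  }

  return {
    // Enqueue a job; the returned Promise settles with the job's result.
    push(job) {
      return new Promise((resolve, reject) => {
        pending.push({ job, resolve, reject });
        next();
      });
    },
  };
}

// Usage (convertWithUnoconv is a hypothetical function that shells out
// to unoconv and returns a Promise):
// const queue = createQueue(2);
// queue.push(() => convertWithUnoconv("letter.doc")).then(console.log);
```

A limit close to the number of CPU cores is a reasonable starting point, since each conversion is CPU-bound.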

In general, this is a heavy, CPU-bound task that should be offloaded. Pandoc and others specifically mention .docx, not .doc, so they may or may not be options as well.


Note: I know this question is old, just wanted to provide a current answer for others coming across this.

Tracker1 answered Oct 20 '22 17:10