Parse PDF in Node.js

I am using meteor-react to upload PDF documents to my Node.js backend, where I want to read each uploaded PDF as JSON or in some other structured form. Is that possible, and which library or tool would you recommend? Thank you!

peter asked Jan 03 '18 08:01



1 Answer

There are a couple of Node packages for parsing PDFs:

  1. pdf2json: https://www.npmjs.com/package/pdf2json
  2. pdfreader: https://www.npmjs.com/package/pdfreader

Check out their GitHub and documentation pages. It appears to me that pdf2json is a more complete solution, while pdfreader might be easier to get started with. You'll have to experiment and choose based on your project requirements.
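To give a feel for the pdfreader route, here is a minimal sketch, assuming `npm install pdfreader`. It relies on `PdfReader.parseFileItems` (for a file path) and `PdfReader.parseBuffer` (for an in-memory upload), both of which call back once per item and once more with no item at the end of the file. The names `groupItems` and `extractLines` are my own, not part of the library, and grouping by exact `y` is a simplification: real PDFs may need a tolerance when comparing coordinates.

```javascript
// Hypothetical helper: group pdfreader-style items into lines of text per page.
// A `page` item starts a new page; text items carry `text`, `x` and `y`.
function groupItems(items) {
  const pages = [];
  let rows = null; // Map of y coordinate -> text items found on that line
  for (const item of items) {
    if (item.page) {
      rows = new Map();
      pages.push(rows);
    } else if (item.text && rows) {
      if (!rows.has(item.y)) rows.set(item.y, []);
      rows.get(item.y).push(item);
    }
  }
  // Sort lines top to bottom, and fragments within a line left to right.
  return pages.map((pageRows) =>
    [...pageRows.entries()]
      .sort(([y1], [y2]) => y1 - y2)
      .map(([, line]) =>
        line.sort((a, b) => a.x - b.x).map((i) => i.text).join(" ")
      )
  );
}

// Hypothetical wrapper: accepts a file path or a Buffer (e.g. an uploaded file)
// and resolves to an array of pages, each an array of text lines.
async function extractLines(source) {
  // Dynamic import so this works whether the installed pdfreader is CJS or ESM.
  const mod = await import("pdfreader");
  const PdfReader = mod.PdfReader || mod.default.PdfReader;
  const items = [];
  return new Promise((resolve, reject) => {
    const onItem = (err, item) => {
      if (err) reject(err);
      else if (!item) resolve(groupItems(items)); // no item means end of file
      else items.push(item);
    };
    const reader = new PdfReader();
    if (Buffer.isBuffer(source)) reader.parseBuffer(source, onItem);
    else reader.parseFileItems(source, onItem);
  });
}
```

You could then call something like `extractLines(uploadedBuffer).then((pages) => console.log(JSON.stringify(pages)))`, where `uploadedBuffer` stands for whatever Buffer your upload handler gives you. If you need more than plain text (form fields, fonts, layout details), pdf2json is probably the one to look at instead.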

Arash Motamedi answered Oct 01 '22 10:10