 

Parsing PDF files in Hadoop MapReduce

I have to parse PDF files that are in HDFS in a MapReduce program in Hadoop. So I get the PDF file from HDFS as input splits, and it has to be parsed and sent to the Mapper class. For implementing this InputFormat I had gone through this link. How can these input splits be parsed and converted into text format?

asked Feb 24 '12 by WR10

People also ask

What is MapReduce PDF?

MapReduce is a simple and powerful programming model that enables the development of scalable parallel applications to process large amounts of data scattered across a cluster of machines.

What is map and Reduce in Hadoop?

MapReduce is a Hadoop framework used for writing applications that can process vast amounts of data on large clusters. It can also be called a programming model in which we can process large datasets across clusters of computers. The model also allows data to be stored in a distributed form.
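To make the map and reduce steps concrete, here is the classic word-count example, written against the older org.apache.hadoop.mapred API (the same API the answer below refers to). It is a minimal sketch, unrelated to the PDF use case in this question: the mapper emits a (word, 1) pair for every token, and the reducer sums the counts per word.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Emit (word, 1) for every token in the input line
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Sum the counts emitted for this word by all mappers
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}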

Does Hadoop use MapReduce?

MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop File System (HDFS). It is a core component, integral to the functioning of the Hadoop framework.


1 Answer

Processing PDF files in Hadoop can be done by extending the FileInputFormat class. Let the class extending it be WholeFileInputFormat. In the WholeFileInputFormat class you override the getRecordReader() method (and have isSplitable() return false), so that each PDF is received as one individual input split. These whole-file splits can then be parsed in the mapper to extract the text. This link gives a clear example of how to extend FileInputFormat.
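Here is a minimal sketch of what that looks like with the older org.apache.hadoop.mapred API (where getRecordReader() lives). The class names WholeFileInputFormat and WholeFileRecordReader follow the answer, but the bodies are illustrative rather than copied from the linked example: isSplitable() returns false so a PDF is never cut at block boundaries, and the record reader hands the whole file to the mapper as a single (NullWritable, BytesWritable) record.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false; // never split a PDF: one file = one input split
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {

    private final FileSplit fileSplit;
    private final JobConf conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit fileSplit, JobConf conf) {
        this.fileSplit = fileSplit;
        this.conf = conf;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? fileSplit.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() { }

    // Deliver the entire file as a single record, exactly once
    public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (processed) {
            return false;
        }
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }
}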

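From there, the mapper does the actual PDF-to-text conversion. The answer does not name a parsing library, so the use of Apache PDFBox below (the 1.x API, matching the era of this question) and the class name PdfTextMapper are assumptions; any PDF text extractor that accepts a byte stream would do. This sketch emits the file name as the key and the extracted text as the value.

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Hypothetical mapper: PDFBox is an assumption, not part of the original answer
public class PdfTextMapper extends MapReduceBase
        implements Mapper<NullWritable, BytesWritable, Text, Text> {

    public void map(NullWritable key, BytesWritable value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // value holds the raw bytes of one whole PDF file
        PDDocument pdf = PDDocument.load(
                new ByteArrayInputStream(value.getBytes(), 0, value.getLength()));
        try {
            String text = new PDFTextStripper().getText(pdf);
            String fileName = ((FileSplit) reporter.getInputSplit()).getPath().getName();
            output.collect(new Text(fileName), new Text(text));
        } finally {
            pdf.close();
        }
    }
}

To wire this up, the job configuration would set conf.setInputFormat(WholeFileInputFormat.class) and conf.setMapperClass(PdfTextMapper.class) on the JobConf.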
answered Nov 15 '22 by WR10