 

Parsing PDF files in Hadoop MapReduce

I have to parse PDF files that are in HDFS in a MapReduce program in Hadoop. So I get the PDF file from HDFS as input splits, and it has to be parsed and sent to the Mapper class. For implementing this InputFormat I had gone through this link. How can these input splits be parsed and converted into text format?

asked Feb 24 '12 by WR10

People also ask

What is MapReduce PDF?

MapReduce is a simple and powerful programming model that enables the development of scalable parallel applications to process large amounts of data scattered across a cluster of machines.

What is map and Reduce in Hadoop?

MapReduce is a Hadoop framework used for writing applications that can process vast amounts of data on large clusters. It can also be called a programming model in which we can process large datasets across clusters of computers. The model also allows data to be stored in a distributed form.
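To make the map and reduce steps concrete, here is the classic word-count example, written against the older org.apache.hadoop.mapred API (the same API the answer below refers to). It is a minimal sketch, unrelated to the PDF use case in this question: the mapper emits a (word, 1) pair for every token, and the reducer sums the counts per word.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Emit (word, 1) for every token in the input line
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Sum the counts emitted for this word by all mappers
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}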

Does Hadoop use MapReduce?

MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop File System (HDFS). It is a core component, integral to the functioning of the Hadoop framework.


1 Answer

Processing PDF files in Hadoop can be done by extending the FileInputFormat class. Let the class extending it be WholeFileInputFormat. In the WholeFileInputFormat class you override the getRecordReader() method (and have isSplitable() return false), so that each PDF is received as one individual input split. These whole-file splits can then be parsed in the mapper to extract the text. This link gives a clear example of how to extend FileInputFormat.
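Here is a minimal sketch of what that looks like with the older org.apache.hadoop.mapred API (where getRecordReader() lives). The class names WholeFileInputFormat and WholeFileRecordReader follow the answer, but the bodies are illustrative rather than copied from the linked example: isSplitable() returns false so a PDF is never cut at block boundaries, and the record reader hands the whole file to the mapper as a single (NullWritable, BytesWritable) record.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false; // never split a PDF: one file = one input split
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {

    private final FileSplit fileSplit;
    private final JobConf conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit fileSplit, JobConf conf) {
        this.fileSplit = fileSplit;
        this.conf = conf;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? fileSplit.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() { }

    // Deliver the entire file as a single record, exactly once
    public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (processed) {
            return false;
        }
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }
}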

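From there, the mapper does the actual PDF-to-text conversion. The answer does not name a parsing library, so the use of Apache PDFBox below (the 1.x API, matching the era of this question) and the class name PdfTextMapper are assumptions; any PDF text extractor that accepts a byte stream would do. This sketch emits the file name as the key and the extracted text as the value.

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Hypothetical mapper: PDFBox is an assumption, not part of the original answer
public class PdfTextMapper extends MapReduceBase
        implements Mapper<NullWritable, BytesWritable, Text, Text> {

    public void map(NullWritable key, BytesWritable value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // value holds the raw bytes of one whole PDF file
        PDDocument pdf = PDDocument.load(
                new ByteArrayInputStream(value.getBytes(), 0, value.getLength()));
        try {
            String text = new PDFTextStripper().getText(pdf);
            String fileName = ((FileSplit) reporter.getInputSplit()).getPath().getName();
            output.collect(new Text(fileName), new Text(text));
        } finally {
            pdf.close();
        }
    }
}

To wire this up, the job configuration would set conf.setInputFormat(WholeFileInputFormat.class) and conf.setMapperClass(PdfTextMapper.class) on the JobConf.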
answered Nov 15 '22 by WR10