Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read pdf file using pdfminer3k?

I am using python 3.5 and I want to read the text, line by line from pdf files. Was trying to use pdfminer3k but not getting proper syntax anywhere. How to use it correctly?

like image 614
poshita singh Avatar asked May 17 '17 12:05

poshita singh


People also ask

How does PDF miner work?

PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.

What is LAParams in Pdfminer?

LAParams. Parameters: line_overlap – If two characters have more overlap than this they are considered to be on the same line. The overlap is specified relative to the minimum height of both characters.

What is pdfminer?

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows to obtain the exact location of texts in a page, as well as other information such as fonts or lines.

How do I extract text from a PDF file using pdfminer?

On Windows machines which don't have make command, paste the following commands on a command line prompt: PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py. pdf2txt.py extracts text contents from a PDF file. It extracts all the text that are to be rendered programmatically, i.e. text represented as ASCII or Unicode strings.

Which is better pypdf2 or pdfminer six?

PDFminer.six works more reliably than PyPDF2 (which fails with certain types of PDFs), in particular PDF version 1.7 However, text extraction with PDFminer.six is significantly slower than PyPDF2 by a factor of 6.

What is the best tool to extract text from a PDF?

PDFMiner's structure changed recently, so this should work for extracting text from the PDF files. Edit : Still working as of the June 7th of 2018. Verified in Python Version 3.x Edit: The solution works with Python 3.7 at October 3, 2019. I used the Python library pdfminer.six, released on November 2018.


1 Answers

I have corrected Lisa's code. It works now!

    fp = open(path, 'rb')
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    parser = PDFParser(fp)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    laparams.char_margin = 1.0
    laparams.word_margin = 1.0
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    extracted_text = ''

    for page in doc.get_pages():
        interpreter.process_page(page)
        layout = device.get_result()
        for lt_obj in layout:
            if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
                extracted_text += lt_obj.get_text()
like image 182
Matphy Avatar answered Sep 24 '22 23:09

Matphy