Reading pdf files with python 3.6

Is there any way to open and read a PDF file with Python 3.6? I tried several libraries and tools, such as PyPDF2 and pdfrw, but none of them could extract the text content of a PDF document. Any help would be appreciated.

435

asked Dec 13 '17 15:12

amiref

1 Answer

Try: PyMuPDF

Python recipe: PDF TEXT EXTRACTION USING FITZ / MUPDF (PYMUPDF):

#!/usr/bin/env python
"""
Created on Wed Jul 29 07:00:00 2015

@author: Jorj McKie
Copyright (c) 2015 Jorj X. McKie

The license of this program is governed by the GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007. See the "COPYING" file of this repository.

This is an example for using the Python binding PyMuPDF of MuPDF.

This program extracts the text of an input PDF and writes it in a text file.
The input file name is provided as a parameter to this script (sys.argv[1])
The output file name is input-filename appended with ".txt".
Encoding of the text in the PDF is assumed to be UTF-8.
Change the ENCODING variable as required.
-------------------------------------------------------------------------------
"""
import fitz                 # this is PyMuPDF
import sys, json

ENCODING = "UTF-8"

def SortBlocks(blocks):
    '''
    Sort the blocks of a TextPage in ascending vertical pixel order,
    then in ascending horizontal pixel order.
    This should sequence the text in a more readable form, at least by
    convention of the Western hemisphere: from top-left to bottom-right.
    If you need something else, change the sortkey variable accordingly ...
    '''

    sblocks = []
    for b in blocks:
        x0 = str(int(b["bbox"][0]+0.99999)).rjust(4,"0") # x coord in pixels
        y0 = str(int(b["bbox"][1]+0.99999)).rjust(4,"0") # y coord in pixels
        sortkey = y0 + x0                                # = "yx"
        sblocks.append([sortkey, b])
    sblocks.sort()
    return [b[1] for b in sblocks] # return sorted list of blocks

def SortLines(lines):
    ''' Sort the lines of a block in ascending vertical direction. See comment
    in SortBlocks function.
    '''
    slines = []
    for l in lines:
        y0 = str(int(l["bbox"][1] + 0.99999)).rjust(4,"0")
        slines.append([y0, l])
    slines.sort()
    return [l[1] for l in slines]

def SortSpans(spans):
    ''' Sort the spans of a line in ascending horizontal direction. See comment
    in SortBlocks function.
    '''
    sspans = []
    for s in spans:
        x0 = str(int(s["bbox"][0] + 0.99999)).rjust(4,"0")
        sspans.append([x0, s])
    sspans.sort()
    return [s[1] for s in sspans]

#==============================================================================
# Main Program
#==============================================================================
ifile = sys.argv[1]
ofile = ifile + ".txt"

doc = fitz.Document(ifile)
pages = doc.pageCount
fout = open(ofile,"wb")                          # binary mode: encoded bytes are written below

for i in range(pages):
    pg_text = ""                                 # initialize page text buffer
    pg = doc.loadPage(i)                         # load page number i
    text = pg.getText(output = 'json')           # get its text in JSON format
    pgdict = json.loads(text)                    # create a dict out of it
    blocks = SortBlocks(pgdict["blocks"])        # now re-arrange ... blocks
    for b in blocks:
        lines = SortLines(b.get("lines", []))    # ... lines (blocks without lines, e.g. images, are skipped)
        for l in lines:
            spans = SortSpans(l["spans"])        # ... spans
            for s in spans:
                # ensure that spans are separated by at least 1 blank
                # (should make sense in most cases)
                if pg_text.endswith(" ") or s["text"].startswith(" "):
                    pg_text += s["text"]
                else:
                    pg_text += " " + s["text"]
            pg_text += "\n"                      # separate lines by newline

    pg_text = pg_text.encode(ENCODING, "ignore")
    fout.write(pg_text)

fout.close()
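
The sorting and span-joining steps of the script above can also be written with numeric tuple keys instead of zero-padded strings. This is only a sketch, not part of the original recipe: `page_text` and the `sample` dictionary are hypothetical names, and the sample values are invented to mimic the blocks/lines/spans layout the script reads. It sorts on the raw coordinates rather than on rounded pixel values, so the ordering can differ slightly from the script when two boxes fall within the same pixel.

```python
# Sketch only: "sample" imitates the blocks/lines/spans structure used by the
# script above. All coordinates and texts are invented for illustration.

def page_text(page_dict):
    """Return the text of one page, ordered top-to-bottom, left-to-right."""
    out = []
    # Sort blocks by the (y, x) position of their top-left corner.
    blocks = sorted(page_dict["blocks"],
                    key=lambda b: (b["bbox"][1], b["bbox"][0]))
    for block in blocks:
        # Blocks without a "lines" entry (for example images) contribute nothing.
        lines = sorted(block.get("lines", []), key=lambda ln: ln["bbox"][1])
        for line in lines:
            spans = sorted(line["spans"], key=lambda s: s["bbox"][0])
            text = ""
            for span in spans:
                piece = span["text"]
                # Ensure that spans are separated by at least one blank.
                if text and not text.endswith(" ") and not piece.startswith(" "):
                    text += " "
                text += piece
            out.append(text)
    return "\n".join(out) + "\n"


sample = {
    "blocks": [
        {"bbox": [72, 200, 300, 220], "lines": [
            {"bbox": [72, 200, 300, 220], "spans": [
                {"bbox": [150, 200, 300, 220], "text": "world"},
                {"bbox": [72, 200, 140, 220], "text": "Hello"},
            ]},
        ]},
        {"bbox": [72, 100, 300, 120], "lines": [
            {"bbox": [72, 100, 300, 120], "spans": [
                {"bbox": [72, 100, 300, 120], "text": "Title"},
            ]},
        ]},
        {"bbox": [72, 300, 300, 400], "type": 1},  # image-like block, no lines
    ]
}

print(page_text(sample), end="")
```

Tuple keys compare numerically, so they also work for coordinates that are negative or have more than four digits, which the zero-padded string keys do not handle.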

179

answered Oct 07 '22 10:10

Branko Petrović

Tags: python, python-3.x, pdf
