Whitespace gone from PDF extraction, and strange word interpretation

Tags: python, pdf, unicode, pypdf

Using the snippet below, I've attempted to extract the text data from this PDF file.

import pyPdf

def get_text(path):
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    content = ""
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + "\n"  # Extract text from page and add to content
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

The output I obtain, however, is devoid of whitespace between most of the words. This makes it difficult to perform natural language processing on the text (my ultimate goal here).

Also, the 'fi' in the word 'finger' is consistently interpreted as something else. This is rather problematic since this paper is about spontaneous finger movements...

Does anybody know why this might be happening? I don't even know where to start!

999

asked Jun 18 '12 17:06

Louis Thibault

2 Answers

As an alternative to PyPDF2, I suggest pdftotext:

#!/usr/bin/env python

"""Use pdftotext to extract text from PDFs."""

import pdftotext

# pdftotext expects the file to be opened in binary mode
with open("foobar.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Iterate over all the pages
for page in pdf:
    print(page)
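
On the 'fi' in 'finger' from the question: a likely cause, though this is an assumption without inspecting the PDF, is that the text is typeset with the single ligature character "ﬁ" (U+FB01) rather than the two letters "f" and "i". If that is what the extractor returns, Unicode NFKC normalization expands it back. Below is a minimal sketch using only the standard library; `normalize_extracted_text` and the sample string are hypothetical names for illustration, not part of pdftotext.

```python
import unicodedata


def normalize_extracted_text(text):
    """Expand compatibility characters and collapse runs of whitespace."""
    # NFKC replaces compatibility characters with their plain equivalents,
    # e.g. the U+FB01 ligature becomes "fi" and U+00A0 becomes a normal space.
    expanded = unicodedata.normalize("NFKC", text)
    # split() with no arguments splits on any whitespace run, so joining
    # with a single space collapses newlines, tabs and repeated spaces.
    return " ".join(expanded.split())


# Hypothetical sample standing in for text returned by a PDF extractor
sample = "spontaneous \ufb01nger\u00a0movements"
print(normalize_extracted_text(sample))
```

Note that NFKC also rewrites other compatibility characters (superscript digits, for instance), and it cannot restore spaces that the extractor never produced. If the font lacks a proper Unicode mapping, the ligature may come out as some unrelated character, in which case normalization will not help.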

133

answered Oct 04 '22 22:10

Martin Thoma

PyPDF doesn't read newline characters.

So use PyPDF4 instead.

Install it using

pip install PyPDF4

and use this code as an example

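import PyPDF4
import re  # only needed if the filter in the loop below is uncommented

# Open the PDF in binary mode; the with-block closes the file afterwards
with open(r'3134.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
    # Page indices start at 0, so getPage(1) is the second page
    pageObj = pdfReader.getPage(1)
    pages_text = pageObj.extractText()

# Print the extracted text line by line
for line in pages_text.split('\n'):
    #if re.match(r"^PDF", line):
    print(line)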
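
The commented-out `re.match` line in the loop above hints at keeping only the lines that match a pattern. Here is a rough sketch of that idea, run on a hard-coded string standing in for the output of `extractText()`; the helper name `matching_lines` and the sample text are made up for illustration.

```python
import re


def matching_lines(page_text, pattern):
    """Return the lines of page_text that match pattern at their start."""
    regex = re.compile(pattern)
    # re.match anchors at the beginning of each line, as in the
    # commented-out filter from the answer's loop.
    return [line for line in page_text.split("\n") if regex.match(line)]


# Hypothetical text standing in for pageObj.extractText()
sample_text = "PDF extraction notes\nsome other line\nPDF second heading"
for line in matching_lines(sample_text, r"^PDF"):
    print(line)
```

This only works when the extractor actually emits newline characters between lines, which is the point of this answer.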

answered Oct 04 '22 21:10

prathik shirolkar
