Whitespace gone from PDF extraction, and strange word interpretation

Using the snippet below, I've attempted to extract the text data from this PDF file.

import pyPdf

def get_text(path):
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    # Iterate pages
    content = ""
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + "\n"  # Extract text from page and add to content
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

The output I obtain, however, is devoid of whitespace between most of the words. This makes it difficult to perform natural language processing on the text (my ultimate goal here).

Also, the 'fi' in the word 'finger' is consistently interpreted as something else. This is rather problematic since this paper is about spontaneous finger movements...

Does anybody know why this might be happening? I don't even know where to start!

Louis Thibault asked Jun 18 '12 17:06



2 Answers

As an alternative to PyPDF2, I suggest pdftotext:

#!/usr/bin/env python

"""Use pdftotext to extract text from PDFs."""

import pdftotext

with open("foobar.pdf", "rb") as f:  # pdftotext needs the file opened in binary mode
    pdf = pdftotext.PDF(f)

# Iterate over all the pages
for page in pdf:
    print(page)
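
Whichever extractor is used, the page strings may still contain ligature characters and non-breaking spaces. Below is a minimal sketch of a clean-up step, assuming the garbled 'fi' from the question is the Unicode ligature U+FB01 (the question does not confirm this). `clean_extracted_text` is a hypothetical helper name, not part of pdftotext.

```python
import unicodedata


def clean_extracted_text(pages):
    """Join page strings, expand ligatures, and collapse whitespace.

    Assumption: the extractor emitted compatibility characters such as
    U+FB01 (the "fi" ligature) or U+00A0 (non-breaking space).
    """
    text = "\n".join(pages)
    # NFKC maps compatibility characters to their plain equivalents,
    # e.g. U+FB01 -> "fi" and U+00A0 -> " ".
    text = unicodedata.normalize("NFKC", text)
    # Collapse every run of whitespace (including newlines) to one space.
    return " ".join(text.split())


if __name__ == "__main__":
    sample = ["spontaneous \ufb01nger\u00a0movements\n", "second   page"]
    print(clean_extracted_text(sample))
```

The function only needs an iterable of strings, so it should accept the `pdf` object iterated above. It cannot restore spaces that the extractor never produced; it only tidies the text that did come out.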
Martin Thoma answered Oct 04 '22 22:10


PyPDF doesn't read newline characters, so use PyPDF4 instead.

Install it using

pip install PyPDF4

and use this code as an example

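import re

import PyPDF4

# Open in binary mode and close the file when done
with open(r'3134.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF4.PdfFileReader(pdf_file)
    # Pages are zero-indexed, so getPage(1) is the second page
    page = pdf_reader.getPage(1)
    pages_text = page.extractText()

for line in pages_text.split('\n'):
    # Optional filter: keep only lines that start with "PDF"
    # if re.match(r"^PDF", line):
    print(line)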
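
The commented-out `re.match` line above hints at filtering the extracted lines. Here is a small sketch of that idea, kept independent of any PDF library so it can be tried on a plain string. `filter_lines` and its `pattern` parameter are hypothetical names introduced for illustration.

```python
import re


def filter_lines(text, pattern=None):
    """Return the non-empty lines of text, optionally keeping only
    those whose start matches the regular expression pattern."""
    lines = [line.strip() for line in text.split("\n")]
    # Drop lines that are empty after stripping whitespace.
    lines = [line for line in lines if line]
    if pattern is not None:
        regex = re.compile(pattern)
        # re.match anchors at the start of each line.
        lines = [line for line in lines if regex.match(line)]
    return lines


if __name__ == "__main__":
    sample = "PDF title\n\nbody text\nPDF footer\n"
    print(filter_lines(sample))
    print(filter_lines(sample, r"PDF"))
```

Passing the string returned by `extractText()` as `text` would apply the same filter to a real page.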
prathik shirolkar answered Oct 04 '22 21:10