Using the snippet below, I've attempted to extract the text data from this PDF file.
import pyPdf

def get_text(path):
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    content = ""
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
The output I obtain, however, is devoid of whitespace between most of the words. This makes it difficult to perform natural language processing on the text (my ultimate goal here).
Also, the 'fi' in the word 'finger' is consistently interpreted as something else. This is rather problematic since this paper is about spontaneous finger movements...
Does anybody know why this might be happening? I don't even know where to start!
The spaces and "fi" were lost in the translation from text to PDF and they're not coming back.
@Ned Batchelder, thanks for your reply! Could you clarify what you mean by "assuming multi-character runs are words"?
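On the "fi" problem specifically: some extractors return the single Unicode ligature character U+FB01 instead of the two letters "f" and "i". If that is what is happening here (an assumption; the extractor may instead emit an unrelated glyph code, in which case this will not help), Unicode NFKC normalization folds the ligature back into plain letters. The helper name fold_ligatures below is made up for illustration.

```python
import unicodedata

def fold_ligatures(text):
    """Replace compatibility characters such as U+FB01 with plain letters."""
    # NFKC applies compatibility decomposition, then canonical composition.
    # It also turns the no-break space U+00A0 into an ordinary space.
    return unicodedata.normalize("NFKC", text)

sample = "spontaneous \ufb01nger movements"
print(fold_ligatures(sample))  # prints: spontaneous finger movements
```

Running this over the extracted string before the whitespace-collapsing step would also make the explicit replace of u"\xa0" unnecessary.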
PDF data can be tricky to deal with in a data science project. For example, when you extract text from a PDF for a Natural Language Processing (NLP) project, you might find whitespace missing between words, or whole words split apart by stray whitespace. You can't develop meaningful NLP models without correct whitespace between words.
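When the spaces are already gone from the extracted text, one possible fallback is to re-segment runs of letters against a word list. The following is only a minimal sketch of dictionary-based segmentation using dynamic programming; the function name segment, the max_word_len parameter, and the tiny vocabulary are all hypothetical, and a real project would need a large word list or a frequency model to get usable results.

```python
def segment(text, vocabulary, max_word_len=20):
    """Split a run of letters into known words; return None if no split exists."""
    n = len(text)
    # best[i] holds a list of words covering text[:i], or None if unreachable.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                # Prefer the split that uses the fewest words.
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

# Toy vocabulary, for illustration only.
vocab = {"spontaneous", "finger", "movements", "move", "fin", "a", "an"}
print(segment("spontaneousfingermovements", vocab))
# prints: ['spontaneous', 'finger', 'movements']
```

Preferring the fewest words is a crude tie-breaker; it avoids splits like "fin" plus leftovers but will still fail on words missing from the vocabulary, for which the function returns None.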
PDFBox is a pretty good tool for extracting text from PDF files using Java. Text extraction is its strength; if you want to modify/annotate or view PDF files, another tool might serve you better. It has code for identifying spaces in files.
From the Foxit Blog: Occasionally, you may open a PDF file and find that it displays strange symbols, weird letters, or unintelligible characters. With some files, it might happen when opened in one PDF software but not another, and with other files it might happen regardless of the PDF software being used.
As an alternative to PyPDF2, I suggest pdftotext:
#!/usr/bin/env python
"""Use pdftotext to extract text from PDFs."""
import pdftotext

# The PDF must be opened in binary mode
with open("foobar.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Iterate over all the pages
for page in pdf:
    print(page)
PyPDF doesn't read newline characters, so use PyPDF4 instead.
Install it using
pip install PyPDF4
and use this code as an example:
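import PyPDF4
import re

pdfFileObj = open(r'3134.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
# Pages are zero-indexed, so getPage(1) is the second page
pageObj = pdfReader.getPage(1)
pages_text = pageObj.extractText()

for line in pages_text.split('\n'):
    # Uncomment the next line to print only lines starting with "PDF"
    # if re.match(r"^PDF", line):
    print(line)

pdfFileObj.close()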