PyPDF2 insists on removing all the spaces [duplicate]

Question

I have read a number of other Stack Overflow answers and have yet to find a satisfactory answer to this, although it has been asked before. When I use PyPDF2 to read PDF documents, it merges all of the words in a sentence into one continuous string. Has anyone made any progress in figuring out how to avoid this? Below is the code:

    import struct

    import pandas as pd
    import PyPDF2
    from nltk import word_tokenize

    pdfFileObj = open("notes.pdf", 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # the page count reads fine
    print(type(pdfReader.numPages))

    # read the first page and print its text
    pageObj = pdfReader.getPage(0)
    print(pageObj.extractText())

Below is a sample of the output:

2)Explanationofthedifferencebetweenprobabilityandstatistics.Theroleofprobability
instatisticaldecisionmaking.ExamplesoftheuseofProbabilityinStatistics.
3)Datasummarization(graphicalandnumerical)

4)Probabilityandrandomvariables

Steve · Accepted Answer

I never figured out how to stop PyPDF2 from removing the spaces; I found it a very unwieldy library. The most helpful suggestion I came across was to use PDFMiner instead. It is easier to understand and better documented. Below is a link for anyone having the same issue as me.

http://survivalengineer.blogspot.ie/2014/04/parsing-pdfs-in-python.html
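For anyone who wants a starting point without following the link, here is a minimal sketch of the PDFMiner route. It assumes the maintained `pdfminer.six` package (installed with `pip install pdfminer.six`), which may differ from the PDFMiner version the linked post uses. The names `extract_pdf_text` and `whitespace_ratio` are my own hypothetical helpers, not part of either library; the second is only a rough check for the merged-words symptom shown in the question.

```python
def extract_pdf_text(path):
    """Return the text of the PDF at `path` using pdfminer.six.

    Assumption: pdfminer.six is installed. The import is done lazily so
    this module still loads when the package is missing.
    """
    from pdfminer.high_level import extract_text

    return extract_text(path)


def whitespace_ratio(text):
    """Return the fraction of characters in `text` that are whitespace.

    Hypothetical sanity check: a value near 0 on ordinary prose suggests
    the extractor merged the words together.
    """
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)
```

Calling `extract_pdf_text("notes.pdf")` and then `whitespace_ratio` on the result gives a quick indication of whether the spaces survived. Whether they do still depends on how the PDF itself encodes its text, so treat this as something to try rather than a guaranteed fix.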

Tags: python, pypdf2
