
PyPDF2 insists on removing all the spaces [duplicate]

Tags: python, pypdf2

I have read a number of other Stack Overflow answers and have yet to find a satisfactory solution, although this has been asked before. When I use PyPDF2 to read PDF documents, it merges all of the words in a sentence into one continuous string. Has anyone made any progress in figuring out how to avoid this? Below is the code:

    import PyPDF2
    import pandas as pd
    import struct
    from nltk import word_tokenize

    # Open the PDF in binary mode
    pdfFileObj = open("notes.pdf", 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # The page count is read correctly
    print(type(pdfReader.numPages))

    # Read the first page
    pageObj = pdfReader.getPage(0)

    # The extracted text comes out with all spaces removed
    print(pageObj.extractText())

Below is a sample of the output:

2)Explanationofthedifferencebetweenprobabilityandstatistics.Theroleofprobability
instatisticaldecisionmaking.ExamplesoftheuseofProbabilityinStatistics.
3)Datasummarization(graphicalandnumerical)

4)Probabilityandrandomvariables
asked Apr 28 '16 12:04 by Steve


1 Answer

I never figured out how to stop PyPDF2 from removing the spaces; it is a very unwieldy library. I found the answer suggesting PDFMiner to be the most helpful: it is easy to understand and better documented. Below is a link for anyone having the same issue.

http://survivalengineer.blogspot.ie/2014/04/parsing-pdfs-in-python.html
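
For anyone who wants a starting point without following the link, here is a minimal sketch of the PDFMiner route. It assumes the maintained pdfminer.six fork is installed (`pip install pdfminer.six`), which may differ from the older PDFMiner API the blog post describes. The helper name `extract_page_text` and the file name notes.pdf are only illustrative. Whether the spaces come back depends on how the PDF encodes its text, so treat this as something to try rather than a guaranteed fix.

```python
def extract_page_text(pdf_path, page_index=0):
    """Return the text of a single page (0-based index) using pdfminer.six.

    Hypothetical helper for illustration; not part of any library.
    """
    if page_index < 0:
        raise ValueError("page_index must be >= 0")

    # Imported lazily so this module still loads if pdfminer.six is missing.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # LAParams controls layout analysis. Its word_margin setting influences
    # when a gap between characters is treated as a space, so it is the
    # first knob to adjust if words still run together.
    laparams = LAParams()

    return extract_text(pdf_path, page_numbers=[page_index], laparams=laparams)


# Example call (assumes a file named notes.pdf exists in the working directory):
# print(extract_page_text("notes.pdf", 0))
```

If words still come out merged, passing something like `LAParams(word_margin=0.05)` makes the layout analysis more willing to insert spaces; the right value depends on the document.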

answered Nov 10 '22 04:11 by Steve