I have read a number of other Stack Overflow answers and have yet to find a satisfactory answer to this, although it has been asked before. When I use PyPDF2 to read PDF documents, it merges all of the words in a sentence into one continuous string. Has anyone made any progress in figuring out how to avoid this? Below is the code:
import PyPDF2
import pandas as pd
import struct
from nltk import word_tokenize

# Legacy PyPDF2 API (PdfFileReader, numPages, getPage, extractText); these names were removed in PyPDF2 3.0
pdfFileObj = open("notes.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# Reading the page count works fine
print(type(pdfReader.numPages))
# Read in the first page and print its extracted text
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
Below is a sample of the output:
2)Explanationofthedifferencebetweenprobabilityandstatistics.Theroleofprobability
instatisticaldecisionmaking.ExamplesoftheuseofProbabilityinStatistics.
3)Datasummarization(graphicalandnumerical)
4)Probabilityandrandomvariables
I never figured out how to get PyPDF2 to keep the spaces between words; it is a very unwieldy library. The most helpful answer I found was to use PDFMiner instead. It is easy to understand and has better documentation. Below is a link for anyone having the same issue as me.
http://survivalengineer.blogspot.ie/2014/04/parsing-pdfs-in-python.html
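For anyone who wants a starting point before reading the linked post, here is a minimal sketch of the PDFMiner route. It assumes the maintained `pdfminer.six` fork is installed (`pip install pdfminer.six`) and uses its `extract_text` helper from `pdfminer.high_level`; the linked post may use a different, lower-level approach. The function names `tidy_whitespace` and `extract_page_text` are my own, made up for illustration, and I have not verified how well this works on the `notes.pdf` from the question.

```python
import re


def tidy_whitespace(text):
    """Collapse runs of spaces/tabs to one space, trim lines, drop empty lines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def extract_page_text(pdf_path, page_index=0):
    """Return the text of one page (zero-indexed) using pdfminer.six.

    Assumption: pdfminer.six is installed. The import is kept inside the
    function so the whitespace helper above works without it.
    """
    from pdfminer.high_level import extract_text

    raw = extract_text(pdf_path, page_numbers=[page_index])
    return tidy_whitespace(raw)
```

Calling `print(extract_page_text("notes.pdf"))` would be the equivalent of the `getPage(0)` / `extractText()` lines in the question. Whether the word spacing comes out right still depends on how the PDF was produced, so treat this as something to try rather than a guaranteed fix.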