
Error: cannot import name 'PDFDocument' from 'pdfminer.pdfparser'

I need to extract text from PDF files and have used pdfminer.six successfully to extract both text paragraphs and tables. But now this line

from pdfminer.pdfparser import PDFParser, PDFDocument

raises an error:

ImportError: cannot import name 'PDFDocument' from 'pdfminer.pdfparser' (C:\Users\[username]\Anaconda3\lib\site-packages\pdfminer\pdfparser.py)

I'm using Jupyter under Anaconda with Python 3.7.3 and the package pdfminer.six-20181108.

The code I'm using is based on the question "How to read pdf file using pdfminer3k?"

Following the advice in the GitHub issue linked here, I've uninstalled and reinstalled Anaconda, pdfminer.six and other packages several times: https://github.com/pdfminer/pdfminer.six/issues/196 A week ago it suddenly worked, but now I get the error again.

Since I'm working on Windows 10, I also tried Ubuntu Linux as described here: https://medium.com/hugo-ferreiras-blog/using-windows-subsystem-for-linux-for-data-science-9a8e68d7610c

Same error.

Then, based on the webpage linked below, I tried splitting the import of PDFParser and PDFDocument, changing it from

from pdfminer.pdfparser import PDFParser, PDFDocument

to

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage

Source: https://loctv.wordpress.com/2017/02/07/fix-importerror-cannot-import-name-pdfdocument-when-using-slate/ However, that change caused new errors later in the code.

The start of my code looks like this:

```
from pdfminer.pdfparser import PDFParser, PDFDocument  # this import raises the ImportError
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine

path = "[name and path of file]"  # placeholder for the PDF file path
fp = open(path, 'rb')  # PDFs must be opened in binary mode
```

I expect to be able to run the code and extract the text from the PDF file, but execution stops at the ImportError for PDFDocument from pdfminer.pdfparser.

Any advice on what I should do is much appreciated! Might it have something to do with how pdfminer.six is installed?
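
One way to narrow down an installation question like this is to ask the installed package which of its modules actually defines the name. The following is only a minimal diagnostic sketch using the standard library; the helper name `find_providers` is hypothetical and not part of pdfminer.six.

```python
import importlib


def find_providers(name, module_names):
    """Return the modules from module_names that import cleanly and define name.

    Hypothetical helper for diagnosis only; it is not part of pdfminer.six.
    """
    providers = []
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # The module (or its parent package) is not installed here.
            continue
        if hasattr(module, name):
            providers.append(module_name)
    return providers
```

Calling `find_providers("PDFDocument", ["pdfminer.pdfparser", "pdfminer.pdfdocument"])` in the same Jupyter kernel lists the modules of the installed package that expose `PDFDocument`. An empty list would suggest that no pdfminer package is importable from that kernel at all.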

asked May 07 '19 13:05 by Ingeborg


1 Answer

I got help from Notodden Serit. In pdfminer.six, PDFDocument is defined in pdfminer.pdfdocument rather than pdfminer.pdfparser, so change this:

from pdfminer.pdfparser import PDFParser, PDFDocument

to:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage

Then pass the parser to the PDFDocument constructor, changing:

doc = PDFDocument()

to:

doc = PDFDocument(parser)

Finally, change the page loop from:

for page in doc.get_pages():

to:

for page in PDFPage.create_pages(doc):
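
Put together, the three changes above might look roughly like the sketch below. It assumes pdfminer.six is installed and reuses the classes imported in the question; the function name `extract_text_blocks` and the choice to collect text into a list are illustrative assumptions, not the original code.

```python
def extract_text_blocks(path):
    """Return the text of each text box or text line found in the PDF at path.

    Hypothetical helper; a sketch of the corrected call sequence only.
    """
    # Imported inside the function so that merely defining it does not
    # require pdfminer.six to be installed.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    blocks = []
    with open(path, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)  # the parser is passed to the constructor
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Pages come from PDFPage.create_pages(doc), not doc.get_pages().
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
            layout = device.get_result()
            for obj in layout:
                if isinstance(obj, (LTTextBox, LTTextLine)):
                    blocks.append(obj.get_text())
    return blocks
```

A call such as `extract_text_blocks("example.pdf")`, where the file name is a placeholder, would then return one string per text element in page order.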

answered Nov 15 '22 08:11 by Ingeborg