
PDFminer empty output

While processing a file with pdfminer (pdf2txt.py) I received empty output:

dan@work:~/project$ pdf2txt.py  docs/homericaeast.pdf 

dan@work:~/project$ 

Can anybody tell me what is wrong with this file and how I can extract data from it?

Here's the output of dumppdf.py docs/homericaeast.pdf:

<trailer>
<dict size="4">
<key>Info</key>
<value><ref id="2" /></value>
<key>Root</key>
<value><ref id="1" /></value>
<key>ID</key>
<value><list size="2">
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
</list></value>
<key>Size</key>
<value><number>27</number></value>
</dict>
</trailer>

<trailer>
<dict size="4">
<key>Info</key>
<value><ref id="2" /></value>
<key>Root</key>
<value><ref id="1" /></value>
<key>ID</key>
<value><list size="2">
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
</list></value>
<key>Size</key>
<value><number>27</number></value>
</dict>
</trailer>
Daniil Mashkin, asked May 07 '17 14:05



2 Answers

I have now fixed the problem with /OneByteIdentityH, handling it in the same way as the existing code for the two-byte Unicode mapping /Identity-H. The patch is in PR #179.

hynekcer, answered Oct 03 '22 13:10


The problem is that pdfminer doesn't understand the CMap that you are using in this PDF.

If you do a custom build of pdfminer with STRICT=1 switched on in psparser.py, you'll get an error a bit like this:

pdfminer.psparser.PSTypeError: Literal required: <PDFStream(21): raw=267, {u'Filter': /'FlateDecode', u'CMapName': /u'OneByteIdentityH', u'Type': /u'CMap', u'CIDSystemInfo': <PDFObjRef:20>, u'Length': 266}>

I'm not hugely familiar with the code, but even letting this through doesn't help, because pdfminer doesn't recognize the mapping (even if I hard-code the name to OneByteIdentityH and ask it to look that up). The net result is that the CMap contains no mappings, so every character in your PDF is translated to an empty string (well, None if I'm being picky).

The fix is probably to create a mapping for this CMap that simply returns the character that was passed in, similar to the other Identity maps already implemented in cmapdb.py.
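To make the idea concrete, here is a minimal standalone sketch of what such an identity mapping could look like. It does not import pdfminer, and the class names `TwoByteIdentityCMap` and `OneByteIdentityCMap` are hypothetical, chosen for illustration only; I'm assuming the real classes in cmapdb.py expose a similar `decode` method that turns a byte string into a tuple of character codes, so treat this as an outline rather than the actual patch.

```python
import struct


class TwoByteIdentityCMap:
    """Hypothetical stand-in for an /Identity-H style map: 2 bytes per code."""

    def decode(self, code: bytes) -> tuple:
        n = len(code) // 2
        if not n:
            return ()
        # Read n big-endian unsigned 16-bit values; a trailing odd byte is ignored.
        return struct.unpack(">%dH" % n, code[: 2 * n])


class OneByteIdentityCMap:
    """Hypothetical stand-in for /OneByteIdentityH: 1 byte per code."""

    def decode(self, code: bytes) -> tuple:
        # Each byte is its own character code, returned unchanged.
        return tuple(code)


if __name__ == "__main__":
    print(OneByteIdentityCMap().decode(b"Hi"))
    print(TwoByteIdentityCMap().decode(b"\x00H\x00i"))
```

Both maps yield the same codes for the same text; the only difference is how many bytes make up each code, which is why reusing the /Identity-H approach with a one-byte width looks like a reasonable fix.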

Peter Brittain, answered Oct 03 '22 13:10