While processing a file with pdfminer (pdf2txt.py) I received empty output:
dan@work:~/project$ pdf2txt.py docs/homericaeast.pdf
dan@work:~/project$
Can anybody say what is wrong with this file and what I can do to get data from it?
Here's the output of dumppdf.py docs/homericaeast.pdf:
<trailer>
<dict size="4">
<key>Info</key>
<value><ref id="2" /></value>
<key>Root</key>
<value><ref id="1" /></value>
<key>ID</key>
<value><list size="2">
<string size="16">on ¤µF¤5Á>ó_ýv¬`</string>
<string size="16">on ¤µF¤5Á>ó_ýv¬`</string>
</list></value>
<key>Size</key>
<value><number>27</number></value>
</dict>
</trailer>
<trailer>
<dict size="4">
<key>Info</key>
<value><ref id="2" /></value>
<key>Root</key>
<value><ref id="1" /></value>
<key>ID</key>
<value><list size="2">
<string size="16">on ¤µF¤5Á>ó_ýv¬`</string>
<string size="16">on ¤µF¤5Á>ó_ýv¬`</string>
</list></value>
<key>Size</key>
<value><number>27</number></value>
</dict>
</trailer>
To extract text from a PDF file using PDFMiner in Python, we open the example.pdf file with open, create the PDFParser object with the resulting in_file, and then use TextConverter to convert the text into a string.
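For reference, the steps above can be sketched roughly as follows. This assumes the pdfminer.six fork is installed; the function name extract_text_to_string and the path in the usage line are made up for illustration.

```python
from io import StringIO


def extract_text_to_string(pdf_path: str) -> str:
    """Return the text of every page in pdf_path as a single string."""
    # Imported inside the function so the sketch can be loaded even where
    # pdfminer.six is not installed; normally these sit at module top.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output = StringIO()
    with open(pdf_path, "rb") as in_file:
        parser = PDFParser(in_file)          # low-level syntax parser
        document = PDFDocument(parser)       # document structure (xref, trailer)
        resource_manager = PDFResourceManager()
        # TextConverter writes the extracted text into the StringIO buffer.
        device = TextConverter(resource_manager, output, laparams=LAParams())
        interpreter = PDFPageInterpreter(resource_manager, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
    return output.getvalue()
```

Usage would be something like print(extract_text_to_string("docs/homericaeast.pdf")). Going by the answers further down, a file whose CMap pdfminer does not recognize would still come back without readable text, so this is only the general pattern and not a fix for the file in the question.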
PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents. It focuses on getting and analyzing text data.
I have now fixed the problem with /OneByteIdentityH, similarly to the existing code for the two-byte Unicode mapping /Identity-H. The patch is in PR #179.
The problem is that pdfminer doesn't understand the CMap that you are using in this PDF.
If you do a custom build of pdfminer with STRICT=1 switched on in psparser.py, you'll get an error a bit like this:
pdfminer.psparser.PSTypeError: Literal required: <PDFStream(21): raw=267, {u'Filter': /'FlateDecode', u'CMapName': /u'OneByteIdentityH', u'Type': /u'CMap', u'CIDSystemInfo': <PDFObjRef:20>, u'Length': 266}>
I'm not hugely familiar with the code, but even allowing this through doesn't help, because it doesn't recognize the mapping (even if I hard-code the name to OneByteIdentityH and ask it to look that up). The net result is that the CMap contains no mappings, so it translates every character in your PDF to an empty string (well, None if I'm being picky).
The fix is probably to create a mapping for this CMap that simply returns the character that was passed in, similar to the other Identity maps already implemented in cmapdb.py.
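To make the idea concrete, here is a minimal standalone sketch of what such an identity mapping could look like. It is not the code from pdfminer or from the PR; the class names TwoByteIdentityCMap and OneByteIdentityCMap are hypothetical, and the only assumption is that a CMap exposes a decode method that turns the raw string bytes into character codes.

```python
import struct
from typing import Tuple


class TwoByteIdentityCMap:
    """Hypothetical stand-in for an /Identity-H style map:
    every two bytes (big-endian) form one character code."""

    def decode(self, code: bytes) -> Tuple[int, ...]:
        n = len(code) // 2
        # Any trailing odd byte is ignored in this sketch.
        return struct.unpack(">%dH" % n, code[: 2 * n])


class OneByteIdentityCMap:
    """Hypothetical map for /OneByteIdentityH:
    every single byte is passed through as its own character code."""

    def decode(self, code: bytes) -> Tuple[int, ...]:
        return struct.unpack(">%dB" % len(code), code)


if __name__ == "__main__":
    print(OneByteIdentityCMap().decode(b"Hi"))          # (72, 105)
    print(TwoByteIdentityCMap().decode(b"\x00H\x00i"))  # (72, 105)
```

The point is only that the one-byte case mirrors the two-byte case with a different unpack width, so each input byte maps to itself instead of to nothing.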