While processing a file with pdfminer (pdf2txt.py) I received empty output: <pre class="prettyprint"><code>dan@work:~/project$ pdf2txt.py docs/homericaeast.pdf dan@work:~/project$ </code></pre> Can anybody say what wrong with this file and what I can do to get data from it? Here's <code>dumppdf.py docs/homericaeast.pdf</code> output: <pre class="prettyprint"><code><trailer> <dict size="4"> <key>Info</key> <value><ref id="2" /></value> <key>Root</key> <value><ref id="1" /></value> <key>ID</key> <value><list size="2"> <string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string> <string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string> </list></value> <key>Size</key> <value><number>27</number></value> </dict> </trailer> <trailer> <dict size="4"> <key>Info</key> <value><ref id="2" /></value> <key>Root</key> <value><ref id="1" /></value> <key>ID</key> <value><list size="2"> <string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string> <string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string> </list></value> <key>Size</key> <value><number>27</number></value> </dict> </trailer> </code></pre>

Now I have fixed the problem with <code>/OneByteIdentityH</code> similarly to the code for two byte unicode mapping <code>/Identity-H</code>. The patch is in PR #179

PDFminer empty output

Tags:

2 Answers

Now I have fixed the problem with /OneByteIdentityH similarly to the code for two byte unicode mapping /Identity-H. The patch is in PR #179

answered Oct 03 '22 13:10

hynekcer

The problem is that pdfminer doesn't understand the CMap that you are using in this PDF.

If you do a custom build of pdfminer switching STRICT=1 on in psparser.py you'll get an error a bit like this:

Click to copy

pdfminer.psparser.PSTypeError: Literal required: <PDFStream(21): raw=267, {u'Filter': /'FlateDecode', u'CMapName': /u'OneByteIdentityH', u'Type': /u'CMap', u'CIDSystemInfo': <PDFObjRef:20>, u'Length': 266}>

I'm not hugely familiar with the code, but even allowing this through doesn't help, because it doesn't recognize the mapping (even if I hard code the name to OneByteIdentityH and ask it to look that up). The net result is that the CMap contains no mappings and so it translates every character in your PDF to an empty string (well None if I'm being picky).

The fix is probably to create a mapping for this CMap that simply returns the character that was passed in similar to the other Identity maps already implemented in cmapdb.py

answered Oct 03 '22 13:10

Peter Brittain

Related questions
                            
                                Pass in a list of possible routes to Flask?
                            
                                tkk checkbutton appears when loaded up with black box in it
                            
                                How can I use tensorflow metric function within keras models?
                            
                                Groupby.transform doesn't work in dask dataframe
                            
                                How to give name to each node in celery
                            
                                Need help combining two 3 channel images into 6 channel image Python
                            
                                Sqlalchemy representation for custom postgres range type
                            
                                How to pass user object to forms in Django
                            
                                Changing extents or axis limits on complex Holoviews figures
                            
                                How to extract unique permutations from pandas DataSeries?
                            
                                How can I send a plot.ly image inline of an html email using smtp?
                            
                                What is the purpose of __table_args__ in sqlalchemy?
                            
                                pytest: run test from code, not from command line
                            
                                Printing value in each bin in hist2d (matplotlib)
                            
                                Combining rows to 'others' in pandas
                            
                                How can I request (get) and read an xml file using python?
                            
                                Reproducing LASSO / Logistic Regression results in R with Python using the Iris Dataset
                            
                                How to convert string labels to one-hot vectors in TensorFlow?
                            
                                Python: save attachments from .msg files
                            
                                How to organize a Python project with pickle files?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

PDFminer empty output

Tags:

python

pdf

pdfminer

pdf-parsing

Daniil Mashkin

People also ask

2 Answers

hynekcer

Peter Brittain

Recent Activity

Donate For Us