I am trying to parse a PDF file containing an Indian voters list, which is in Hindi (Devanagari script). The PDF displays all the text correctly, but when I tried dumping it to text using PDFMiner, the output characters differ from those in the original PDF. For example, the displayed (correct) word is सामान्य, but the output word is सपमपनद. I want to know why this is happening and how to correctly parse this type of PDF file. I am also including the sample PDF file: http://164.100.180.82/Rollpdf/AC276/S24A276P001.pdf


Parsing a pdf(Devanagari script) using PDFminer gives incorrect output [duplicate]

1 Answer

This issue is very similar to the one discussed in this answer, and the sample document there also resembles the document here.

Just like in the case of the document in that other question, the ToUnicode map of the Devanagari script font used in the document here maps multiple completely different glyphs to identical Unicode code points. Thus, text extraction based on this mapping is bound to fail, and most text extractors rely on this information, especially in the absence of a font Encoding entry, as is the case here.
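To illustrate what such a collision looks like, here is a minimal sketch that parses the `bfchar` sections of a ToUnicode CMap and lists the Unicode targets shared by more than one character code. The CMap excerpt is made up for illustration and is not taken from the sample PDF; the function names are hypothetical, and `bfrange` sections are ignored for brevity.

```python
import re
from collections import defaultdict

# Hypothetical ToUnicode CMap excerpt (NOT extracted from the sample PDF).
# Codes 0x0025 and 0x0026 both claim to be U+092A, which is the kind of
# collision described above.
SAMPLE_CMAP = """
4 beginbfchar
<0003> <0020>
<0024> <0938>
<0025> <092A>
<0026> <092A>
endbfchar
"""

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
BFCHAR_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} from bfchar sections only."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in BFCHAR_PAIR.findall(block):
            # The destination is hex-encoded UTF-16BE text.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def find_collisions(mapping):
    """Return {Unicode string: [codes]} for targets used by several codes."""
    by_target = defaultdict(list)
    for code, text in mapping.items():
        by_target[text].append(code)
    return {t: sorted(c) for t, c in by_target.items() if len(c) > 1}


if __name__ == "__main__":
    for text, codes in find_collisions(parse_bfchar(SAMPLE_CMAP)).items():
        print(repr(text), [hex(c) for c in codes])
```

If a font's real ToUnicode stream shows many such collisions, no extractor that trusts the map can recover the original text.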

Some text extractors can use the mapping of glyph to Unicode contained in the embedded font program (if present). But checking this mapping in the Devanagari script font program used in the document here, it turns out that it associates most glyphs with the code points U+F020 through U+F062, with glyph names like "uniF020" etc.

[Image: Compact UnicodeBmp]

These Unicode code points are located in the Unicode Private Use Area, i.e. they do not have a standardized meaning but applications may use them as they like.

Thus, text extractors using the Unicode mapping contained in the font program wouldn't deliver immediately intelligible text either.
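A quick way to recognise this situation in an extractor's output is to check how much of the text falls into the Private Use Area. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and it relies on the fact that private use characters have the Unicode general category `Co`.

```python
import unicodedata


def is_private_use(ch):
    """True if the single character ch is a private use code point."""
    return unicodedata.category(ch) == "Co"


def private_use_ratio(text):
    """Fraction of characters in text that are private use code points."""
    if not text:
        return 0.0
    return sum(is_private_use(ch) for ch in text) / len(text)


if __name__ == "__main__":
    # U+F020 lies in the range reported for the font program above.
    print(is_private_use("\uf020"))
    print(private_use_ratio("\uf020\uf021ab"))
```

A ratio close to 1 for a text run suggests that the extractor returned the font program's code points and that a replacement table is needed before the text becomes readable.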

There is one fact, though, which can help you to mostly automate text extraction from this document nonetheless: The same PDF object is referenced for the Devanagari script font on multiple pages, so on all pages referencing the same PDF object, the same original character identifier or the same font program private use Unicode code point refers to the same visual symbol. In the case of your document I counted only 5 copies of the font.

Thus, if you find a text extractor which either returns the character identifier (ignoring all ToUnicode maps) or returns the private use area Unicode code points from the font program, you can use its output and merely replace each entry according to a few maps, one per copy of the font.
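The replacement step could be sketched as follows. Everything here is an assumption for illustration: the font identifier, the code values, and the table entries are invented, and the real tables would have to be built by hand by comparing each glyph in the rendered PDF with the code the extractor returns for it.

```python
# Hypothetical replacement tables, one per font copy in the PDF.
# The keys are codes as returned by the extractor (character identifiers
# or private use code points); the values are the intended Unicode text.
FONT_MAPS = {
    "font_copy_1": {
        0xF021: "\u0938",  # स (invented assignment)
        0xF022: "\u093e",  # ा (invented assignment)
        0xF023: "\u092e",  # म (invented assignment)
        0xF024: "\u0928",  # न (invented assignment)
        0xF025: "\u094d",  # ् (invented assignment)
        0xF026: "\u092f",  # य (invented assignment)
    },
}

UNKNOWN = "\ufffd"  # replacement character for codes missing from a table


def decode_runs(runs, font_maps=FONT_MAPS):
    """Translate (font id, [codes]) runs into text using per-font tables."""
    parts = []
    for font_id, codes in runs:
        table = font_maps.get(font_id, {})
        parts.extend(table.get(code, UNKNOWN) for code in codes)
    return "".join(parts)


if __name__ == "__main__":
    runs = [("font_copy_1",
             [0xF021, 0xF022, 0xF023, 0xF022, 0xF024, 0xF025, 0xF026])]
    print(decode_runs(runs))
```

Table values are strings rather than single characters because one glyph may stand for several code points, such as a conjunct. Glyphs drawn in visual rather than logical order would need an additional reordering pass that this sketch does not attempt.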

I have not yet had a use for such a text extractor, so I don't know of any in the Python context. But who knows, perhaps pdfminer or a similar package can be told (by some option) to ignore the misleading ToUnicode map and then be used as outlined above.

167

answered Oct 26 '22 23:10

mkl


Tags:

python

parsing

pdf

pdfminer

hindi

Rohit
