Where can I find a mapping of Identity-H encoded characters to ASCII or Unicode characters?

I have a PDF generated by a third party. I am trying to get the text out of it, but neither pdf2text nor copying and pasting produces readable text. After a little digging in the output (of either of the two), I found that each character on the screen is made up of three bytes. For example, "A" is the bytes ef, 81, and 81. The PDF's metadata claims it is encoded in Identity-H, so I assume what I am seeing is a set of characters encoded in Identity-H. I have a partial mapping based on the documents I already have, but I want to make a more complete one. To do that, I need something like an ASCII table for Identity-H.

300

asked Jun 19 '13 14:06

Chas. Owens

1 Answer

It is not always possible to extract text from a PDF, especially when the /ToUnicode map is missing, as pointed out by mkl.

If you cannot copy and paste the correct text from Acrobat, you have very little chance of extracting it yourself. If Acrobat cannot extract it, it is very unlikely that any other tool can.

If you manually create an encoding table, you could use it to remap the extracted characters to their correct values, but this will most likely only work for this one document.
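A minimal sketch of such a manual remapping in Python follows. The only entry taken from the question is "A" (the bytes ef 81 81, which decode as UTF-8 to the single character U+F041); the names `MANUAL_TABLE` and `remap` are hypothetical, and every further entry would have to be filled in by hand from the documents you already have.

```python
# Hand-built table: extracted character -> intended character.
# Only this one entry comes from the question; add more by inspection.
MANUAL_TABLE = {
    "\uf041": "A",  # bytes ef 81 81 in the extracted output
}


def remap(text: str, table: dict = MANUAL_TABLE) -> str:
    """Replace every character found in the table; leave the rest unchanged."""
    return text.translate(str.maketrans(table))


# The three bytes from the question, decoded the way a text tool would see them.
extracted = bytes([0xEF, 0x81, 0x81]).decode("utf-8")
fixed = remap(extracted)
```

Characters missing from the table pass through untouched, which makes the gaps in a partial mapping easy to spot in the output.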

Often this is done on purpose. I have seen documents that randomly remap characters differently for each font in the document. It is used as a form of obfuscation, and the only real way to extract text from these PDFs is to resort to OCR. Many financial reports use this type of trick to stop people from extracting their data.

Also, Identity-H is just a 1:1 mapping of every 2-byte character code from 0x0000 to 0xFFFF to the same CID value, i.e., Identity is an identity mapping. It says nothing about which Unicode character a code stands for.
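To make the identity mapping concrete, here is a small Python sketch that splits a content-stream string into 2-byte big-endian codes and returns them as CIDs. The function name `identity_h_cids` and the sample bytes are made up for illustration; the point is only that the CID equals the code, so no lookup table for Identity-H itself exists.

```python
def identity_h_cids(data: bytes) -> list:
    """Split a string shown with an Identity-H font into CIDs.

    Each character code is two bytes, high-order byte first,
    and the CID is simply that code (the identity mapping).
    """
    if len(data) % 2 != 0:
        raise ValueError("Identity-H strings must have an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# Hypothetical sample: two 2-byte codes, 0x0024 and 0x0102.
cids = identity_h_cids(b"\x00\x24\x01\x02")
```

What each CID looks like on screen depends entirely on the embedded font, which is why a /ToUnicode map is needed to get back to text.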

Your real problem is the missing /ToUnicode entry in this PDF. I suspect there is also an embedded CMap in your PDF, which would explain why there could be 3 bytes per character.
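One quick check on the three bytes quoted in the question: decoded as UTF-8, ef 81 81 is the single code point U+F041, which lies in Unicode's Private Use Area and whose low byte is 0x41, the ASCII code for "A". The Python sketch below only verifies that arithmetic. Whether the same offset holds for other characters in this document is an assumption to test against your partial mapping, not something the PDF guarantees.

```python
# The three bytes the question reports for "A".
raw = bytes([0xEF, 0x81, 0x81])

# Interpreted as UTF-8 they form one character, not three.
ch = raw.decode("utf-8")
code_point = ord(ch)

# The Basic Multilingual Plane's Private Use Area spans U+E000..U+F8FF.
in_private_use_area = 0xE000 <= code_point <= 0xF8FF

# Assumption to verify on more samples: the low byte is the ASCII code.
low_byte = code_point & 0xFF

print(f"U+{code_point:04X}, private use: {in_private_use_area}, low byte: {chr(low_byte)!r}")
```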

197

answered Sep 27 '22 23:09

Andrew Cash

Tags: text, character-encoding, pdf, encoding, unicode
