
PDF Copy Text Issue: Weird Characters

Tags: copy-paste, pdf

I tried to copy text from a PDF file but got weird characters. Strangely, Okular recognizes the text, but Sumatra PDF and Adobe do not. All three applications are installed on Windows 10 64-bit. To better explain my issue, here is a video: https://streamable.com/sw1hc. The "text layer workaround file" is one solution I found. Any help is greatly appreciated. Regards

ariefcfa asked Apr 02 '19 15:04




1 Answer

In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.

Mapping character codes to Unicode as described in the PDF specification

The PDF specification ISO 32000-1 (and similarly ISO 32000-2) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.

It has been quoted very often in other Stack Overflow answers, so I won't quote it here again.

Essentially, this is the algorithm used by Adobe Acrobat during copy and paste, and also by many other text extractors.

In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:

"If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing."
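The lookup order described above can be sketched roughly as follows. This is a simplified illustration for simple (single-byte) fonts only, not the full algorithm from the specification: composite fonts and predefined CMaps are left out. `SimpleFont`, `code_to_unicode`, and `GLYPH_LIST_EXCERPT` are hypothetical names invented for this sketch, and the glyph list is a four-entry stand-in for the real Adobe Glyph List.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Tiny illustrative excerpt; the real Adobe Glyph List has thousands of entries.
GLYPH_LIST_EXCERPT: Dict[str, str] = {
    "A": "A",
    "space": " ",
    "fi": "\ufb01",
    "eacute": "\u00e9",
}


@dataclass
class SimpleFont:
    """Hypothetical stand-in for an already parsed simple PDF font."""
    # Contents of the font's ToUnicode CMap, if it has one.
    to_unicode: Dict[int, str] = field(default_factory=dict)
    # Character code -> glyph name, as derived from the font's encoding.
    code_to_name: Dict[int, str] = field(default_factory=dict)


def code_to_unicode(font: SimpleFont, code: int) -> Optional[str]:
    """Return the Unicode text for a character code, or None if unknown."""
    # First choice: an explicit ToUnicode mapping.
    if code in font.to_unicode:
        return font.to_unicode[code]
    # Second choice: glyph name looked up in a glyph list.
    name = font.code_to_name.get(code)
    if name is not None and name in GLYPH_LIST_EXCERPT:
        return GLYPH_LIST_EXCERPT[name]
    # Nothing in the PDF says what this code means.
    return None


well_formed = SimpleFont(to_unicode={1: "C", 2: "h"})
# Subset font with a made-up glyph name and no ToUnicode map.
problematic = SimpleFont(code_to_name={1: "g001"})

print(code_to_unicode(well_formed, 1))
print(code_to_unicode(problematic, 1))
```

The `None` result for the second font corresponds to the point in the specification quoted above, where the extractor is on its own.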

What happens if the algorithm above fails to produce a Unicode value

This is where text extraction implementations differ: they try to determine the matching Unicode value by using heuristics, by using information from beyond the PDF, or by applying OCR to the glyph in question.

That the different programs you tried returned such different results shows that

  1. your PDF does not contain the information required for the algorithm above from the PDF specification and

  2. the heuristics used by those programs differ significantly, and Okular's heuristics work best for your document.
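If you need to compare the output of several extractors on many files, a crude plausibility check can help flag the obviously broken ones. The following is a minimal sketch of my own, not what Okular, Sumatra PDF, or Adobe do internally. `suspicious_ratio` is a hypothetical helper that counts characters which rarely appear in real text: control characters, private use code points, unassigned code points, and the replacement character U+FFFD.

```python
import unicodedata

# Unicode general categories that rarely occur in legitimately extracted text:
# Cc = control, Co = private use, Cn = unassigned, Cs = surrogate.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}
# Ordinary whitespace controls are fine.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def suspicious_ratio(text: str) -> float:
    """Fraction of characters in `text` that look like extraction garbage."""
    if not text:
        return 0.0
    bad = 0
    for ch in text:
        if ch in ALLOWED_CONTROLS:
            continue
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
            bad += 1
    return bad / len(text)


print(suspicious_ratio("Chapter 1: Derivative Securities"))
print(suspicious_ratio("\x01\x02\uf020\ufffd"))
```

A check like this only catches one kind of failure. Text that was mapped to the wrong but otherwise ordinary letters will pass it unnoticed.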

What to do in such a case

There are multiple options, more or less feasible depending on your specific case:

  1. Ask the source of the PDF for a version that contains proper information for text extraction.

    Unless you have a contract with that source that requires them to supply the PDFs in a machine-readable form, or the source is otherwise obligated to do so, they will usually decline, though.

  2. Apply OCR to the PDF in question.

    Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of questionable quality. For example, in your "PDF copy text issue-Text layer workaround.pdf", the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites".

  3. You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0".

    Depending on the number of different fonts you have to create the mappings for, this approach can easily require too much time and effort.
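To give an idea of what a manually created ToUnicode map from option 3 looks like, here is a minimal sketch that generates the CMap text from a code-to-Unicode table. `build_tounicode_cmap` is a hypothetical helper written for this illustration, and the example mapping is invented. For a real document you would have to work out the table yourself, one font at a time, by looking at the glyphs. Embedding the result as the font's ToUnicode stream still requires a PDF library (the answer mentioned above uses PDFBox) and is not shown here.

```python
from typing import Dict


def build_tounicode_cmap(mapping: Dict[int, str], code_bytes: int = 2) -> str:
    """Build the text of a ToUnicode CMap from {character code: Unicode text}.

    code_bytes is the length of a character code in the font:
    1 for simple fonts, typically 2 for composite fonts with Identity-H.
    """
    width = code_bytes * 2  # hex digits per character code

    def hex_code(code: int) -> str:
        return f"<{code:0{width}X}>"

    def hex_text(text: str) -> str:
        # Destination values are written as UTF-16BE in hex.
        return "<" + text.encode("utf-16-be").hex().upper() + ">"

    items = sorted(mapping.items())
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        f"{hex_code(0)} <{'FF' * code_bytes}>",
        "endcodespacerange",
    ]
    # A single bfchar section may hold at most 100 entries, so split into chunks.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        lines.extend(f"{hex_code(code)} {hex_text(text)}" for code, text in chunk)
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


# Invented example: three character codes and the text their glyphs show.
example = {0x0003: " ", 0x0024: "A", 0x0050: "\ufb01"}
print(build_tounicode_cmap(example))
```

The effort mentioned above lies in filling in the table, not in generating the CMap: every distinct glyph in every affected font needs an entry.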

mkl answered Dec 06 '22 11:12