
Extracting text from garbled PDF [closed]

I have a PDF file with valuable textual information.

The problem is that I cannot extract the text: all I get is a bunch of garbled symbols. The same happens if I copy and paste the text from the PDF reader into a text file. Even File -> Save as text in Acrobat Reader fails.

I have tried every tool I could get my hands on, and the result is always the same. I believe this has something to do with font embedding, but I don't know what exactly.

My questions:

  • What causes this weird text garbling?
  • How can I extract the text content from the PDF (programmatically, with a tool, by manipulating the bits directly, etc.)?
  • How can I fix the PDF so that the text is not garbled on copy?
SNAG asked Aug 29 '12 18:08




2 Answers

I had the same problem. Uploading it to Google Drive, opening with Google Docs and copying the text from there worked for me.

knutson answered Sep 30 '22 04:09


Some PDF files are produced without the special information that is crucial for successful extraction of text from them, even by the Adobe tools. Basically, such files do not contain glyph-to-character mapping information.

Such files will be displayed and printed just fine (because the shapes of the characters are properly defined), but text can't be properly copied or extracted from them (because there is no information about the meaning of the glyphs/shapes used).
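To make the effect concrete, here is a minimal Python sketch of the situation described above. It is a toy model, not how any real PDF producer works: the numbering scheme in `subset_encode` is an assumption chosen for illustration, and the "shape" of a glyph is simply represented by the character it draws.

```python
def subset_encode(text):
    """Toy model of a subset font with a custom encoding.

    Assumption for illustration: codes 1, 2, 3, ... are handed out to
    glyphs in order of first use. Real producers use their own schemes.
    """
    code_for_char = {}
    codes = []
    for ch in text:
        if ch not in code_for_char:
            code_for_char[ch] = len(code_for_char) + 1
        codes.append(code_for_char[ch])
    # The embedded font knows which shape to draw for each code.
    # Here the "shape" is stood in for by the character itself.
    glyph_for_code = {code: ch for ch, code in code_for_char.items()}
    return bytes(codes), glyph_for_code


def render(codes, glyph_for_code):
    """Displaying or printing only needs code -> shape, so it works."""
    return "".join(glyph_for_code[c] for c in codes)


def naive_copy(codes):
    """With no code -> character mapping, a tool can only guess,
    for example by treating the raw codes as Latin-1 text."""
    return codes.decode("latin-1")


codes, glyphs = subset_encode("Hello")
print(render(codes, glyphs))     # Hello
print(repr(naive_copy(codes)))   # '\x01\x02\x03\x03\x04'
```

The page content stores only the codes, so rendering is unaffected, while copy and paste has nothing to go on except the codes themselves.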

For example, Distiller produces such files when the "Smallest File Size" preset is used.

I'm afraid that OCR is the only way to retrieve text from such files. We recently published a guide for how to OCR PDFs in .NET.


Supplementing the original answer

The original answer mentioned the "information about meaning of used glyphs/shapes". This information should be contained in a PDF structure called a /ToUnicode table. Such a table is required for each and every font which is embedded as a subset and uses non-standard (Custom) encoding.
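As a rough illustration of what such a table contributes, the following Python sketch parses the `bfchar` entries of a small, hand-written ToUnicode CMap and uses them to turn character codes into text. `SAMPLE_CMAP`, `parse_bfchar` and `extract_text` are names invented for this example; the parser is a deliberate simplification that ignores `bfrange` sections and multi-byte codes, so it is not a general-purpose implementation.

```python
import re

# Hand-written sample of a ToUnicode CMap stream (illustrative only).
SAMPLE_CMAP = """
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
1 begincodespacerange
<00> <FF>
endcodespacerange
4 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
endbfchar
endcmap
"""


def parse_bfchar(cmap_text):
    """Collect code -> Unicode string pairs from all bfchar sections.

    Simplification: bfrange sections are not handled.
    """
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for src, dst in pairs:
            # Destination values are UTF-16BE code units written in hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def extract_text(codes, to_unicode):
    """Map single-byte codes to text; unknown codes become U+FFFD."""
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


to_unicode = parse_bfchar(SAMPLE_CMAP)
print(extract_text(b"\x01\x02\x03\x03\x04", to_unicode))  # Hello
```

When the table is absent, `to_unicode` would be empty and every code would fall through to the replacement character, which is the programmatic equivalent of the garbled copy-and-paste result.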

To quickly evaluate the chances of extracting the text contents, you can use the pdffonts command line utility. It prints, in tabular form, a series of items about each font used by the PDF. The presence of a /ToUnicode table is indicated by the column headed uni.

A few example outputs:

$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-good.pdf

    name                     type        encoding   emb sub uni object ID
    ------------------------ ----------- ---------- --- --- --- ---------
    BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
    CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes yes     13  0


$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad1.pdf

    name                     type        encoding   emb sub uni object ID
    ------------------------ ----------- ---------- --- --- --- ---------
    BAAAAA+Helvetica         TrueType    WinAnsi    yes yes no      12  0
    CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0


$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad2.pdf

    name                     type        encoding   emb sub uni object ID
    ------------------------ ----------- ---------- --- --- --- ---------
    BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
    CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0

The good.pdf lets you extract the text contents for both fonts correctly, because both fonts have an accompanying /ToUnicode table.

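Going by the output above, bad1.pdf has no /ToUnicode table for either font, so text extraction can be expected to fail for both. For bad2.pdf, extraction succeeds only for one of the two fonts (Helvetica) and fails for the other (Helvetica-Bold), because only one font has a /ToUnicode table.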
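If you want to run this check over many files, the uni column can be read programmatically. The sketch below is one possible approach in Python, assuming the column layout shown in the outputs above; `fonts_without_tounicode` is a hypothetical helper name, and the parser assumes that font names contain no spaces.

```python
# Sample text in the layout printed by pdffonts (copied from the
# bad2.pdf example above, without the leading indentation).
SAMPLE_OUTPUT = """\
name                     type        encoding   emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0
"""


def fonts_without_tounicode(pdffonts_output):
    """Return the names of fonts whose 'uni' column reads 'no'.

    Rows are read from the right, because each data row ends with
    emb, sub, uni and the two-part object ID, while the type column
    may consist of more than one word.
    """
    missing = []
    for line in pdffonts_output.splitlines():
        tokens = line.split()
        if len(tokens) < 6:
            continue
        uni = tokens[-3]
        if uni not in ("yes", "no"):
            continue  # header or separator line
        if uni == "no":
            # Assumption: the font name is the first token.
            missing.append(tokens[0])
    return missing


print(fonts_without_tounicode(SAMPLE_OUTPUT))
# ['CAAAAA+Helvetica-Bold']
```

To feed it real data, the output of something like `subprocess.run(["pdffonts", path], capture_output=True, text=True).stdout` could be passed in, provided pdffonts is installed and on the PATH.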

I, Kurt Pfeifle, have recently created a series of hand-coded PDF files to demonstrate the influence of existing, buggy, manipulated or missing /ToUnicode tables in the PDF source code. These PDFs are extensively commented and suitable for exploring in a text editor. The pdffonts output examples above were created with the help of these hand-coded files. (There are a few more PDFs showing different results, which an interested reader may want to explore...)

Bobrovsky answered Sep 30 '22 04:09