Copy+pasting text from PDF results in garbage

Tags: pdf, pdfbox

I am writing a Master's thesis on an NLP system. One of its components is a text extractor.

It extracts plain text from PDF files. A few PDF files cannot be extracted correctly. For these, the extractor (PDFBox library) returns strings like the following:

"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h"

"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"

I checked every file that causes this extraction problem, and in all of them the text cannot be copied and pasted from a PDF reader either (Adobe Reader and Foxit Reader). The files display correctly in these readers, but after selecting the content and copying it to the clipboard I get the same wrong text as described above: strings of meaningless characters, or strings of digits and letters.

Could anybody help me?

asked May 28 '10 01:05

Michal_R

2 Answers

Very often in such cases, where you can't select and copy'n'paste text from the Acrobat (Reader) window, there is another option that may work nevertheless:

Open 'File' menu,
select 'Save as...',
select 'Text (normal) (*.txt)',
browse to the target directory,
type the name you want to use for the text file.

You'll have all text from all pages in the file and need to locate the spot you initially wanted to copy'n'paste, so it is not as convenient as direct copy'n'paste. But it works more reliably.

It also works with acroread on Linux (but you have to choose 'Save as text...' from the file menu).

Update

You can use the pdffonts command line utility to get a quick-shot analysis of the fonts used by a PDF.

Here is an example output, which demonstrates where a problem for text extraction will very likely occur. It uses one of the hand-coded PDF files from a GitHub repository that was created to provide PDF sample files which are well commented and can easily be opened in a text editor:

$ pdffonts  textextract-bad2.pdf
  name                            type         encoding    emb sub uni object ID
  ------------------------------- ------------ ----------- --- --- --- ---------
  BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
  CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0

How to interpret this table?

The above PDF file uses two subsetted fonts, Helvetica and Helvetica-Bold (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column).
Both fonts are of type TrueType.
Both fonts use a WinAnsi encoding (a font encoding maps the character identifiers used in the PDF source code to the glyphs that should be drawn). However, a /ToUnicode table is available inside the PDF only for font /Helvetica; for /Helvetica-Bold there is none, as indicated by the yes/no in the uni column.
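
If you have many files to check, the reading of the table described above can be automated. The following is only a rough sketch, assuming the column layout shown in the sample output; the names `FontInfo`, `parse_pdffonts` and `run_pdffonts` are made up for this example and are not part of pdffonts. It splits each row from the right, because the type column may itself contain spaces.

```python
import subprocess
from typing import NamedTuple

# Sample output copied from the pdffonts run shown above.
SAMPLE = """\
  name                            type         encoding    emb sub uni object ID
  ------------------------------- ------------ ----------- --- --- --- ---------
  BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
  CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0
"""


class FontInfo(NamedTuple):
    name: str
    font_type: str
    encoding: str
    embedded: bool
    subset: bool
    has_tounicode: bool


def parse_pdffonts(output: str) -> list[FontInfo]:
    """Parse the table printed by pdffonts (assumes the layout shown above)."""
    lines = output.splitlines()
    # Data rows start right after the ruler line made only of dashes and spaces.
    ruler = next(i for i, ln in enumerate(lines)
                 if ln.strip() and set(ln.strip()) <= {"-", " "})
    fonts = []
    for line in lines[ruler + 1:]:
        if not line.strip():
            continue
        # The last six fields are single tokens: encoding, emb, sub, uni,
        # object number, generation number. Whatever is left is name + type.
        head, encoding, emb, sub, uni, _obj, _gen = line.strip().rsplit(None, 6)
        name, _, font_type = head.partition(" ")
        fonts.append(FontInfo(name, font_type.strip(), encoding,
                              emb == "yes", sub == "yes", uni == "yes"))
    return fonts


def run_pdffonts(pdf_path: str) -> list[FontInfo]:
    """Run pdffonts on a file; assumes the tool is installed and on PATH."""
    result = subprocess.run(["pdffonts", pdf_path],
                            capture_output=True, text=True, check=True)
    return parse_pdffonts(result.stdout)


for font in parse_pdffonts(SAMPLE):
    if not font.has_tounicode:
        print(f"{font.name}: no /ToUnicode table, extraction is likely to fail")
```

For a real file you would call `run_pdffonts("some.pdf")` instead of parsing `SAMPLE`.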

The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters.

A missing /ToUnicode table for a specific font is almost always an indicator that text strings using this font cannot be extracted or copied and pasted from the PDF. (Even if a /ToUnicode table is there, text extraction may still pose a problem, because this table may be damaged, incorrect or incomplete, as seen in many real-world PDF files and as also demonstrated by a few companion files in the above linked GitHub repository.)
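
To illustrate why the reverse mapping matters, here is a deliberately simplified toy model. It is not how PDF files or PDFBox actually store or decode text: the font here simply numbers its glyphs in order of first use, and the fallback of reading the codes as Latin-1 bytes is a made-up stand-in for "the extractor has to guess".

```python
# Toy model: a subsetted font that numbers its glyphs in order of first use.
text = "Hello"
code_for_char = {}
for ch in text:
    code_for_char.setdefault(ch, len(code_for_char) + 1)

# What the page content would hold: character codes, not characters.
codes = [code_for_char[ch] for ch in text]

# The role of a /ToUnicode table: map each code back to a character.
to_unicode = {code: ch for ch, code in code_for_char.items()}


def extract(codes, to_unicode=None):
    """Turn character codes back into text."""
    if to_unicode is not None:
        return "".join(to_unicode[c] for c in codes)
    # Hypothetical fallback: without the table we can only guess,
    # here by reading the codes as Latin-1 bytes.
    return bytes(codes).decode("latin-1")


print(extract(codes, to_unicode))  # readable text
print(repr(extract(codes)))        # control characters, i.e. garbage
```

A viewer can still draw the page correctly in both cases, because drawing only needs the code-to-glyph direction.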

answered Oct 12 '22 17:10

Kurt Pfeifle

If you are able to successfully select and copy the text in Adobe Reader (which indicates that the PDF does contain text objects), but the copied text looks like a bunch of garbage characters when you paste it into Notepad, then the problem is probably related to the CMap that the selected text uses.

The PDF specification provides many options for the display of textual content and the related extraction of the text content. A CMap specifies the mapping from character codes to character selectors. The PDF spec outlines some predefined CMaps, but other CMaps can also be embedded.

My guess is that either the CMap for this text is corrupt or that the PDFBox library doesn't support this particular CMap. I suggest trying a different SDK just to see if you get any different results.
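
When comparing the output of several SDKs over many files, a rough automatic check can save some manual reading. The sketch below is only a heuristic under my own assumptions: the function names and the 5% threshold are made up, and it measures the share of characters in Unicode categories that rarely occur in running text (control, format, private use, unassigned, and "other symbol", which includes box-drawing characters). It would catch the first sample string from the question, but not the second one, which consists only of digits and letters; that kind of output would need something like a dictionary check.

```python
import unicodedata

# Categories that are rare in ordinary prose:
# Cc control, Cf format, Co private use, Cn unassigned, So other symbols.
SUSPICIOUS_CATEGORIES = {"Cc", "Cf", "Co", "Cn", "So"}

# First sample string from the question.
GARBLED = '┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h'


def suspicious_ratio(text: str) -> float:
    """Share of non-whitespace characters in suspicious categories."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(unicodedata.category(c) in SUSPICIOUS_CATEGORIES for c in chars)
    return bad / len(chars)


def looks_garbled(text: str, threshold: float = 0.05) -> bool:
    """Flag text whose suspicious share exceeds an arbitrary threshold."""
    return suspicious_ratio(text) > threshold


print(looks_garbled(GARBLED))
print(looks_garbled("It is extracting a plain text from PDF files."))
```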

answered Oct 12 '22 17:10

Rowan
