How to cut-paste from PDF with non-ASCII encoding?

I have some PDFs and I am trying to cut and paste the text they contain from Acrobat Reader into an HTML form. I suspect that some of these files use Unicode for text encoding, because when I paste into the HTML form (in Firefox) I get little boxes with hex characters in them rather than readable text. The problem is not that the PDF has not been OCRed: when I try to do that in Acrobat Pro, it says it can't because the file already contains renderable text. Is there any way to deal with this? For example, could I add some sort of JavaScript to the form that would do the conversion?

Steve asked Feb 04 '12 18:02



1 Answer

Are you able to paste text copied from the file into other programs, such as Notepad or Word?

Some PDF files, even ones produced by Adobe tools, are created without information that is crucial for extracting text from them. Specifically, such files do not contain glyph-to-character mapping information.

Such files display and print just fine, but text from them cannot be copied or extracted correctly.

For example, Distiller produces such files when the "Smallest File Size" preset is used.
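
One rough way to check whether pasted text suffers from a missing glyph-to-character mapping is to look at the Unicode categories of the characters that arrive. The sketch below is only a heuristic and is not part of the original answer: the names `suspect_ratio` and `looks_unmapped`, the 20% threshold, and the assumption that unmapped glyphs tend to come through as private-use, unassigned, or control characters are all illustrative choices. It uses only Python's standard `unicodedata` module.

```python
import unicodedata

# Categories that often signal a failed glyph-to-character mapping
# (assumption for this sketch): Co = private use, Cn = unassigned,
# Cs = surrogate.
SUSPECT_CATEGORIES = {"Co", "Cn", "Cs"}

# Control characters that are normal in pasted text.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def suspect_ratio(text: str) -> float:
    """Return the fraction of characters that look like unmapped glyphs."""
    if not text:
        return 0.0
    suspect = 0
    for ch in text:
        category = unicodedata.category(ch)
        if category in SUSPECT_CATEGORIES or ch == "\ufffd":
            # Private-use/unassigned code points or the replacement character.
            suspect += 1
        elif category == "Cc" and ch not in ALLOWED_CONTROLS:
            # Stray control characters other than newline, return, or tab.
            suspect += 1
    return suspect / len(text)


def looks_unmapped(text: str, threshold: float = 0.2) -> bool:
    """Guess whether the text came from a PDF without usable mapping data."""
    return suspect_ratio(text) >= threshold
```

If `looks_unmapped` flags text copied from a PDF, a script in the HTML form is unlikely to recover the original characters, because the information needed for the conversion was never stored in the file.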

Bobrovsky answered Sep 21 '22 11:09