I am writing a Master's thesis on an NLP system. One of its components is an extractor.
It extracts plain text from PDF files. A few PDF files cannot be extracted correctly. For those, the extractor (the PDFBox library) returns a string like this:
"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h"
or
"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"
I checked each file that causes this extraction problem, and in all of them the text also cannot be copy-pasted from a PDF reader (Adobe Reader and Foxit Reader). Viewing the files in these readers works, but after selecting the content and copying it to the clipboard I get the same wrong text as described above: strings of meaningless characters, or strings of digits and letters.
Could anybody help me?
Copy the text: Choose Edit > Copy to copy the selected text to another application. Right-click on the selected text, and then select Copy. Right-click on the selected text, and then choose Copy With Formatting.
Open the PDF in Acrobat. Go to Tools > Edit > Scanned Documents > Settings. In the Scanned Document Editing Settings dialog box, deselect the Use available system font option. Click OK.
The original author of the file may not have wanted other users to copy information directly out of the document and thus used a PDF editor to restrict those functions. Again, you should contact the original author where possible if you need many passages from the text.
When you see scrambled text, dots, odd characters, or white blocks that look like tofu, it means that the PDF doesn't have the original fonts embedded. 3 solutions, take your pick of which one will work for your situation: Install the missing fonts on the computer where you're viewing the PDF.
We need some specialised converter that can convert the Hindi text to Unicode. Either the text in the PDF is wrong, or the PDF viewer does not have the correct font to process the text for copying. If the PDF was created from a scanned document, the text itself may be wrong; try to re-OCR the document.
Very often in such cases, where you can't select and copy'n'paste text from the Acrobat (Reader) window, there is another option which may work nevertheless: save the whole document as a plain-text file instead of copying from the window.
You'll have all text from all pages in the file and need to locate the spot you wanted to copy'n'paste initially, so it is not as comfortable as direct copy'n'paste. But it works more reliably.
It also works with acroread on Linux (but you have to choose 'Save as text...' from the file menu).
You can use the pdffonts command line utility to get a quick-shot analysis of the fonts used by a PDF.
Here is an example output, which demonstrates where a problem for text extraction will very likely occur. It uses one of these hand-coded PDF files from a GitHub-Repository which was created to provide PDF sample files which are well commented and may easily be opened in a text editor:
$ pdffonts textextract-bad2.pdf
name                            type         encoding    emb sub uni object ID
------------------------------- ------------ ----------- --- --- --- ---------
BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0
How to interpret this table?
The above PDF file uses two subsetted fonts (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column), Helvetica and Helvetica-Bold.
Both fonts are of type TrueType.
Both fonts use a WinAnsi encoding (a font encoding maps char identifiers used in the PDF source code to glyphs that should be drawn).
However, only for font /Helvetica is there a /ToUnicode table available inside the PDF (for /Helvetica-Bold there is none), as indicated by the yes/no in the uni column.
The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters.
A missing /ToUnicode table for a specific font is almost always a sure indicator that text strings using this font cannot be extracted or copied'n'pasted from the PDF. (Even if a /ToUnicode table is there, text extraction may still pose a problem, because this table may be damaged, incorrect or incomplete, as seen in many real-world PDF files and as also demonstrated by a few companion files in the above linked GitHub repository.)
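If you have many PDFs to check, the uni column can be read by a script instead of by eye. Here is a minimal Python sketch, assuming pdffonts prints the column layout shown above (name, type, encoding, emb, sub, uni, object ID). The helper names parse_pdffonts, fonts_without_tounicode and run_pdffonts are made up for this example, and run_pdffonts assumes the pdffonts binary is on your PATH.

```python
import subprocess

def parse_pdffonts(output):
    """Parse the table printed by pdffonts into one dict per font."""
    rows = output.splitlines()
    # The font rows start right after the dashed ruler under the header.
    ruler = next((i for i, row in enumerate(rows) if row.startswith("---")), None)
    if ruler is None:
        raise ValueError("no pdffonts table found")
    fonts = []
    for row in rows[ruler + 1:]:
        tokens = row.split()
        if len(tokens) < 8:
            continue  # blank or unexpected line
        # Read from the right, because the type column may hold several
        # words (for example "CID TrueType"): the last two tokens are the
        # object ID, before them come uni, sub, emb and the encoding.
        fonts.append({
            "name": tokens[0],
            "type": " ".join(tokens[1:-6]),
            "encoding": tokens[-6],
            "emb": tokens[-5] == "yes",
            "sub": tokens[-4] == "yes",
            "uni": tokens[-3] == "yes",
        })
    return fonts

def fonts_without_tounicode(output):
    """Names of fonts whose uni column says 'no'."""
    return [f["name"] for f in parse_pdffonts(output) if not f["uni"]]

def run_pdffonts(pdf_path):
    """Run the pdffonts command line tool and return its stdout."""
    result = subprocess.run(["pdffonts", pdf_path],
                            capture_output=True, text=True, check=True)
    return result.stdout

# The example output from above, used here instead of a real PDF.
SAMPLE = """\
name                            type         encoding    emb sub uni object ID
------------------------------- ------------ ----------- --- --- --- ---------
BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0
"""

print(fonts_without_tounicode(SAMPLE))
```

For a real file you would pass run_pdffonts("some.pdf") instead of SAMPLE. A non-empty result only tells you where trouble is likely; as noted above, a font that does have a /ToUnicode table can still extract badly.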
If you are able to successfully select and copy the text in Adobe Reader (which indicates that the PDF does contain text objects) but you can't paste the copied text into Notepad without it looking like a bunch of garbage characters, then the problem is probably related to the CMap that the selected text uses.
The PDF specification provides many options for the display of textual content and the related extraction of the text content. A CMap specifies the mapping from character codes to character selectors. The PDF spec outlines some predefined CMaps, but other CMaps can also be embedded.
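To make the role of such a mapping concrete, here is a toy Python illustration. It is not how PDFBox or any real extractor is implemented; the codes, the to_unicode dictionary and the decode function are invented for the example, and the fallback (treating each code as a code point) is just one plausible guess an extractor might make when no reverse mapping exists.

```python
def decode(codes, to_unicode=None):
    """Turn the character codes of a text string into readable text."""
    if to_unicode is None:
        # No reverse mapping available: guess by treating each code
        # as a Unicode code point. For a subsetted font this guess
        # is usually wrong.
        return "".join(chr(code) for code in codes)
    # Codes missing from the table become U+FFFD (replacement character).
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)

# Hypothetical subsetted font: codes were handed out in order of first
# use while the PDF was written, so they carry no meaning by themselves.
to_unicode = {1: "h", 2: "e", 3: "l", 4: "o"}
codes = [1, 2, 3, 3, 4]

print(decode(codes, to_unicode))   # hello
print(repr(decode(codes)))         # '\x01\x02\x03\x03\x04'
```

The viewer can still draw the right glyphs in both cases, because drawing only needs the code-to-glyph direction. Copying and extraction need the code-to-character direction, which is what goes missing.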
My guess is that either the CMap for this text is corrupt or that the PDFBox library doesn't support this particular CMap. I suggest trying a different SDK just to see if you get any different results.
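When comparing extractors over many files, it can help to flag suspicious output automatically rather than reading every result. Below is a rough Python heuristic, offered as a sketch only: it counts how many whitespace-separated tokens consist purely of letters once surrounding ASCII punctuation is stripped. The function names and the 0.5 threshold are arbitrary choices for this example and would need tuning on real data (text full of numbers, for instance, would score low without being garbled).

```python
import string

def wordlike_ratio(text):
    """Share of tokens that are purely alphabetic after trimming punctuation."""
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = sum(1 for t in tokens if t.strip(string.punctuation).isalpha())
    return wordlike / len(tokens)

def looks_garbled(text, threshold=0.5):
    """Flag text in which fewer than `threshold` of the tokens look like words."""
    return wordlike_ratio(text) < threshold

# The two broken strings quoted in the question, plus a normal sentence.
garbled_symbols = '┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h'
garbled_digits = "10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"
normal = "There are a few PDF files that can not be extracted correctly."

for sample in (garbled_symbols, garbled_digits, normal):
    print(looks_garbled(sample), sample[:30])
```

Files flagged this way are the ones worth re-running through another library or, failing that, through OCR.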