How to save text file in UTF-8 format using pdftotext

I am using the open-source tool pdftotext to convert PDFs to text files. How can I save the text files in UTF-8 so that all accented characters are retained? I am using the command below; it extracts the content to a text file, but the accented characters do not appear.

pdftotext -enc UTF-8 book1.pdf book1.txt

Please help me to resolve this issue.

Thanks in advance,

Amar asked Oct 28 '10 05:10


1 Answer

You can get a list of available encodings using the command:

pdftotext -listenc

and pick the right one with the -enc argument. My build seems to output UTF-8 by default, i.e. your "-enc UTF-8" is superfluous:

pdftotext -enc UTF-8 your.pdf

You may want to check your locale (LC_ALL, LANG, ...).
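To see which of those locale variables is actually in effect, something like the following Python sketch may help. The helper name `effective_locale` is made up for illustration, and the lookup order (LC_ALL, then LC_CTYPE, then LANG) is the usual POSIX precedence for character handling; whether a given pdftotext build consults the locale at all is an assumption to verify.

```python
import os

# POSIX precedence for character handling:
# LC_ALL overrides LC_CTYPE, which overrides LANG.
LOCALE_VARS = ("LC_ALL", "LC_CTYPE", "LANG")


def effective_locale(env=None):
    """Return (variable, value) for the first non-empty locale variable, else None.

    `env` is a hypothetical parameter so the lookup can be tried on a plain dict;
    by default the real process environment is used.
    """
    env = os.environ if env is None else env
    for name in LOCALE_VARS:
        value = env.get(name)
        if value:
            return name, value
    return None


if __name__ == "__main__":
    print(effective_locale())
```

A value ending in something like ".UTF-8" suggests the terminal and tools are expected to handle UTF-8; a result of None or "C" is worth a closer look.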

EDIT: I downloaded the following PDF: http://www.i18nguy.com/unicode/unicodeexample.pdf

and converted it on a German Windows 7 PC with Xpdf 3.02pl5 using the command:

pdftotext.exe -enc UTF-8 unicodeexample.pdf

The text file is definitely UTF-8 encoded, as all characters are displayed correctly. What are you using the text file for? If you're displaying it through a web application, your content encoding might simply be wrong, while the text file has been converted as you wanted it to.
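To illustrate what a wrong content encoding looks like, here is a small Python sketch (not part of pdftotext; the function name `misread_as_latin1` is made up). It takes correctly encoded UTF-8 bytes and decodes them as ISO-8859-1, which is roughly what a web page with the wrong charset would do.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode those same bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    # The single character é (bytes C3 A9 in UTF-8) shows up as two characters.
    print(misread_as_latin1("café"))
```

If your accented characters look like "Ã©" instead of "é", the file is most likely fine and the reader is using the wrong encoding.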

Double-check using either a browser (in Firefox, force the encoding first to ISO-8859-1 and then to UTF-8, and compare the results) or a hex editor.
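If no hex editor is at hand, the same check can be approximated with a few lines of Python. This is only a sketch: the helper names `is_valid_utf8` and `hex_dump` are made up, and the file to inspect is passed on the command line (for example the book1.txt from the question).

```python
import sys


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def hex_dump(data, limit=16):
    """Return the first `limit` bytes as space-separated hex pairs."""
    return " ".join("{:02x}".format(b) for b in data[:limit])


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:  # read raw bytes, no decoding
        content = handle.read()
    print("valid UTF-8:", is_valid_utf8(content))
    print(hex_dump(content, limit=64))
```

In UTF-8 an "é" appears as the two bytes c3 a9, while in ISO-8859-1 it is the single byte e9. Note that a file containing only ASCII also passes the validity check, so the hex output is what tells you whether accented characters made it into the file.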

icanhasserver answered Nov 03 '22 15:11