
Where is file needed for PDFTOTEXT output in UTF-8 format?

I want to use the Xpdf-based pdftotext command-line tool to extract text from PDF files as UTF-8. Stack Overflow questions 4039930, 3809761 and 13618330 show that others have managed to get UTF-8 output from it.

When I use the option -enc utf-8 these messages are displayed:

Syntax Error: Couldn't find unicodeMap file for the 'utf-8' encoding
Config Error: Couldn't get text encoding

The documentation I've seen says that UTF-8 (among other encodings) is "predefined", but I cannot find the file that I need to point to. (I've looked at several different downloads of Xpdf-based software and have not found it yet.)

Any pointers would be appreciated.

EDIT: I am on Windows.

J.Merrill asked Nov 21 '13 17:11



1 Answer

You should use UTF-8 instead of utf-8: the encoding name is matched case-sensitively. See the list of encodings that pdftotext prints:

$ pdftotext -listenc
Available encodings are:
UCS-2
ASCII7
Latin1
UTF-8
ZapfDingbats
Symbol

Proof code:

$ pdftotext -eol unix -nopgbrk -layout -enc utf-8 file.pdf
Syntax Error: Couldn't find unicodeMap file for the 'utf-8' encoding
Command Line Error: Couldn't get text encoding
$ pdftotext -eol unix -nopgbrk -layout -enc UTF-8 file.pdf
$ echo $?
0
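
If the encoding name comes from user input or a config file, one way to avoid the case mismatch is to look it up in the list of available encodings while ignoring case. This is only a sketch: normalize_enc is a hypothetical helper, not part of pdftotext, and the encoding list is hard-coded from the -listenc output above rather than read from the tool.

```shell
#!/bin/sh
# Hypothetical helper (not part of pdftotext): print the exact spelling of
# an encoding name from a list of available names, comparing case-insensitively.
# Usage: normalize_enc WANTED NAME...
normalize_enc() {
    wanted=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    shift
    for name in "$@"; do
        upper=$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]')
        if [ "$upper" = "$wanted" ]; then
            printf '%s\n' "$name"
            return 0
        fi
    done
    # No match: print nothing and report failure to the caller.
    return 1
}

# Assumption: this list is copied by hand from the -listenc output above.
enc=$(normalize_enc utf-8 UCS-2 ASCII7 Latin1 UTF-8 ZapfDingbats Symbol) && echo "$enc"
# prints: UTF-8

# The result could then be passed on, for example:
# pdftotext -enc "$enc" file.pdf
```

The function returns a non-zero status when nothing matches, so a script can stop before calling pdftotext with a name it would reject.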
Artem Klevtsov answered Oct 13 '22 11:10
