LESSCHARSET=utf-8 less doesn't seem to work

Tags:

unix

utf-8

I'm trying to view a UTF-8 text file/stream in less, but even if I invoke it like this:

cat file | LESSCHARSET=utf-8 less

the non-ASCII UTF-8 characters don't display correctly. Instead, their hex values appear highlighted in angle brackets, e.g. <F4>.

Reading the same text in vim with UTF-8 encoding poses no problems, so I suspect something is wrong with the way I'm invoking less.

My locale output is the following:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

My less version is the one installed by Xcode on OS X Leopard:

$ less --version | sed 's/^/    /'
less 394
Copyright (C) 1984-2005 Mark Nudelman

less comes with NO WARRANTY, to the extent permitted by law.
For information about the terms of redistribution, 
see the file named README in the less distribution.
Homepage: http://www.greenwoodsoftware.com/less

locale -a | grep US | sed 's/^/ /' outputs the following:

en_AU.US-ASCII
en_CA.US-ASCII
en_GB.US-ASCII
en_NZ.US-ASCII
en_US
en_US.ISO8859-1
en_US.ISO8859-15
en_US.US-ASCII
en_US.UTF-8
dan asked Jan 22 '10 04:01



2 Answers

  1. What does the locale command output? Is it a UTF-8 locale?

  2. Are you sure your terminal is set to display UTF-8? Does echo -e '\xe2\x82\xac' produce the € (euro) sign?

  3. Is the locale that you have set even installed on the system? Is it present in the list that locale -a outputs?

  4. What version of less are you using? (Run less --version to find out.) Really, really old versions did not even support LESSCHARSET. This is less likely to be the case, because I have a Debian "sarge" system with less version 382, and it does not even need LESSCHARSET if the locale is set correctly.
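The four checks above can be run in one go. This is only a sketch: it uses printf with octal escapes instead of echo -e (whose escape handling differs between shells), and the grep pattern for UTF-8 locale names is an assumption about how your system spells them.

```shell
#!/bin/sh
# 1. Show the active locale settings (look for UTF-8 in LANG / LC_CTYPE).
locale

# 2. Emit the three UTF-8 bytes of the euro sign (e2 82 ac).
#    A terminal that is set to UTF-8 renders them as a single euro sign.
printf '\342\202\254\n'

# 3. List the installed locales whose names mention UTF-8.
locale -a | grep -i -e 'utf-8' -e 'utf8'

# 4. Print the first line of the less version banner.
less --version | head -n 1
```

If step 2 shows several odd characters instead of one euro sign, the terminal itself is not decoding UTF-8, and no LESSCHARSET setting will fix the display.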

Teddy answered Sep 23 '22 06:09


My guess is that your file isn't UTF-8 but rather ISO-8859. (Is the <F4> character supposed to be a 'ô'?)
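To see why <F4> points to ISO-8859, compare the raw bytes of 'ô' (U+00F4) in the two encodings. This is a small sketch that assumes iconv and od are available:

```shell
# In ISO-8859-1, 'ô' is the single byte 0xF4 (octal 364).
printf '\364' | od -An -tx1

# Converted to UTF-8, the same character becomes the two bytes c3 b4.
printf '\364' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
```

A UTF-8 file containing 'ô' would therefore hold c3 b4, not a bare f4, so a highlighted <F4> in less suggests the file holds the one-byte form.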

Start an xterm with LANG=en_US.ISO-8859-1 xterm (use the exact name that locale -a lists on your system; in your output it appears as en_US.ISO8859-1). Then verify the locale (the output of locale should show that same name). Then use less to view the file. Does it display correctly?

Note that it isn't enough to just set LESSCHARSET=iso8859 without starting a new terminal. LESSCHARSET tells less to assume that the terminal can interpret ISO-8859, but your terminal probably displays UTF-8, since the euro sign displays correctly. And because a lone \xf4 byte isn't a valid UTF-8 sequence, the terminal will probably show something like '�'.
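An alternative to starting a second terminal is to test the file and convert it on the fly, so that the UTF-8 terminal only ever receives UTF-8 bytes. The sketch below is not part of the original answer: the helper names is_utf8 and view_text are made up for illustration, and it assumes that any file that fails UTF-8 validation is ISO-8859-1, which may not hold for your data.

```shell
# is_utf8 FILE: succeed if FILE decodes cleanly as UTF-8.
# Relies on iconv exiting non-zero when it meets an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# view_text FILE: page FILE with less, converting from ISO-8859-1 first
# when the file is not valid UTF-8 (the source encoding is a guess).
view_text() {
    if is_utf8 "$1"; then
        less "$1"
    else
        iconv -f ISO-8859-1 -t UTF-8 "$1" | less
    fi
}
```

After loading these definitions into your shell, call view_text file in place of cat file | less.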

Michael Closson answered Sep 19 '22 06:09