
What's the difference among the various types of 'utf-8' in Emacs?

In Emacs, after typing

M-x revert-buffer-with-coding-system

I see many types of 'utf-8', for example utf-8, utf-8-auto-unix, utf-8-emacs-unix, etc.

I want to know what the difference among them is.

I have googled them but couldn't find a proper answer.

P.S.

I ask this question because I ran into an encoding problem a few months ago. I wrote a PHP program in Emacs, and in my ~/.emacs I set

(prefer-coding-system 'utf-8)

but when I opened the PHP page in a browser, the content was not displayed correctly because of an encoding problem, even though I had written

<meta name="Content-Type" content="text/html; charset=UTF-8" />

in the page.

But after I used Notepad++ to save the file as utf-8, the browser displayed the content correctly.

So I want to learn more about encoding in Emacs.

flyer asked Jul 25 '13 15:07



1 Answer

The last part of the encoding name (e.g. mac in utf-8-mac) usually describes the special character(s) used at the end of lines:

  • -mac: CR, the standard line delimiter on Mac OS (until OS X)
  • -unix: LF, the standard delimiter on Unix-like systems (including the BSD-based Mac OS X)
  • -dos: CR+LF, the delimiter on DOS / Windows
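To make the three suffixes concrete, here is a minimal Python sketch (not Emacs code) of the bytes each convention produces for the same two-line text. The `EOL` table and the `encode_lines` helper are names made up for this illustration.

```python
# Line terminators matching the -unix, -dos and -mac suffixes above.
EOL = {
    "unix": "\n",    # LF
    "dos": "\r\n",   # CR+LF
    "mac": "\r",     # CR (classic Mac OS, before OS X)
}

def encode_lines(lines, eol_name):
    """Terminate each line with the chosen delimiter and encode as UTF-8."""
    eol = EOL[eol_name]
    return "".join(line + eol for line in lines).encode("utf-8")

for name in EOL:
    # Print the raw bytes so the terminators are visible.
    print(name, encode_lines(["a", "b"], name))
```

Only the terminator bytes differ between the three results; the encoding of the characters themselves is the same.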

Some additional encoding parameters include:

  • -emacs: support for encoding all Emacs characters (including non-Unicode ones)
  • -with-signature: force the use of the BOM (see below)
  • -auto: autodetect the BOM

These possibilities can be combined, which produces the list shown in Emacs.

To get information on the line-ending type, BOM and character sets that a coding system provides, you can use M-x describe-coding-system, or C-h C.

Concerning the BOM:

  • the Unicode standard defines a special signature placed at the very beginning of a (text) file. For utf-16, which stores characters in 2-byte (16-bit) units, it distinguishes the byte order, or endianness: some systems place the most significant byte first (big-endian -> utf-16be), others place the least significant byte first (little-endian -> utf-16le). That signature is called the BOM: the Byte Order Mark

  • in utf-8, each ASCII character (code points 0 to 127) is stored as a single byte, and every other character as a sequence of two to four bytes. Since the encoding is defined byte by byte, specifying a byte order makes no sense, but the signature is still useful for telling a utf-8 file apart from a plain ASCII one. A utf-8 file differs from an ASCII file only where non-ASCII characters occur, and that can be impossible to detect without parsing the file until one is found, whereas the pseudo-BOM makes it visible immediately. (BTW, Emacs is very good at this kind of auto-detection)

  • FYI, the BOMs are the following bytes, placed as the very first bytes of a file:

    • utf-16le : FF FE
    • utf-16be : FE FF
    • utf-8 : EF BB BF
  • you can ask Emacs to open a file without any conversion with find-file-literally: if the file starts with the bytes EF BB BF (which appear as ï»¿ when shown as Latin-1 characters), you are looking at the undecoded utf-8 BOM

  • for some additional help while playing with encodings, you can refer to this complementary answer "How to see encodings in emacs"
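The BOM bytes listed above can also be checked outside Emacs. The following is a minimal Python sketch, assuming only the three signatures from the list matter. `detect_bom` is a hypothetical helper written for this example, and it does not handle utf-32, whose little-endian BOM also begins with FF FE.

```python
import codecs

# BOM byte sequences from the list above, taken from the standard library.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16be"),  # FE FF
]

def detect_bom(data):
    """Return the encoding name whose BOM starts `data`, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

# Python's "utf-8-sig" codec writes the signature; plain "utf-8" does not.
with_bom = "é".encode("utf-8-sig")
without_bom = "é".encode("utf-8")

print(detect_bom(with_bom))     # utf-8
print(detect_bom(without_bom))  # None

# ASCII-only text is byte-identical in ASCII and in utf-8 without a BOM,
# which is why it cannot be told apart without a signature.
print("abc".encode("ascii") == "abc".encode("utf-8"))  # True
```

To run the same check on a real file, read its first bytes in binary mode (for example `open(path, "rb").read(4)`) and pass them to `detect_bom`.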

As @wvxvw said, your issue was probably a missing BOM at the beginning of the file, which caused it to be misinterpreted and rendered incorrectly. BTW, M-x hexl-mode is also a very handy tool for checking the raw content of a file. Thanks for pointing it out to me (I often use an external hex editor for that, although it can be done directly in Emacs).

Seki answered Oct 13 '22 00:10