
Vim's encoding options

Tags:

vim

Although Vim's help is a treasure trove of information, in some cases I find it mind-boggling. Its explanation of the different encoding-related options is one such case.

Can someone please explain to me, in simple terms, what the encoding, fileencoding and fileencodings settings do, and how I can
a) view the encoding of the current file?
b) change the encoding of the current file?
c) do something else which is used often, but slips my mind right now?

Rook asked Nov 14 '11 12:11


2 Answers

  • encoding is used by Vim to know what character sets it supports and how characters are stored internally.

    You shouldn't really modify this setting; it should default to something Unicode-based. Otherwise you cannot read and write files that use an extended character set.
    If you are not sure, put :set encoding=utf-8 at the start of your vimrc and never touch that setting again, except when you have to read huge files in a 1-byte encoding for a single session.

  • fileencoding stores the encoding of the current buffer.
    You can both read and set this option, and it will do what you expect.
    When you change it, the buffer is marked as modified, and when you save it to disk (:w or :up), it is written with the encoding you specified.

  • fileencodings tells Vim how to detect the encoding of every file you read (in order to determine the value of fileencoding). It is a list of encodings that are tried in order; the first one that is consistent with the binary contents of the file is assumed to be the file's encoding.
    Set it once and then forget it. You might need to change it if you know that you are going to open plenty of files that all use the same encoding and you don't want to lose time checking other encodings. A value of ucs-bom,utf8,latin1 (close to Vim's default when encoding is a Unicode value, which is ucs-bom,utf-8,default,latin1) is nice IMO if you are in Western Europe, because almost any file will be opened in the correct encoding. However, with this setting, plain ASCII files (i.e., files whose byte representation is the same in UTF-8 and in any Latin-based code page) will be assumed to be UTF-8 and saved as such.
    Example: if you set fileencodings to latin1,utf8, every file that you open will be read as latin1, because reading a file as latin1 never fails: there is a bijection between the 256 possible byte values and the characters of the character set.
    Conversely, with fileencodings=ucs-bom,utf8,latin1, Vim first checks for a byte-order mark and decodes Unicode files that have one; if that fails (no BOM), it tries to read the file as UTF-8; and if that fails too (because some byte sequences are invalid in UTF-8), it opens the file as latin1.

  • To reload a file with the proper encoding (for cases where fileencodings did not detect it correctly), you can do: :e! ++enc=<the_encoding>.
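
The three options described above could be wired into a vimrc roughly as follows. This is a minimal sketch, assuming a Vim built with multi-byte support; the values are the ones discussed in this answer, not a recommendation for every locale.

```vim
" Internal representation; set it once, near the top of the vimrc.
set encoding=utf-8

" Detection order used when reading a file: BOM first, then strict UTF-8,
" and latin1 last because an 8-bit encoding never fails to match.
set fileencodings=ucs-bom,utf-8,latin1
```

fileencoding is deliberately left out: it is filled in per buffer by the detection step and only needs to be set by hand when you want to convert a file.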

tl;dr:

  1. view the encoding of the current file: :echo &fileencoding (shorter: :echo &fenc, :set fenc? or :verb set fenc?)
  2. change the encoding of the current file: :set fenc=… and then call :w as many times as you want.
  3. reload your file with the proper encoding: :e! ++enc=…
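
As an illustration of these steps used together, converting a misdetected file might look like the session below. It is only a sketch: latin1 and utf-8 are assumed encodings picked for the example, so substitute whatever your file actually uses.

```vim
" Reload the buffer, forcing latin1 instead of the detected encoding
" (latin1 is an assumption made for this example).
:e! ++enc=latin1
" Confirm what Vim now thinks the file encoding is.
:set fenc?
" Ask for UTF-8 on the next write; the buffer becomes modified.
:set fenc=utf-8
" Write the file; it is converted to UTF-8 on disk.
:w
```

Reloading first matters: changing fenc on a buffer that was decoded with the wrong encoding would simply write the wrongly decoded characters back out in the new encoding.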
Benoit answered Sep 20 '22 22:09


encoding:
The internal representation. View or set with:

:set encoding
:set encoding=utf-8

fileencoding:

The representation that will be used when the file is written. View or set with:

:set fileencoding
:set fileencoding=utf-8

fileencodings:

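The list of possible encodings that are tried, in order, when reading a file. Note that an 8-bit encoding such as latin1 always matches, so an entry listed after it (cp1251 here) is never tried. View or set with:

:set fileencodings
:set fileencodings=utf-8,latin1,cp1251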

Here is the list of possible encodings from the Vim documentation (:help mbyte-encoding):

Supported 'encoding' values are:            *encoding-values*
1   latin1  8-bit characters (ISO 8859-1, also used for cp1252)
1   iso-8859-n  ISO_8859 variant (n = 2 to 15)
1   koi8-r  Russian
1   koi8-u  Ukrainian
1   macroman    MacRoman (Macintosh encoding)
1   8bit-{name} any 8-bit encoding (Vim specific name)
1   cp437   similar to iso-8859-1
1   cp737   similar to iso-8859-7
1   cp775   Baltic
1   cp850   similar to iso-8859-4
1   cp852   similar to iso-8859-1
1   cp855   similar to iso-8859-2
1   cp857   similar to iso-8859-5
1   cp860   similar to iso-8859-9
1   cp861   similar to iso-8859-1
1   cp862   similar to iso-8859-1
1   cp863   similar to iso-8859-8
1   cp865   similar to iso-8859-1
1   cp866   similar to iso-8859-5
1   cp869   similar to iso-8859-7
1   cp874   Thai
1   cp1250  Czech, Polish, etc.
1   cp1251  Cyrillic
1   cp1253  Greek
1   cp1254  Turkish
1   cp1255  Hebrew
1   cp1256  Arabic
1   cp1257  Baltic
1   cp1258  Vietnamese
1   cp{number}  MS-Windows: any installed single-byte codepage
2   cp932   Japanese (Windows only)
2   euc-jp  Japanese (Unix only)
2   sjis    Japanese (Unix only)
2   cp949   Korean (Unix and Windows)
2   euc-kr  Korean (Unix only)
2   cp936   simplified Chinese (Windows only)
2   euc-cn  simplified Chinese (Unix only)
2   cp950   traditional Chinese (on Unix alias for big5)
2   big5    traditional Chinese (on Windows alias for cp950)
2   euc-tw  traditional Chinese (Unix only)
2   2byte-{name} Unix: any double-byte encoding (Vim specific name)
2   cp{number}  MS-Windows: any installed double-byte codepage
u   utf-8   32 bit UTF-8 encoded Unicode (ISO/IEC 10646-1)
u   ucs-2   16 bit UCS-2 encoded Unicode (ISO/IEC 10646-1)
u   ucs-2le like ucs-2, little endian
u   utf-16  ucs-2 extended with double-words for more characters
u   utf-16le    like utf-16, little endian
u   ucs-4   32 bit UCS-4 encoded Unicode (ISO/IEC 10646-1)
u   ucs-4le like ucs-4, little endian

The {name} can be any encoding name that your system supports.  It is passed
to iconv() to convert between the encoding of the file and the current locale.
For MS-Windows "cp{number}" means using codepage {number}.
Examples:
    :set encoding=8bit-cp1252
    :set encoding=2byte-cp932

The MS-Windows codepage 1252 is very similar to latin1.  For practical reasons
the same encoding is used and it's called latin1.  'isprint' can be used to
display the characters 0x80 - 0xA0 or not.

Several aliases can be used, they are translated to one of the names above.
An incomplete list:

1   ansi    same as latin1 (obsolete, for backward compatibility)
2   japan   Japanese: on Unix "euc-jp", on MS-Windows cp932
2   korea   Korean: on Unix "euc-kr", on MS-Windows cp949
2   prc     simplified Chinese: on Unix "euc-cn", on MS-Windows cp936
2   chinese     same as "prc"
2   taiwan  traditional Chinese: on Unix "euc-tw", on MS-Windows cp950
u   utf8    same as utf-8
u   unicode same as ucs-2
u   ucs2be  same as ucs-2 (big endian)
u   ucs-2be same as ucs-2 (big endian)
u   ucs-4be same as ucs-4 (big endian)
u   utf-32  same as ucs-4
u   utf-32le    same as ucs-4le
    default     stands for the default value of 'encoding', depends on the
    environment

For the UCS codes the byte order matters.  This is tricky, use UTF-8 whenever
you can. The default is to use big-endian (most significant byte comes
first):
    name    bytes       char 
    ucs-2         11 22     1122
    ucs-2le       22 11     1122
    ucs-4   11 22 33 44 11223344
    ucs-4le 44 33 22 11 11223344

On MS-Windows systems you often want to use "ucs-2le", because it uses little
endian UCS-2.

There are a few encodings which are similar, but not exactly the same.  Vim
treats them as if they were different encodings, so that conversion will be
done when needed.  You might want to use the similar name to avoid conversion
or when conversion is not possible:

    cp932, shift-jis, sjis
    cp936, euc-cn
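
To make the byte-order note above concrete: a UCS-2 file without a byte-order mark is not caught by a ucs-bom entry in fileencodings, so one option is to force the encoding when opening it. This is a hedged sketch; the file name is hypothetical and ucs-2le is assumed to be the file's real encoding.

```vim
" Open a little-endian UCS-2 file that has no byte-order mark.
" export.txt is a made-up name used only for illustration.
:e ++enc=ucs-2le export.txt
" Check which file encoding Vim recorded for the buffer.
:set fenc?
```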
meesern answered Sep 21 '22 22:09