
Why does VIM disregard my file's BOM?

I need to make sure that a file is encoded as UTF-8.

So, I create the file

c:\> gvim umlaute.txt

In VIM I type the Umlaute:

äöü

I check the encoding ...

:set enc

(VIM echoes encoding=latin1)

and then I check the file encoding ...

:set fenc

(VIM echoes fileencoding=)

Then I write the file

:w

And check the file's size on the harddisk:

:!dir umlaute.txt

(The size is 5 bytes.) That is of course expected: 3 bytes for the text and 2 for the \x0d \x0a.

Ok, so I now set the encoding to

:set enc=utf8

The buffer gets weird:

<e4><f6><fc>

I guess this is the hex representation of the ascii characters I previously typed in. So I rewrite them

äöü

Writing, checking size:

:w
:!dir umlaute.txt

This time, it's 8 bytes. I guess that makes sense: 2 bytes for every character plus \x0d \x0a.

Ok, so I want to make sure the next time I open the file it will be opened with encoding=utf8.

:set bomb
:w

:!dir umlaute.txt

11 Bytes. This is of course 8 (previous) Bytes + 3 Bytes for the BOM (ef bb bf).

So I

:quit

vim and open the file again

and check, if the encoding is set:

:set enc

But VIM insists it's encoding=latin1.

So, why is that? I would have expected the BOM to tell VIM that this is a UTF-8 file.

René Nyffenegger asked Aug 26 '11 11:08



3 Answers

You are confusing 'encoding', which is a global Vim setting, with 'fileencoding', which is local to each buffer.

When opening a file, the option 'fileencodings' (note the final s) determines which encodings Vim will try, in order, when reading the file. If the list starts with ucs-bom, any file with a BOM will be opened properly, provided it parses correctly.
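To illustrate the idea of trying candidate encodings in order, here is a minimal Python sketch. It is only an approximation of the behaviour described above, not Vim's actual implementation: the function name detect_encoding and the candidate list are assumptions made for illustration, and only the UTF-8 BOM is checked.

```python
import codecs

def detect_encoding(data: bytes, candidates=("ucs-bom", "utf-8", "latin1")) -> str:
    """Return the first candidate that fits, loosely like walking 'fileencodings'."""
    for name in candidates:
        if name == "ucs-bom":
            # This entry only matches when the data starts with a BOM
            # (simplification: only the UTF-8 BOM ef bb bf is checked here).
            if data.startswith(codecs.BOM_UTF8):
                return "utf-8 (with BOM)"
            continue
        try:
            data.decode(name)  # strict decoding: an invalid byte rejects the candidate
        except UnicodeDecodeError:
            continue
        return name
    return "undetected"

print(detect_encoding(b"\xef\xbb\xbf\xc3\xa4\xc3\xb6\xc3\xbc\r\n"))  # BOM present
print(detect_encoding(b"\xc3\xa4\xc3\xb6\xc3\xbc\r\n"))              # valid UTF-8, no BOM
print(detect_encoding(b"\xe4\xf6\xfc\r\n"))                          # falls through to latin1
```

The last entry acts as a catch-all because every byte sequence decodes as latin1, which is why it is placed at the end of the list.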

If you want to change the encoding of a file, use :set fenc=<foo>. If you want to add or remove the BOM, use :set bomb or :set nobomb. Then use :w to save.

Avoid changing enc after a buffer has been opened; it can mess things up. enc determines which characters Vim can work with internally, and it has nothing to do with the files you are editing.

Details

c:\> gvim umlaute.txt

You are starting Vim with a nonexistent file name. Vim creates a buffer, gives it that name, and sets fenc to an empty value, since there is no file associated with it yet.

:set enc

(VIM echoes encoding=latin1)

This means that Vim stores the buffer contents internally as latin1, i.e. ISO-8859-1.

and then I check the file encoding ...

:set fenc

(VIM echoes fileencoding=)

This is normal: there is no file yet.

Then I write the file

:w

Since 'fileencoding' is empty, Vim writes the buffer to disk using the internal encoding, latin1.
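The resulting file size can be checked outside Vim. The following Python sketch assumes Windows line endings (\x0d \x0a), as in the question, and simply counts the bytes that a latin1 write would produce:

```python
text = "äöü"
line_ending = "\r\n"  # assumption: Windows-style line ending, as in the question

# Encode the way a latin1 write would: one byte per character.
on_disk = (text + line_ending).encode("latin-1")

print(on_disk.hex(" "))  # e4 f6 fc 0d 0a
print(len(on_disk))      # 5
```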

And check the file's size on the harddisk:

:!dir umlaute.txt

(The size is 5 bytes.) That is of course expected: 3 bytes for the text and 2 for the \x0d \x0a.

Ok, so I now set the encoding to

:set enc=utf8

WRONG! You are telling Vim to interpret the existing buffer contents as UTF-8. The buffer contains, in hexadecimal, e4 f6 fc 0d 0a, and the first three bytes do not form valid UTF-8 sequences. You should have typed :set fenc=utf-8. This would have converted the text to UTF-8 when writing the file.
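The difference between reinterpreting bytes and converting them can be shown with a short Python sketch. It is an analogy for what happens inside Vim, not a description of Vim's internals; the variable names are made up for illustration.

```python
latin1_bytes = "äöü".encode("latin-1")  # e4 f6 fc, as stored under enc=latin1

# Reinterpreting: claim the same bytes are UTF-8 (analogous to :set enc=utf8).
try:
    latin1_bytes.decode("utf-8")
    reinterpreted_ok = True
except UnicodeDecodeError:
    reinterpreted_ok = False  # e4 starts a multi-byte sequence, but f6 is no continuation byte

# Converting: decode with the real encoding, then encode as UTF-8
# (analogous to :set fenc=utf-8 followed by :w).
converted = latin1_bytes.decode("latin-1").encode("utf-8")

print(reinterpreted_ok)     # False
print(converted.hex(" "))   # c3 a4 c3 b6 c3 bc
print(len(converted) + 2)   # 8, once \x0d \x0a is added
```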

The buffer gets weird

That's what happens when you force Vim to interpret bytes that are not valid UTF-8 as UTF-8.

I guess this is the hex representation of the ascii characters I previously typed in. So I rewrite them

äöü

Writing, checking size:

:w
:!dir umlaute.txt

This time, it's 8 bytes. I guess that makes sense: 2 bytes for every character plus \x0d \x0a.

Ok, so I want to make sure the next time I open the file it will be opened with encoding=utf8.

:set bomb
:w

:!dir umlaute.txt

11 Bytes. This is of course 8 (previous) Bytes + 3 Bytes for the BOM (ef bb bf).

So I

:quit

vim and open the file again

and check, if the encoding is set:

:set enc

But VIM insists it's encoding=latin1.

You should run :set fenc? to see which encoding Vim detected for your file. And if you want Vim to be able to work with Unicode files, set 'enc' to utf-8 in your vimrc.

Benoit answered Nov 07 '22 04:11


After many attempts, here is a working example:

    setglobal bomb 
    set fileencodings=ucs-bom,utf-8,cp1251,koi8-r,cp866
    set nobin
    set fileencoding=utf-8 bomb

And if you want to create a new file with a BOM:

    c:\> gvim umlaute.txt

it is working now!

Salmaner answered Nov 07 '22 06:11


:help bomb reveals the following information:

When writing a file and the following conditions are met, a BOM (Byte Order Mark) is prepended to the file:

  • this option is on (edit: i.e. ':set bomb')
  • the 'binary' option is off
  • 'fileencoding' is "utf-8", "ucs-2", "ucs-4" or one of the little/big endian variants.

Some applications use the BOM to recognize the encoding of the file. Often used for UCS-2 files on MS-Windows. For other applications it causes trouble, for example: "cat file1 file2" makes the BOM of file2 appear halfway the resulting file. Gcc doesn't accept a BOM.

When Vim reads a file and 'fileencodings' starts with "ucs-bom", a check for the presence of the BOM is done and 'bomb' set accordingly. Unless 'binary' is set, it is removed from the first line, so that you don't see it when editing. When you don't change the options, the BOM will be restored when writing the file.
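The round trip described in the help text (BOM stripped on read, restored on write) can be mimicked in Python with the standard utf-8-sig codec. This is only an analogy under the assumption of a UTF-8 file with \x0d \x0a line endings; it says nothing about how Vim implements it.

```python
import codecs

text = "äöü\r\n"

# Writing with a BOM: utf-8-sig prepends ef bb bf to the UTF-8 bytes.
with_bom = text.encode("utf-8-sig")
print(with_bom[:3] == codecs.BOM_UTF8)  # True
print(len(with_bom))                    # 11 = 3 (BOM) + 6 (text) + 2 (line ending)

# Reading: the BOM is detected and removed, so it is not part of the text.
decoded = with_bom.decode("utf-8-sig")
print(decoded == text)                  # True

# Writing again with the BOM restored yields the original bytes.
rewritten = codecs.BOM_UTF8 + decoded.encode("utf-8")
print(rewritten == with_bom)            # True
```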

So try setting this in your .vimrc:

set fileencodings=ucs-bom,utf-8,latin1
set nobin
setglobal fileencoding=utf-8

Sir Rippov the Maple answered Nov 07 '22 05:11