
Best way to convert text files between character sets?

People also ask

Are .txt files UTF-8?

Not necessarily. Text files on Windows carry no intrinsic encoding information. There's an unofficial convention that a file starting with the BOM code point encoded in UTF-8 is a UTF-8 file, but that convention isn't universally supported. The BOM is the 3-byte sequence "\xef\xbb\xbf", which displays as ï»¿ when read as Latin-1.
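As a minimal sketch of the BOM convention described above, the following Python checks whether some bytes start with the UTF-8 BOM and strips it if present. The function names are made up for illustration; only `codecs.BOM_UTF8` comes from the standard library. A missing BOM does not prove a file is not UTF-8, so treat this as a hint rather than a detector.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes begin with the UTF-8 byte order mark."""
    # codecs.BOM_UTF8 is the three bytes EF BB BF.
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM (unchanged if absent)."""
    if starts_with_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data
```

For example, `starts_with_utf8_bom(open("file.txt", "rb").read(3))` inspects only the first three bytes of a file.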

How do I convert a file to UTF-8?

In a Microsoft Office application such as Word or Excel, open the Save As dialog, click Tools, then select Web Options. Go to the Encoding tab. In the dropdown for "Save this document as:", choose Unicode (UTF-8). Click OK.


Stand-alone utility approach

iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt
-f ENCODING  the encoding of the input
-t ENCODING  the encoding of the output

Both options are optional. Whichever one you omit defaults to the encoding of your current locale, which is usually UTF-8.
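For comparison, here is a rough Python sketch of the same conversion, including the "fall back to the locale encoding" behaviour. The name `convert_file` and its parameters are hypothetical, and using `locale.getpreferredencoding(False)` as the stand-in for iconv's default is an assumption, not a statement about how iconv is implemented.

```python
import locale
from typing import Optional


def convert_file(src: str, dst: str,
                 from_enc: Optional[str] = None,
                 to_enc: Optional[str] = None) -> None:
    """Re-encode src into dst, similar in spirit to `iconv -f ... -t ...`."""
    # Assumption: mimic iconv by defaulting a missing encoding to the locale's.
    default = locale.getpreferredencoding(False)
    from_enc = from_enc or default
    to_enc = to_enc or default

    # newline="" leaves line endings untouched on both the read and write side.
    with open(src, "r", encoding=from_enc, newline="") as fin, \
         open(dst, "w", encoding=to_enc, newline="") as fout:
        # Copy in chunks so large files need not fit in memory.
        for chunk in iter(lambda: fin.read(65536), ""):
            fout.write(chunk)
```

Calling `convert_file("in.txt", "out.txt", "ISO-8859-1", "UTF-8")` corresponds to the iconv command above. Invalid input raises `UnicodeDecodeError` instead of being silently replaced.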


Try VIM

If you have vim, you can use this:

vim +"set nobomb | set fenc=utf8 | x" filename.txt

The cool part is that you don't have to know the source encoding, since vim tries to detect it when it opens the file. This has not been tested for every encoding.

Be aware that this command modifies the file in place.


Explanation part!

  1. + : Tells vim to run a command right after opening the file. Usually used to open a file at a specific line: vim +14 file.txt
  2. | : Separates multiple commands (like ; in bash)
  3. set nobomb : Do not write a UTF-8 BOM
  4. set fenc=utf8 : Set the file encoding to UTF-8
  5. x : Save and close the file
  6. filename.txt : Path to the file
  7. " : The quotes are needed because of the pipes (otherwise bash would treat them as shell pipes)
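Vim can convert without being told the source encoding because it tries the entries of its 'fileencodings' option in order and keeps the first one that decodes the file without errors. The sketch below imitates that idea in Python; the candidate list and both function names are assumptions chosen for illustration, not vim's actual defaults or internals.

```python
from pathlib import Path

# Hypothetical candidate list: UTF-8 (with or without BOM) first, Latin-1 last.
# Latin-1 accepts every byte sequence, so it acts as a catch-all fallback.
CANDIDATES = ("utf-8-sig", "latin-1")


def decode_with_fallback(data: bytes, candidates=CANDIDATES) -> str:
    """Decode with the first candidate encoding that raises no error."""
    for enc in candidates:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def rewrite_as_utf8(path: str) -> None:
    """Rewrite the file in place as UTF-8 without a BOM."""
    p = Path(path)
    text = decode_with_fallback(p.read_bytes())
    # Encoding with plain "utf-8" never emits a BOM (compare `set nobomb`).
    p.write_bytes(text.encode("utf-8"))
```

Like the vim command, `rewrite_as_utf8` overwrites the file, and a wrong guess (for example a Windows-1252 file decoded as Latin-1) goes unnoticed, so keep a backup when the source encoding is uncertain.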

Under Linux you can use the very powerful recode command to convert between different charsets and to fix line-ending issues. recode -l lists all of the formats and encodings that the tool can convert between. Expect a VERY long list.
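To make the "line ending issues" concrete, here is a small Python sketch that normalizes CRLF and lone CR line endings to LF on raw bytes. It is not how recode works internally; the function name is hypothetical, and the approach assumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1 (it would corrupt UTF-16 data).

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF (Windows) and lone CR (old Mac) line endings to LF."""
    # Replace CRLF first so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Running this on the bytes before or after a charset conversion gives Unix line endings in the output file.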


iconv(1)

iconv -f FROM-ENCODING -t TO-ENCODING file.txt

There are also iconv-based tools and bindings for many programming languages.