
Should source code be saved in UTF-8 format

How important is it to save your source code in UTF-8 format?

Eclipse on Windows uses the CP1252 character encoding by default. Saving in CP1252 means bytes that are not valid UTF-8 can end up in a source file, and I have seen this happen when text is copied and pasted from a Word document into a comment.

The reason I ask is that, out of habit, I set up the Maven encoding to be UTF-8, and recently the build has caught a few unmappable character errors.
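To make the failure mode concrete, here is a minimal sketch of how a curly quote saved as CP1252 ends up as a byte sequence that a strict UTF-8 decoder rejects. The class name `Utf8Check` and the sample comment text are made up for illustration; this is not what Maven or javac run internally, only the same kind of check.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical comment with curly quotes (U+201C, U+201D), as pasted from Word.
        String comment = "// see \u201Cdesign notes\u201D";

        // What an editor writes when the file is saved as CP1252.
        byte[] asCp1252 = comment.getBytes(Charset.forName("windows-1252"));
        // What it writes when the file is saved as UTF-8.
        byte[] asUtf8 = comment.getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(asCp1252)); // false
        System.out.println(isValidUtf8(asUtf8));   // true
    }
}
```

In CP1252 the opening curly quote is the single byte 0x93, which in UTF-8 is a continuation byte with nothing to continue, so a build that reads the file as UTF-8 has grounds to complain.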

(update) Please add any reasons for doing so, and why. Are there some common gotchas that should be known?

(update) What is your goal? To find the best practice, so that when asked why we should use UTF-8 I have a good answer. Right now I don't.

JARC asked Feb 01 '10 16:02





2 Answers

What is your goal? Balance your needs against the pros and cons of this choice.

UTF-8 Pros

  • allows use of all character literals without \uHHHH escaping
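For comparison, this is roughly what the escaped form looks like in Java. It is a small sketch with a made-up class name (`EscapeDemo`); with UTF-8 source files, saved and compiled consistently, the same string could be written with a literal é instead of the escape.

```java
import java.nio.charset.StandardCharsets;

class EscapeDemo {
    // Written with an escape, so the source file itself stays pure ASCII.
    static final String ESCAPED = "caf\u00e9";

    public static void main(String[] args) {
        // Four characters in the string...
        System.out.println(ESCAPED.length()); // 4
        // ...but five bytes once encoded as UTF-8, since the accented letter takes two.
        System.out.println(ESCAPED.getBytes(StandardCharsets.UTF_8).length); // 5
    }
}
```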

UTF-8 Cons

  • using non-ASCII character literals without \uHHHH increases risk of character corruption
    • font and keyboard issues can arise
    • need to document and enforce use of UTF-8 in all tools (editors, compilers, build scripts, diff tools)
  • beware the byte order mark
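On the byte order mark: some editors prepend the three bytes EF BB BF to UTF-8 files, and not every tool skips them. The following is only a sketch of detecting and stripping that prefix; `BomStripper` and its method names are hypothetical helpers, not part of any library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BomStripper {
    // The UTF-8 encoding of U+FEFF, which some editors prepend to files.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    static boolean startsWithBom(byte[] bytes) {
        return bytes.length >= UTF8_BOM.length
                && bytes[0] == UTF8_BOM[0]
                && bytes[1] == UTF8_BOM[1]
                && bytes[2] == UTF8_BOM[2];
    }

    // Returns the content without a leading BOM; input without one is returned as-is.
    static byte[] stripBom(byte[] bytes) {
        return startsWithBom(bytes)
                ? Arrays.copyOfRange(bytes, UTF8_BOM.length, bytes.length)
                : bytes;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(startsWithBom(withBom)); // true
        System.out.println(new String(stripBom(withBom), StandardCharsets.UTF_8)); // hi
    }
}
```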

ASCII Pros

  • character/byte mappings are shared by a wide range of encodings
    • makes source files very portable
    • often obviates the need for specifying encoding meta-data (since the files would be identical if they were re-encoded as UTF-8, Windows-1252, ISO 8859-1 and most things short of UTF-16 and/or EBCDIC)
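The portability claim can be checked directly. This sketch (the class name `AsciiPortability` and the sample line are invented for the example) encodes an ASCII-only line in several charsets and compares the resulting bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiPortability {
    // True if the text encodes to exactly the same bytes in both charsets.
    static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    public static void main(String[] args) {
        String source = "int total = a + b; // plain ASCII";
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println(sameBytes(source, StandardCharsets.US_ASCII, StandardCharsets.UTF_8));   // true
        System.out.println(sameBytes(source, StandardCharsets.UTF_8, cp1252));                      // true
        System.out.println(sameBytes(source, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)); // true
        // UTF-16 uses two bytes per character, so the bytes differ.
        System.out.println(sameBytes(source, StandardCharsets.UTF_8, StandardCharsets.UTF_16));     // false
    }
}
```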

ASCII Cons

  • limited character set
  • this isn't the 1960s

Note: ASCII is 7-bit, not "extended" and not to be confused with Windows-1252, ISO 8859-1, or anything else.

McDowell answered Oct 14 '22 13:10



At the very least, you need to be consistent with the encoding you use, to avoid red herrings. That means not X here, Y there and Z elsewhere. Save source code in encoding X. Set code input to encoding X. Set code output to encoding X. Set character-based FTP transfer to encoding X. Et cetera.
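In Java code, one way to keep that consistency is to name the charset explicitly on every read and write instead of relying on the platform default. A minimal sketch, assuming UTF-8 as encoding X; the class `ConsistentEncoding`, its methods and the temp file are all made up for the example.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ConsistentEncoding {
    // The single encoding used everywhere (assumption: UTF-8).
    static final Charset ENCODING = StandardCharsets.UTF_8;

    static void save(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(ENCODING));
    }

    static String load(Path file) throws IOException {
        return new String(Files.readAllBytes(file), ENCODING);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("encoding-demo", ".txt");
        String text = "na\u00efve"; // contains one non-ASCII character

        save(file, text);
        // Same encoding on both sides: the text survives the round trip.
        System.out.println(load(file).equals(text)); // true

        // A different encoding on the reading side silently corrupts it.
        String wrong = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(text)); // false

        Files.delete(file);
    }
}
```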

Nowadays UTF-8 is a good choice, as it covers every character the human world is aware of and is supported pretty much everywhere. So, yes, I would set the workspace encoding to it as well. That is what I use myself.

BalusC answered Oct 14 '22 12:10
