How important is it to save your source code in UTF-8 format?
Eclipse on Windows uses CP1252 character encoding by default. The CP1251 format means non UTF-8 characters can be saved and I have seen this happen if you copy and paste from a Word document for a comment.
The reason I ask is because out of habit I set-up Maven encoding to be in UTF-8 format and recently it has caught a few non mappable errors.
(update) Please add any reasons for doing so and why, are there some common gotchas that should be known?
(update) What is your goal? To find the best practice so when ask why should we use UTF-8 I have a good answer, right now I don't.
The answer is that UTF-8 is by far the best general-purpose data interchange encoding, and is almost mandatory if you are using any of the other protocols that build on it (mail, XML, HTML, etc). However, UTF-8 is a multi-byte encoding and relatively new, so there are lots of situations where it is a poor choice.
You can make sure TextEdit saves files in Unicode (UTF-8) by going to TextEdit > Preferences… > Open and Save, and making sure the Save As setting is “Unicode (UTF-8)”.
Why use UTF-8? An HTML page can only be in one encoding. You cannot encode different parts of a document in different encodings. A Unicode-based encoding such as UTF-8 can support many languages and can accommodate pages and forms in any mixture of those languages.
All characters in ASCII can be encoded using UTF-8 without an increase in storage (both requires a byte of storage). UTF-8 has the added benefit of character support beyond "ASCII-characters".
What is your goal? Balance your needs against the pros and cons of this choice.
UTF-8 Pros
\uHHHH
escapingUTF-8 Cons
\uHHHH
increases risk of character corruption ASCII Pros
ASCII Cons
Note: ASCII is 7-bit, not "extended" and not to be confused with Windows-1252, ISO 8859-1, or anything else.
Important is at least that you need to be consistent with the encoding used to avoid herrings. Thus not, X here, Y there and Z elsewhere. Save source code in encoding X. Set code input to encoding X. Set code output to encoding X. Set characterbased FTP transfer to encoding X. Etcetera.
Nowadays UTF-8
is a good choice as it covers every character the human world is aware of and is pretty everywhere supported. So, yes, I would set workspace encoding to it as well. I also use it so.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With