I have a test site that has been using windows-1252 all along. They do need/use some symbols like the square root symbol. And they have no need to display in another language other than English. I was recently asked to switch it to UTF-8 because of some security concerns. After I changed it to UTF-8 the square roots and other symbols (which are being pulled out of an Oracle DB and passed through ColdFusion) would appear fine on the resulting web page. However, if I saved the document again (post to DB, page refreshes) the symbols transformed into strange characters. If I saved again even more strange characters would appear. So...
I've already read all these pages, but I'm still having a little trouble grasping it all. Hoping someone here can help clarify for me. Thanks!
* * * UPDATE * * *
I appreciate all the help so far to make this easier to understand. I'll simplify the original 3 questions so hopefully a clear answer can be reached, so here it is: The customer doesn't need support for other languages, they will be using some HTML5 tags and a TON of JSON/XML traffic sent back and forth via jQuery.ajax(). Given that info, from a security standpoint, is there anything wrong with keeping the database set to NLS_CHARACTERSET: WE8MSWIN1252
and the webpages set to <CFHEADER NAME="Content-Type" value="text/html; charset=windows-1252">
? Thank you.
Here is another question that is a slight spin off from this one: Why am I able to use a character that's not part of a charset (windows-1252)?.
Windows-1252 assigns characters to bytes 128 through 255, and UTF-8 encodes those same characters differently, as multi-byte sequences. Every character in the ASCII range (127 and below) is encoded identically in both. So while you can convert between the two, a CP-1252 string is not guaranteed to be a valid UTF-8 string.
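To make that concrete, here is a minimal Python sketch (Python is only used for illustration; the helper name is_valid_utf8 is made up for this example). It shows that ASCII text produces the same bytes in both encodings, while a character above the ASCII range does not, and that its CP-1252 byte is rejected by a UTF-8 decoder.

```python
# Bytes 0x00-0x7F mean the same thing in both encodings.
ascii_text = "sqrt(2) = 1.414"
same_bytes = ascii_text.encode("cp1252") == ascii_text.encode("utf-8")

# Above 0x7F the two diverge: the same character gets different bytes.
e_acute = "\u00e9"                        # the letter e with an acute accent
cp1252_bytes = e_acute.encode("cp1252")   # one byte: 0xE9
utf8_bytes = e_acute.encode("utf-8")      # two bytes: 0xC3 0xA9


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(same_bytes)                   # True
print(is_valid_utf8(cp1252_bytes))  # False: a lone 0xE9 is not valid UTF-8
print(is_valid_utf8(utf8_bytes))    # True
```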
Windows-1252 or CP-1252 (code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German.
The answer is that UTF-8 is by far the best general-purpose data interchange encoding, and is almost mandatory if you are using any of the other protocols that build on it (mail, XML, HTML, etc). However, UTF-8 is a multi-byte encoding and relatively new, so there are lots of situations where it is a poor choice.
ANSI encoding is a slightly generic term used to refer to the standard code page on a system, usually Windows. It is more properly referred to as Windows-1252 on Western/U.S. systems. (It can represent certain other Windows code pages on other systems.)
Windows-1252 is one of many fixed-size character sets. Mac has its own set, and there are several ISO sets for various parts of Europe and other parts of the world. Most of them differ only slightly from one another.
The good point is that you have a fixed-size character, meaning 1 character = 1 byte no matter what.
The bad points are:
That includes any quotation you would like to make. In Windows-1252 you can't display Russian, Greek, Polish ...
UTF-8 is the standard variable-width encoding for Unicode, using 1 to 4 bytes per character. It can represent every Unicode character, although it is most compact for Latin-based languages; other languages take more storage space.
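As a rough illustration of the size trade-off, the following Python sketch compares how many bytes a few sample characters take in each encoding. The function name byte_lengths and the sample characters are chosen for this example only.

```python
from typing import Optional, Tuple


def byte_lengths(ch: str) -> Tuple[int, Optional[int]]:
    """Return (UTF-8 length, Windows-1252 length or None if not encodable)."""
    utf8_len = len(ch.encode("utf-8"))
    try:
        cp1252_len = len(ch.encode("cp1252"))
    except UnicodeEncodeError:
        cp1252_len = None  # the character does not exist in Windows-1252
    return utf8_len, cp1252_len


samples = {
    "A": "Latin capital A",
    "\u00e9": "e with acute accent",
    "\u20ac": "euro sign",
    "\u0416": "Cyrillic capital Zhe",
    "\u221a": "square root sign",
}

for ch, name in samples.items():
    print(name, byte_lengths(ch))
```

Under these assumptions, ASCII characters take one byte either way, accented Latin letters take two bytes in UTF-8, and the Cyrillic letter and the square root sign cannot be stored in Windows-1252 at all.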
It is used in XML, JSON, and most types of web services you may find. It is a good default when you don't know what encoding to use. It limits the number of encoding issues, such as "I thought you were in Latin-1 / No, I was using Latin-9, but then this guy on a Mac used Roman". If you have more than one person working on the content of the website, they may have different encodings on their platforms, and therefore your content may be messed up at some point.
UTF-8 is, as far as I know, the only way to easily standardize the encoding used between people without discussion.
A typical example: if your website is encoded in Windows-1252 and the new dev has a Mac, you'll probably be in trouble.
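The kind of trouble this causes looks a lot like the symptom in the question, where each save made the symbols stranger. Here is a small Python sketch of one plausible mechanism, assuming that somewhere in the chain UTF-8 bytes are read back as Windows-1252 on every round trip. The function misread_as_cp1252 is hypothetical, and the real ColdFusion/Oracle path may differ.

```python
def misread_as_cp1252(text: str) -> str:
    """Simulate one bad round trip: write as UTF-8, read back as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


original = "\u221a"                    # the square root sign
once = misread_as_cp1252(original)     # 3 junk characters
twice = misread_as_cp1252(once)        # 6 junk characters

print(len(original), len(once), len(twice))  # 1 3 6

# As long as no bytes were dropped, the damage can be undone by reversing each step.
repaired = twice.encode("cp1252").decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == original)            # True
```

The text grows with every save because each junk character is itself re-encoded as two or three UTF-8 bytes the next time around.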
You claim that Windows-1252 offers everything you need but the √ symbol is a counter-example. You must be using one of these tricks: √, √ or similar. In either case, your solution is not portable: stuff will only display correctly in a properly configured web browser. Everything else (database, JavaScript, text files, plain text e-mail messages...) will not contain the real data.
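A quick way to check the counter-example is the Python sketch below. It assumes the "tricks" are HTML character references such as &radic; or &#8730; (the original examples did not survive on this page, so that is a guess), and shows that the symbol itself cannot be stored in Windows-1252 while the reference is plain ASCII that only an HTML parser turns back into √.

```python
import html

symbol = "\u221a"  # the square root sign

# The character itself has no byte in Windows-1252.
try:
    symbol.encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Python can fall back to a numeric character reference, which is pure ASCII.
as_reference = symbol.encode("cp1252", errors="xmlcharrefreplace")

# Only something that understands HTML turns the reference back into the symbol.
from_named = html.unescape("&radic;")
from_numeric = html.unescape(as_reference.decode("ascii"))

print(encodable)       # False
print(as_reference)    # b'&#8730;'
print(from_named == symbol, from_numeric == symbol)  # True True
```

Anything that is not an HTML renderer (a database column, a text file, a plain text e-mail) just sees the seven ASCII characters of the reference.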
Additionally, JSON's only encoding is UTF-8. JavaScript will normally make the conversions for you but you must ensure that your whole tool chain behaves similarly.
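For illustration, here is how one JSON library handles this, sketched with Python's standard json module (the payload is made up). By default it escapes non-ASCII characters as \uXXXX, which survives any ASCII-compatible transport; with ensure_ascii=False the output has to be sent as UTF-8 bytes.

```python
import json

payload = {"label": "\u221a2"}  # made-up example: "square root of 2"

# Default: non-ASCII characters are escaped, so the JSON text is pure ASCII.
escaped = json.dumps(payload)
print(escaped)  # {"label": "\u221a2"}

# With ensure_ascii=False the raw character is kept and must travel as UTF-8.
raw_utf8 = json.dumps(payload, ensure_ascii=False).encode("utf-8")

# Both forms decode to the same data on the receiving side.
print(json.loads(escaped) == payload)   # True
print(json.loads(raw_utf8) == payload)  # True
```

Whether the other pieces of a given tool chain (jQuery.ajax(), ColdFusion, the Oracle driver) escape or pass through raw characters is something to verify for each of them.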
So to answer your main question: there's nothing wrong in using Windows-1252 if that's all you need. The problem is that you already need more than it can offer.
As for your problems with UTF-8, it's obvious that UTF-8 is a full Unicode encoding so it does meet all the requirements. (Not being able to make it work can be your reason to dump it but it isn't a technical reason.) My guess is that, since your current data doesn't have actual square root symbols, switching encodings breaks the trick you were using. You need to: