Anything wrong with using windows-1252 instead of UTF-8

I have a test site that has been using windows-1252 all along. They do need/use some symbols like the square root symbol. And they have no need to display in another language other than English. I was recently asked to switch it to UTF-8 because of some security concerns. After I changed it to UTF-8 the square roots and other symbols (which are being pulled out of an Oracle DB and passed through ColdFusion) would appear fine on the resulting web page. However, if I saved the document again (post to DB, page refreshes) the symbols transformed into strange characters. If I saved again even more strange characters would appear. So...

  1. If I don't need anything other than English is there anything wrong with sticking to windows-1252? Any security/hacking issues?
  2. Are there any implications of NOT using UTF-8 if you are using HTML5 (since that is the default encoding for HTML5)?
  3. If it's recommended that I switch to UTF-8, how do I get the currently stored square root symbols (and other symbols) to work?

I've already read all these pages, but I'm still having a little trouble grasping it all. Hoping someone here can help clarify it for me. Thanks!

  1. https://www.owasp.org/index.php/Canonicalization,_locale_and_Unicode
  2. Excellent description of how UTF-8 came about, why it’s awesome, and the problems it solves… https://www.youtube.com/watch?v=MijmeoH9LT4
  3. http://www.w3.org/International/questions/qa-choosing-encodings “Use UTF-8, if you can”. “In fact the HTML5 specification draft currently says "Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. Authoring tools should default to using UTF-8 for newly-created documents."”
  4. http://www.w3schools.com/tags/ref_charactersets.asp “For HTML5, the default character encoding is UTF-8.”
  5. http://www.joelonsoftware.com/articles/Unicode.html

* * * UPDATE * * *

I appreciate all the help so far in making this easier to understand. I'll simplify the original 3 questions so that hopefully a clear answer can be reached. The customer doesn't need support for other languages; they will be using some HTML5 tags and a TON of JSON/XML traffic sent back and forth via jQuery.ajax(). Given that info, from a security standpoint, is there anything wrong with keeping the database set to NLS_CHARACTERSET: WE8MSWIN1252 and the webpages set to <CFHEADER NAME="Content-Type" value="text/html; charset=windows-1252">? Thank you.

Here is another question that is a slight spin off from this one: Why am I able to use a character that's not part of a charset (windows-1252)?.

gfrobenius asked Jan 31 '14 21:01


2 Answers

Windows-1252 is one of the many, many fixed-size character sets. Mac has its own set, and there are several ISO sets for various parts of Europe and for some other parts of the world. Most of them differ only slightly from one another.

The good point is that you have a fixed-size character, meaning 1 character = 1 byte no matter what.

The bad points are:

  • Some people may not have your encoding installed
  • Some people may use a slightly different encoding, resulting in a few issues that are hard to spot at first but very ugly in the long run
  • You can only support a few languages

That includes any quotation you would like to make. In windows-1252 you can't display Russian, Greek, Polish...

UTF-8 is the standard variable-width encoding for Unicode, using 1 to 4 bytes per character. It can represent every Unicode character, although it is most compact for Latin-based text: ASCII characters take a single byte, while characters from other scripts take more storage space.
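To make the size difference concrete, here is a minimal Python sketch (Python is only used for illustration here, and `byte_length` is a hypothetical helper name, not something from the question). It reports how many bytes a few characters need in each encoding, with `None` meaning the encoding has no slot for that character at all.

```python
# Compare how many bytes each character needs in the two encodings.
samples = ["A", "\u00e9", "\u221a", "\U0001F600"]  # A, e-acute, square root, emoji

def byte_length(ch, encoding):
    """Return the encoded length of ch, or None if the encoding lacks it."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

for ch in samples:
    # Print the code point rather than the character so the output is
    # readable on any console encoding.
    print(f"U+{ord(ch):04X}", byte_length(ch, "cp1252"), byte_length(ch, "utf-8"))
```

Plain ASCII costs one byte either way, so an English-only site pays almost nothing in storage for switching to UTF-8.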

It is used in XML, JSON, and most types of web services you may find. It is a good default when you don't know what encoding to use. It limits the number of encoding issues, such as "I thought you were in Latin-1 / No, I was using Latin-9, but then this guy on a Mac used Roman". If you have more than one person working on the content of the website, they may have different encodings on their platforms, and therefore your content may be messed up at some point.

UTF-8 is, as far as I know, the only way to easily standardize the encoding used between people without discussion.

A typical example: if your website is encoded in windows-1252 and the new dev has a Mac, you'll probably be in trouble.
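A rough Python sketch of what such a mismatch does to the bytes (the `misread` helper is a hypothetical name, and this is only an illustration, not a diagnosis of any particular setup). The second half shows UTF-8 bytes being read back as windows-1252 twice in a row, which may be related to the "more strange characters on every save" symptom in the question, though that is an assumption.

```python
def misread(text, written_as, read_as):
    """Encode text in one encoding and decode the same bytes as another."""
    return text.encode(written_as).decode(read_as, errors="replace")

# A windows-1252 e-acute opened as Mac Roman silently becomes another letter.
mac_view = misread("\u00e9", "cp1252", "mac_roman")

# UTF-8 bytes read as windows-1252: one square root sign becomes three
# characters, and repeating the mistake turns those three into six.
once = misread("\u221a", "utf-8", "cp1252")
twice = misread(once, "utf-8", "cp1252")

print(ascii(mac_view), ascii(once), ascii(twice))
```

Nothing raises an error in either case, which is why these problems tend to be noticed late.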

njzk2 answered Nov 15 '22 04:11


You claim that Windows-1252 offers everything you need but the √ symbol is a counter-example. You must be using one of these tricks:

  • HTML entities: &radic;, &#8730; or similar
  • Print another character and change the font

In either case, your solution is not portable: stuff will only display correctly in a properly configured web browser. Everything else (database, JavaScript, text files, plain text e-mail messages...) will not contain the real data.
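This is easy to check outside the browser. The following Python sketch (illustrative only; the variable names are made up) confirms that the square root sign has no windows-1252 byte, and that the entity forms are just ASCII text that something else has to decode.

```python
import html

SQRT = "\u221a"  # the square root sign

# windows-1252 has no byte for this character, so encoding fails.
try:
    SQRT.encode("cp1252")
    in_cp1252 = True
except UnicodeEncodeError:
    in_cp1252 = False

# The entity forms are plain ASCII; an HTML-aware consumer decodes them.
entity_decoded = html.unescape("&radic;")
numeric_decoded = html.unescape("&#8730;")

# Going the other way: replace unencodable characters with numeric references.
as_entity = SQRT.encode("ascii", errors="xmlcharrefreplace")

print(in_cp1252, ascii(entity_decoded), ascii(numeric_decoded), as_entity)
```

A consumer that does not speak HTML (a report, a CSV export, a plain-text e-mail) would see the literal text `&#8730;` rather than the symbol.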

Additionally, JSON exchanged between systems is expected to be UTF-8 (the current JSON specification, RFC 8259, requires it). JavaScript will normally make the conversions for you, but you must ensure that your whole tool-chain behaves the same way.
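As a small illustration in Python (not the ColdFusion/jQuery stack from the question, so treat it as a sketch of the idea only), a JSON serializer can either escape non-ASCII characters or emit them as raw UTF-8 bytes:

```python
import json

payload = {"symbol": "\u221a"}

# Default: non-ASCII characters are written as \uXXXX escapes, so the
# output is pure ASCII.
escaped = json.dumps(payload)

# Alternative: keep the real character and send the text as UTF-8 bytes.
raw = json.dumps(payload, ensure_ascii=False).encode("utf-8")

# Both forms decode back to the same data.
same = json.loads(escaped) == json.loads(raw.decode("utf-8")) == payload

print(escaped, raw, same)
```

The escaped form survives any ASCII-compatible transport, which can hide an encoding mismatch until some component in the chain emits the raw form instead.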

So to answer your main question: there's nothing wrong with using Windows-1252 if that's all you need. The problem is that you already need more than it can offer.

As for your problems with UTF-8: it is a full Unicode encoding, so it does meet all the requirements. (Not being able to make it work can be your reason to dump it, but it isn't a technical reason.) My guess is that, since your current data doesn't contain actual square root symbols, switching encodings breaks the trick you were using. You need to:

  1. Find out what the current data looks like
  2. Run a one-time search and replace
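
The two steps above could be sketched like this in Python, assuming the stored trick turns out to be HTML entities (that is only my guess, and `find_entities` / `replace_entities` are hypothetical helper names). Entities for markup characters are left alone so that stored HTML stays valid.

```python
import html
import re

# Named, decimal and hexadecimal character references.
ENTITY = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

# Characters that must stay escaped inside HTML.
MARKUP_CHARS = set("<>&\"'")

def find_entities(text):
    """Step 1: list the distinct entities present in a stored value."""
    return sorted(set(ENTITY.findall(text)))

def _decode(match):
    decoded = html.unescape(match.group(0))
    # Keep the original reference if it stands for a markup character.
    return match.group(0) if decoded in MARKUP_CHARS else decoded

def replace_entities(text):
    """Step 2: swap entities for the real Unicode characters."""
    return ENTITY.sub(_decode, text)

stored = "x = &radic;4 &amp; y &lt; &#8730;9"
print(find_entities(stored))
print(ascii(replace_entities(stored)))
```

In practice you would run the first function over a sample of rows before changing anything, and only write back once the column and connection are really UTF-8.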

Álvaro González answered Nov 15 '22 05:11