 

Is there a drastic difference between UTF-8 and UTF-16?

I call a web service that gives me back a response XML with UTF-8 encoding. I verified that in Java using the getAllHeaders() method.

In my Java code, I take that response, do some processing on it, and later pass it on to a different service.

I googled a bit and found out that Java strings use UTF-16 encoding internally by default.

In my response XML, one of the elements contained the character É. This got garbled in the post-processing request that I make to the other service.

Instead of sending É, it sent gibberish. Is there really a big difference between these two encodings? And if I want to see what É converts to between UTF-8 and UTF-16, how can I do that?
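To see exactly what É becomes in each encoding, you can encode it with each charset and inspect the bytes. A minimal Java sketch (class and variable names are mine, for illustration; StandardCharsets requires Java 7+):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ShowBytes {
    public static void main(String[] args) {
        String s = "É"; // U+00C9, LATIN CAPITAL LETTER E WITH ACUTE
        // UTF-8 encodes É as two bytes: 0xC3 0x89
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));   // [-61, -119]
        // UTF-16 (big-endian, no BOM) encodes É as one 16-bit unit: 0x00 0xC9
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE))); // [0, -55]
    }
}
```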

asked Mar 14 '14 by Kraken


People also ask

Is UTF-16 better than UTF-8?

UTF-16 can be more compact where ASCII is not predominant, since it uses 2 bytes for most characters. UTF-8 starts to use 3 or more bytes for higher code points, where UTF-16 remains at just 2 bytes for most characters.

Why is UTF-16 not used?

You can use UTF-32 internally and UTF-8 for serialization. UTF-16 then offers few benefits: it is endian-dependent, it is variable-length anyway, it takes more space than UTF-8 for ASCII-heavy text, and it is not ASCII-compatible.

Should I always use UTF-8?

If you need to write a program that performs heavy string manipulation, must be very fast, and you are sure you won't need exotic characters, UTF-8 may not be the best idea. In every other situation, UTF-8 should be the standard. UTF-8 works well with almost all recent software, even on Windows.

Why is UTF-8 the best?

UTF-8 is the de facto standard character encoding for Unicode. Like UTF-16 and UTF-32, it can represent every character in the Unicode character set, but unlike them, it has the advantage of being backward-compatible with ASCII.


1 Answer

Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character occupies a minimum of 8 bits (one byte), while in UTF-16 it occupies a minimum of 16 bits (one 16-bit code unit).

Main UTF-8 pros:

  1. Basic ASCII characters like digits and unaccented Latin letters occupy one byte, identical to their US-ASCII representation. This makes every US-ASCII string valid UTF-8, which provides decent backwards compatibility in many cases (see the sketch after this list).
  2. The encoding produces no null bytes (except for U+0000 itself), which allows the use of null-terminated strings; this adds a great deal of backwards compatibility too.
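A minimal Java sketch demonstrating both points (names are mine, for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompat {
    public static void main(String[] args) {
        String ascii = "plain ASCII text 123";
        // A pure-ASCII string produces identical bytes under US-ASCII and UTF-8
        byte[] asUsAscii = ascii.getBytes(StandardCharsets.US_ASCII);
        byte[] asUtf8 = ascii.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(asUsAscii, asUtf8)); // true
        // ...and none of those bytes is zero, so C-style null termination still works
    }
}
```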

Main UTF-8 cons:

  1. Many common characters have different lengths, so indexing into a string by character position and counting its characters require scanning from the start, which is slow (see the sketch below).
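A short sketch of the mismatch between character count and byte offset (example string is mine):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Indexing {
    public static void main(String[] args) {
        String s = "naïve";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(s.codePointCount(0, s.length())); // 5 characters
        System.out.println(utf8.length);                     // 6 bytes: ï needs two
        // The Nth character does not sit at byte offset N, so random access
        // into UTF-8 bytes requires scanning from the start of the string.
    }
}
```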

Main UTF-16 pros:

  1. Most common characters, including Latin, Cyrillic, Chinese, and Japanese, can be represented with 2 bytes. Unless really exotic characters (outside the Basic Multilingual Plane) are needed, this 16-bit subset of UTF-16 can be treated as a fixed-length encoding, which speeds up indexing (see the sketch below).
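A sketch of where the fixed-length assumption holds and where it breaks (examples are mine; Java chars are UTF-16 code units):

```java
public class Utf16Units {
    public static void main(String[] args) {
        String bmp = "日本語"; // all three characters are in the Basic Multilingual Plane
        System.out.println(bmp.length());                           // 3 code units, 3 characters

        String clef = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
        System.out.println(clef.length());                          // 2 code units (a surrogate pair)
        System.out.println(clef.codePointCount(0, clef.length()));  // but only 1 character
    }
}
```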

Main UTF-16 cons:

  1. Encoding US-ASCII text produces lots of null bytes, which rules out null-terminated strings and wastes memory (see the sketch below).
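A quick sketch showing the null bytes (example string is mine):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16Nulls {
    public static void main(String[] args) {
        byte[] b = "Hi".getBytes(StandardCharsets.UTF_16BE);
        // Every ASCII character carries a leading zero byte in UTF-16
        System.out.println(Arrays.toString(b)); // [0, 72, 0, 105]
    }
}
```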

In general, UTF-16 is often better for in-memory representation, while UTF-8 is extremely good for text files and network protocols.
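As for É getting garbled: the usual cause is decoding or encoding with the platform default charset instead of the one the service declares. A minimal sketch of the fix, assuming the response really is UTF-8 as the header says (variable names are hypothetical, not from the question):

```java
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        // Stand-in for the raw UTF-8 bytes of the web service response
        byte[] responseBytes = "<name>É</name>".getBytes(StandardCharsets.UTF_8);

        // Risky: new String(responseBytes) uses the platform default charset,
        // which may not be UTF-8 and can turn É into gibberish such as "Ã‰".
        // Safe: decode with the charset the response header declares:
        String text = new String(responseBytes, StandardCharsets.UTF_8);

        // The Java String is UTF-16 internally; that stays invisible as long
        // as every decode and encode step names its charset explicitly.
        byte[] onwardRequest = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(onwardRequest, StandardCharsets.UTF_8)); // <name>É</name>
    }
}
```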

answered Sep 16 '22 by Arjun Chaudhary