
Convert escaped unicode (\u008E) to accented character (Ž) in Ruby?

Tags: ruby, encoding

I am having a very difficult time with this:

# contained within:
"MA\u008EEIKIAI"

# should be
"MAŽEIKIAI"

# nature of string
$ p string3
"MA\u008EEIKIAI" 

$ puts string3
MAEIKIAI

$ string3.inspect
"\"MA\\u008EEIKIAI\""

$ string3.bytes
#<Enumerator: "MA\u008EEIKIAI":bytes> 

Any ideas on where to start?

Note: this is not a duplicate of my previous question.

Damien Roche asked Jun 11 '13 12:06




2 Answers

\u008E means that the Unicode character with code point 8E (in hex) appears at that point in the string. This character is the control character “SINGLE SHIFT TWO” (see the code chart (pdf)). The character Ž is at code point U+017D. However, it is at position 8E in the Windows CP-1252 encoding. Somehow you’ve got your encodings mixed up.
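To make the mix-up concrete, here is a small sketch that inspects both characters. The variable names are illustrative only, and it assumes a Ruby whose default source encoding is UTF-8 (2.0 or later):

```ruby
# U+008E is a C1 control character; in UTF-8 it is the two bytes C2 8E.
control = "\u008E"
control_codepoint = control.ord.to_s(16)               # "8e"
control_bytes = control.bytes.map { |b| b.to_s(16) }   # ["c2", "8e"]

# The single byte 0x8E, read as Windows-1252 (CP-1252), is Ž (U+017D).
raw_byte = [0x8E].pack('C')                            # one-byte binary string
z_caron = raw_byte.force_encoding('Windows-1252').encode('UTF-8')
z_caron_codepoint = z_caron.ord.to_s(16)               # "17d"

puts "U+008E  -> bytes #{control_bytes.join(' ')}"
puts "CP-1252 byte 8e -> U+#{z_caron_codepoint.upcase}"
```

The same numeric value, 8E, names a control character as a Unicode code point and Ž as a CP-1252 byte, which is consistent with the string having been decoded with the wrong encoding somewhere upstream.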

The easiest way to “fix” this is probably just to open the file containing the string (or the database record or whatever) and edit it to be correct. The real solution will depend on where the string in question came from and how many bad strings you have.

Assuming the string is in UTF-8 encoding, \u008E will consist of the two bytes c2 and 8e. Note that the second byte, 8e, is the same as the encoding of Ž in CP-1252. One way to convert the string would be something like this:

string3.force_encoding('BINARY') # treat the string just as bytes for now
string3.gsub!(/\xC2/n, '')       # remove the C2 byte
string3.force_encoding('CP1252') # give the string the correct encoding
string3.encode!('UTF-8')         # convert in place to the desired encoding (plain encode returns a new string)

Note that this isn’t a general solution to fix all issues like this. Not all CP-1252 characters, when mangled and expressed in UTF-8 this way, will be amenable to a conversion like this. Some will be two bytes c2 xx where xx is the correct byte (as in this case); others will be c3 yy where yy is a different byte.
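A sketch of a variant that handles both the c2 xx and c3 yy cases, assuming the damage came from CP-1252 bytes being read as ISO-8859-1. The helper name fix_cp1252_mojibake is hypothetical, not something from the question:

```ruby
# Hypothetical helper: undo a "CP-1252 bytes misread as ISO-8859-1" mix-up.
# Assumes every character in str is in the range U+0000..U+00FF.
def fix_cp1252_mojibake(str)
  bytes = str.encode('ISO-8859-1')       # map each character back to one byte
  bytes.force_encoding('Windows-1252')   # reinterpret those bytes as CP-1252
  bytes.encode('UTF-8')                  # convert to real UTF-8
end

fixed = fix_cp1252_mojibake("MA\u008EEIKIAI")
puts fixed
```

This raises Encoding::UndefinedConversionError if the string contains a character outside U+0000..U+00FF, or one of the few byte values that CP-1252 leaves undefined (for example 0x81), so it is only suitable when the whole string was mangled in the same way.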

matt answered Sep 30 '22 09:09



What about using Regexp & Array#pack to convert the Unicode escape?

str = "MA\\u008EEIKIAI"
puts str    #=> MA\u008EEIKIAI

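str.gsub!(/\\u(\h{4})/) do       # \h matches a single hex digit
  [$1.to_i(16)].pack('U')        # code point -> UTF-8 character
end
puts str    #=> MAEIKIAI (U+008E is an invisible control character, not Ž)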
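If the goal is Ž rather than the control character, the unescaping step can be combined with a CP-1252 lookup. This is only a sketch, under the assumption that escapes in the C1 control range (U+0080 to U+009F) are really CP-1252 byte values. That assumption fits this question's data but is not true in general:

```ruby
# Assumption: the string holds a literal backslash-u escape, as above.
str = "MA\\u008EEIKIAI"

converted = str.gsub(/\\u(\h{4})/) do
  code = $1.hex
  if code.between?(0x80, 0x9F)
    # C1 control range: treat the value as a CP-1252 byte instead.
    [code].pack('C').force_encoding('Windows-1252').encode('UTF-8')
  else
    # Anything else is taken as a genuine Unicode code point.
    [code].pack('U')
  end
end

puts converted
```

Only the range 0x80 to 0x9F is remapped because that is where CP-1252 and ISO-8859-1 differ. A value that CP-1252 leaves undefined (such as 0x81) will raise Encoding::UndefinedConversionError.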
Arie Xiao answered Sep 30 '22 09:09
