I have the following string:
l\u0092issue
My question is how to convert it to utf8 characters ?
I have tried that
1.9.3p484 :024 > "l\u0092issue".encode('utf-8')
=> "l\u0092issue"
Strings can be converted to arrays using a combination of the split method and some regular expressions. The split method serves to break up the string into distinct parts that can be placed into array element. The regular expression tells split what to use as the break point during the conversion process.
Ruby has the method Encoding. default_external which defines what the current operating systems default encoding is. Ruby defaults to UTF-8 as its encoding so if it is opening up files from the operating system and the default is different from UTF-8, it will transcode the input from that encoding to UTF-8.
The equivalent in C# is the String class. According to MSDN: (A String) Represents text as a series of Unicode characters. So, if you do string str = "a string here"; , you have a Unicode string.
You seem to have got your encodings into a bit of a mix up. If you haven’t already, you should first read Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) which provides a good introduction into this type of thing. There is a good set of articles on how Ruby handles character encodings at http://graysoftinc.com/character-encodings/understanding-m17n-multilingualization. You could also have a look at the Ruby docs for String and Encoding.
In this specific case, the string l\u0092issue
means that the second character is the character with the unicode codepoint 0x92. This codepoint is PRIVATE USE TWO
(see the chart), which basically means this position isn’t used.
However, looking at the Windows CP-1252 encoding, position 0x92 is occupied by the character ’
, so if this is the missing character the the string would be l’issue
, whick looks a lot more likely even though I don’t speak French.
What I suspect has happened is your program has received the string l’issue
encoded in CP-1252, but has assumed it was encoded in ISO-8859-1 (ISO-8859-1 and CP-1252 are quite closely related) and re-encoded it to UTF-8 leaving you with the string you now have.
The real fix for you is to be careful about the encodings of any strings that enter (and leave) your program, and how you manage them.
To transform your string to l’issue
, you can encode
it back to ISO-8859-1
, then use force_encoding
to tell Ruby the real encoding of CP-1252, and then you can re-encode to UTF-8:
2.1.0 :001 > s = "l\u0092issue"
=> "l\u0092issue"
2.1.0 :002 > s = s.encode('iso-8859-1')
=> "l\x92issue"
2.1.0 :003 > s.force_encoding('cp1252')
=> "l\x92issue"
2.1.0 :004 > s.encode('utf-8')
=> "l’issue"
This is only really a demonstration of what is going on though. The real solution is to make sure you’re handling encodings correctly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With