
Convert a unicode string to characters in Ruby?

Tags: string, ruby

I have the following string:

l\u0092issue

How can I convert it to UTF-8 characters?

I have tried this:

1.9.3p484 :024 > "l\u0092issue".encode('utf-8')
 => "l\u0092issue" 
asked by Bolo, Jan 16 '14 20:01




1 Answer

You seem to have got your encodings into a bit of a mix-up. If you haven’t already, you should first read Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), which provides a good introduction to this type of thing. There is a good set of articles on how Ruby handles character encodings at http://graysoftinc.com/character-encodings/understanding-m17n-multilingualization. You could also have a look at the Ruby docs for String and Encoding.

In this specific case, the string l\u0092issue means that the second character is the character with the Unicode codepoint 0x92. This codepoint is PRIVATE USE TWO (see the chart), a C1 control code with no standard meaning, which basically means this position isn’t used for text.

However, looking at the Windows CP-1252 encoding, position 0x92 is occupied by the character ’ (RIGHT SINGLE QUOTATION MARK, U+2019), so if this is the missing character then the string would be l’issue, which looks a lot more likely even though I don’t speak French.

What I suspect has happened is that your program received the string l’issue encoded in CP-1252, but assumed it was encoded in ISO-8859-1 (ISO-8859-1 and CP-1252 are quite closely related) and re-encoded it to UTF-8, leaving you with the string you now have.
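As a rough illustration of that suspicion, the sketch below starts from the raw CP-1252 bytes and decodes them under each assumption. The variable names (raw, wrong, right) are made up for this example, and it assumes the data arrived as the single byte 0x92 between "l" and "issue"; your actual input path may differ.

```ruby
# Raw bytes as they might have arrived: 0x92 is ’ in CP-1252.
# String#b gives a binary (ASCII-8BIT) copy, so no encoding is assumed yet.
raw = "l\x92issue".b

# Wrong assumption: ISO-8859-1 maps byte 0x92 to the control codepoint U+0092.
wrong = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# Correct assumption: Windows-1252 maps byte 0x92 to U+2019.
right = raw.dup.force_encoding('Windows-1252').encode('UTF-8')

puts format('U+%04X', wrong[1].ord)  # prints U+0092
puts format('U+%04X', right[1].ord)  # prints U+2019
```

The same bytes give two different UTF-8 strings depending only on which source encoding Ruby is told to assume.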

The real fix for you is to be careful about the encodings of any strings that enter (and leave) your program, and how you manage them.
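One way to be careful at the boundary, sketched here under the assumption that the data comes from a file you know to be CP-1252, is to state the external encoding when reading and let Ruby transcode to UTF-8 for you. The helper name read_cp1252_as_utf8 and the temporary sample file are hypothetical and exist only to make the example self-contained.

```ruby
require 'tempfile'

# Hypothetical helper: read a file known to be CP-1252, returning UTF-8 text.
# The "external:internal" form tells Ruby to transcode while reading.
def read_cp1252_as_utf8(path)
  File.read(path, encoding: 'Windows-1252:UTF-8')
end

# Build a sample file containing the CP-1252 bytes for l’issue.
sample = Tempfile.new('cp1252-sample')
sample.binmode
sample.write("l\x92issue".b)
sample.close

text = read_cp1252_as_utf8(sample.path)
sample.unlink

puts text.encoding                  # prints UTF-8
puts format('U+%04X', text[1].ord)  # prints U+2019
```

Declaring the encoding where the string enters the program means nothing downstream has to guess or repair it later.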

To transform your string to l’issue, you can encode it back to ISO-8859-1, then use force_encoding to tell Ruby the real encoding of CP-1252, and then you can re-encode to UTF-8:

2.1.0 :001 > s = "l\u0092issue"
 => "l\u0092issue" 
2.1.0 :002 > s = s.encode('iso-8859-1')
 => "l\x92issue" 
2.1.0 :003 > s.force_encoding('cp1252')
 => "l\x92issue" 
2.1.0 :004 > s.encode('utf-8')
 => "l’issue"

This is really only a demonstration of what is going on, though. The real solution is to make sure you’re handling encodings correctly.

answered by matt, Oct 05 '22 05:10