Convert escaped unicode (\u008E) to accented character (Ž) in Ruby?

Tags:

ruby

encoding

I am having a very difficult time with this:

# contained within:
"MA\u008EEIKIAI"

# should be
"MAŽEIKIAI"

# nature of string
$ p string3
"MA\u008EEIKIAI" 

$ puts string3
MAEIKIAI

$ string3.inspect
"\"MA\\u008EEIKIAI\""

$ string3.bytes
#<Enumerator: "MA\u008EEIKIAI":bytes>

Any ideas on where to start?

Note: this is not a duplicate of my previous question.

asked Jun 11 '13 12:06

2 Answers

\u008E means that the Unicode character with the code point 8E (in hex) appears at that point in the string. This character is the control character “SINGLE SHIFT TWO” (see the code chart (pdf)). The character Ž is at the code point U+017D. However, it is at position 8E in the Windows CP-1252 encoding. Somehow you’ve got your encodings mixed up.
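To see the mismatch concretely, here is a small sketch that prints the code point and UTF-8 bytes of both characters. The helper name char_info is made up for illustration; the calls it uses are standard String methods.

```ruby
# Hypothetical helper: report the code point and UTF-8 bytes of one character.
def char_info(char)
  {
    codepoint: format('U+%04X', char.ord),
    utf8_bytes: char.encode('UTF-8').bytes.to_a
  }
end

p char_info("\u008E")  # the control character that is actually in the string
p char_info("\u017D")  # the intended letter, Ž

# The single byte 0x8E, read as Windows-1252, is Ž.
puts [0x8E].pack('C').force_encoding('Windows-1252').encode('UTF-8')
```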

The easiest way to “fix” this is probably just to open the file containing the string (or the database record or whatever) and edit it to be correct. The real solution will depend on where the string in question came from and how many bad strings you have.

Assuming the string is in UTF-8 encoding, \u008E will consist of the two bytes c2 and 8e. Note that the second byte, 8e, is the same as the encoding of Ž in CP-1252. One way to convert the string would be something like this:

string3.force_encoding('BINARY') # treat the string just as bytes for now
string3.gsub!(/\xC2/n, '')       # remove the C2 byte
string3.force_encoding('CP1252') # give the string the correct encoding
string3.encode!('UTF-8')         # convert to the desired encoding, in place

Note that this isn’t a general solution to fix all issues like this. Not all CP-1252 characters, when mangled and expressed in UTF-8 this way, will be amenable to conversion like this. Some will be two bytes c2 xx where xx is the correct byte (like in this case); others will be c3 yy where yy is a different byte.
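A different approach that avoids stripping bytes by hand is to round-trip through ISO-8859-1. This is only a sketch, not the method above: it assumes the text was originally CP-1252 bytes that were misread as ISO-8859-1, and that every character in the string is at or below U+00FF. The helper name repair_cp1252 is made up for illustration.

```ruby
# Hypothetical helper: undo "CP-1252 bytes misread as ISO-8859-1".
# Assumption: every character in the input is at or below U+00FF; anything
# else raises Encoding::UndefinedConversionError in the first step.
def repair_cp1252(str)
  str.encode('ISO-8859-1')           # back to the original single bytes
     .force_encoding('Windows-1252') # relabel those bytes as CP-1252
     .encode('UTF-8')                # transcode to proper UTF-8
end

puts repair_cp1252("MA\u008EEIKIAI")
```

Characters that come out as c3 yy in UTF-8 (U+00C0 to U+00FF) sit at the same positions in both encodings, so they should pass through this round trip unchanged.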

answered Sep 30 '22 09:09

matt

What about using Regexp & String#pack to convert the Unicode escape?

str = "MA\\u008EEIKIAI"
puts str    #=> MA\u008EEIKIAI

str.gsub!(/\\u(\h{4})/) do
  [$1.to_i(16)].pack('U')   # hex code point -> UTF-8 character
end
puts str    #=> MA EIKIAI (the gap is U+008E, a control character, not Ž)
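
Unescaping alone yields the control character U+008E rather than Ž. One possible extension, sketched under the assumption that escaped code points 0x80 to 0x9F are really Windows-1252 bytes (the helper name unescape_cp1252 is hypothetical):

```ruby
# Hypothetical helper: turn literal \uXXXX escapes into characters.
# Assumption: code points 0x80..0x9F in the escapes are mislabelled
# Windows-1252 bytes, so they are transcoded instead of used directly.
# Bytes that Windows-1252 leaves undefined (such as 0x81) may raise
# Encoding::UndefinedConversionError.
def unescape_cp1252(str)
  str.gsub(/\\u(\h{4})/) do
    code = Regexp.last_match(1).to_i(16)
    if (0x80..0x9F).cover?(code)
      [code].pack('C').force_encoding('Windows-1252').encode('UTF-8')
    else
      [code].pack('U')
    end
  end
end

puts unescape_cp1252("MA\\u008EEIKIAI")
```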

answered Sep 30 '22 09:09

Arie Xiao

Damien Roche
