How to check whether a string is UTF-8 encoded in Ruby / Ruby on Rails?

1 Answer

Check UTF-8 Validity

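For most multi-byte encodings, invalid byte sequences can be detected programmatically. Since Ruby 2.0, source files default to UTF-8, so string literals are treated as UTF-8 unless stated otherwise (strings read from files or sockets are tagged with Encoding.default_external, which is usually UTF-8 as well). You can therefore check whether a string holds valid UTF-8 with String#valid_encoding?: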

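# encoding: UTF-8
# -------------------------------------------
str = "Partly valid\xE4 UTF-8 encoding: äöüß"   # \xE4 is a stray non-UTF-8 byte

str.valid_encoding?
   # => false

# String#scrub (Ruby 2.1+) replaces invalid bytes, here with an empty string
str.scrub('').valid_encoding?
   # => true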
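
One caveat with the check above: bytes that come from a socket or a binary read are often tagged as ASCII-8BIT (binary), and valid_encoding? is always true for that tag. The following is a minimal sketch, assuming a hypothetical helper name utf8_bytes? (it is not part of Ruby), that relabels a copy of the string as UTF-8 before checking it:

```ruby
# Hypothetical helper (not part of Ruby): true if the raw bytes of +data+
# form valid UTF-8, regardless of how the string is currently tagged.
def utf8_bytes?(data)
  # dup so the caller's string keeps its encoding tag; force_encoding
  # only relabels the bytes, it does not convert them.
  data.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

raw_ok  = "gr\xC3\xBC\xC3\x9F".b   # UTF-8 bytes for "grüß", tagged binary
raw_bad = "gr\xFC\xDF".b           # CP1252 bytes for "grüß", tagged binary

puts utf8_bytes?(raw_ok)
puts utf8_bytes?(raw_bad)
```

Working on a copy is a deliberate choice here: force_encoding mutates its receiver, and callers usually do not expect a predicate method to change the tag of the string they passed in.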

Convert Encoding

If a string is not valid UTF-8 but you know its actual character encoding, you can convert it to UTF-8.

Example
Sometimes you know that an input file is encoded in either UTF-8 or CP1252 (a.k.a. Windows-1252), but not which of the two.
Check which encoding it is and convert to UTF-8 if necessary:

# encoding: UTF-8
# ------------------------------------------------------
test = "String in CP1252 encoding: \xE4\xF6\xFC\xDF"
File.open( 'input_file', 'w' ) {|f| f.write(test)}

# Read as UTF-8 explicitly so the check does not depend on Encoding.default_external
str  = File.read( 'input_file', encoding: 'UTF-8' )

unless str.valid_encoding?
  # Reinterpret the bytes as CP1252 and transcode them to UTF-8
  str.encode!( 'UTF-8', 'CP1252', invalid: :replace, undef: :replace, replace: '?' )
end #unless
   # => "String in CP1252 encoding: äöüß"
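
The same "UTF-8, else CP1252" decision can be wrapped in a small function. This is only a sketch under the assumption stated above (the input is one of exactly two known encodings); the name to_utf8 and its fallback parameter are hypothetical, not part of Ruby:

```ruby
# Hypothetical helper: returns +data+ as a valid UTF-8 string, assuming the
# bytes are either UTF-8 already or encoded in +fallback+ (CP1252 by default).
def to_utf8(data, fallback: 'CP1252')
  utf8 = data.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Reinterpret the original bytes as +fallback+ and transcode them.
  # Bytes with no mapping in the fallback encoding become '?'.
  data.encode(Encoding::UTF_8, fallback,
              invalid: :replace, undef: :replace, replace: '?')
end

cp1252_bytes = "\xE4\xF6\xFC\xDF".b      # "äöüß" in CP1252
utf8_bytes   = "\xC3\xA4\xC3\xB6".b      # "äö" in UTF-8

puts to_utf8(cp1252_bytes)
puts to_utf8(utf8_bytes)
```

The UTF-8 check has to come first: almost any byte sequence decodes "successfully" as CP1252, so trying the fallback first would silently mangle genuine UTF-8 input.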

=======
Notes

Most multi-byte encodings like UTF-8 can be detected programmatically (in Ruby, see: #valid_encoding?) with fairly high reliability. After only 16 bytes, the probability of a random byte sequence being valid UTF-8 is only about 0.01%. (Compare this with relying on the UTF-8 BOM, which many files do not have.)
However, it is NOT easily possible to programmatically detect the (in)validity of single-byte encodings like CP1252 or ISO-8859-1, because almost every byte value maps to some character in them. Thus the above code snippet does not work the other way around, i.e. for detecting whether a String is valid CP1252.
Even though UTF-8 has become increasingly popular as the default encoding on the web, CP1252 and other Latin1 flavors are still very popular in Western countries, especially in North America. Be aware that there are several single-byte encodings that are very similar to CP1252 (a.k.a. Windows-1252) but differ from it slightly. Examples: ISO-8859-1, ISO-8859-15
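
To illustrate why the single-byte encodings mentioned in these notes cannot be told apart by validation alone, here is a small sketch (the helper name decode_as is hypothetical). It decodes the same single byte, 0xA4, under three of them; every attempt succeeds, but the resulting character is not always the same:

```ruby
# Hypothetical helper: interpret +bytes+ as +encoding_name+ and transcode
# the result to UTF-8.
def decode_as(bytes, encoding_name)
  bytes.encode(Encoding::UTF_8, encoding_name)
end

raw = "\xA4".b   # a single byte, tagged binary

%w[ISO-8859-1 CP1252 ISO-8859-15].each do |name|
  decoded = decode_as(raw, name)
  # Print the code point rather than the character so the output does not
  # depend on the terminal's encoding.
  puts format('%-12s U+%04X', name, decoded.ord)
end
```

ISO-8859-1 and CP1252 both map 0xA4 to the currency sign (U+00A4), while ISO-8859-15 maps it to the euro sign (U+20AC). No decoding step fails, so choosing between such encodings requires outside knowledge or heuristics on the decoded text.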

answered Sep 28 '22 02:09

Andreas Rayo Kniep

Tags:

ruby

ruby-on-rails

loganathan
