
How to check whether the character is utf-8

How can I check whether a string is UTF-8 encoded, using Ruby or Ruby on Rails?

loganathan asked Dec 26 '11 12:12



1 Answer

Check UTF-8 Validity

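For most multi-byte encodings, invalid byte sequences can be detected programmatically. Since Ruby 2.0, string literals are UTF-8 by default (and strings read from files are tagged with Encoding.default_external, which is usually UTF-8 as well), so you can check whether a string is valid UTF-8 with String#valid_encoding?: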

# encoding: UTF-8
# -------------------------------------------
str = "Partly valid\xE4 UTF-8 encoding: äöüß"

str.valid_encoding?
   # => false

# String#scrub (Ruby 2.1+) returns a copy with the invalid bytes removed
str.scrub('').valid_encoding?
   # => true
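
The check above relies on the string already being tagged as UTF-8. If the data arrives tagged as binary (for example from a socket or from a file opened in binary mode), a small wrapper can relabel a copy first. This is only a sketch: `utf8_bytes?` is a hypothetical helper name, not part of Ruby's core API.

```ruby
# Hypothetical helper (not a core Ruby method): reports whether the raw
# bytes of +str+ form valid UTF-8, whatever encoding the string is tagged with.
def utf8_bytes?(str)
  # dup keeps the caller's string untouched; force_encoding only relabels
  # the bytes, it does not convert them.
  str.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

binary = "\xC3\xA4".b   # the two UTF-8 bytes of "ä", tagged as binary
latin1 = "\xE4".b       # "ä" as a single CP1252 / ISO-8859-1 byte

utf8_bytes?(binary)                   # => true
utf8_bytes?(latin1)                   # => false
binary.encoding == Encoding::BINARY   # => true (the original is unchanged)
```

Working on a copy matters here because force_encoding mutates its receiver.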

Convert Encoding

If a string is not valid UTF-8 but you know its actual character encoding, you can convert it to UTF-8.

Example
Sometimes you know that an input file is encoded in either UTF-8 or CP1252 (a.k.a. Windows-1252).
Check which of the two it is and convert to UTF-8 if necessary:

# encoding: UTF-8
# ------------------------------------------------------
test = "String in CP1252 encoding: \xE4\xF6\xFC\xDF"
File.open( 'input_file', 'w' ) {|f| f.write(test)}

str  = File.read( 'input_file' )

# Not valid UTF-8, so assume CP1252 and convert in place
unless str.valid_encoding?
  str.encode!( 'UTF-8', 'CP1252', invalid: :replace, undef: :replace, replace: '?' )
end #unless
   # => "String in CP1252 encoding: äöüß"
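
The same idea can be wrapped in a function that works on any string without going through a file. This is a minimal sketch under the same assumption as above (the input is either UTF-8 or one known fallback encoding). `to_utf8` and its `fallback:` parameter are hypothetical names chosen for illustration.

```ruby
# Hypothetical helper: returns a UTF-8 copy of +bytes+, assuming the input
# is either UTF-8 already or encoded in +fallback+ (Windows-1252 by default).
def to_utf8(bytes, fallback: 'Windows-1252')
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Not valid UTF-8: reinterpret the same bytes as the fallback encoding
  # and transcode. Bytes with no mapping are replaced by "?".
  bytes.dup.force_encoding(fallback)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '?')
end

to_utf8("\xE4\xF6\xFC\xDF".b)   # => "äöüß"
to_utf8("äöüß")                 # => "äöüß" (already valid UTF-8)
```

Unlike encode!, this version leaves the argument untouched and always returns a new string.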

=======
Notes

  • Most multi-byte encodings such as UTF-8 can be detected programmatically (in Ruby, see String#valid_encoding?) with fairly high reliability. After only 16 bytes, the probability that a random byte sequence is valid UTF-8 is about 0.01%. (Compare this with relying on the UTF-8 BOM, which is optional and often absent.)

  • However, it is NOT easily possible to programmatically detect the (in)validity of single-byte encodings like CP1252 or ISO-8859-1, because nearly every byte value maps to some character in them. Thus the above code snippet does not work the other way around, i.e. it cannot detect whether a String is valid CP1252.

  • Even though UTF-8 has become increasingly popular as the default encoding on the web, CP1252 and other Latin-1 flavors are still very popular in Western countries, especially in North America. Be aware that there are several single-byte encodings that are very similar to CP1252 (a.k.a. Windows-1252) but differ slightly. Examples: ISO-8859-1, ISO-8859-15
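
The last two notes can be illustrated with a short sketch: valid_encoding? accepts the same byte under each of these single-byte encodings, yet the decoded character differs. `decode` is a hypothetical helper used only for this demonstration.

```ruby
byte_e4 = "\xE4".b   # one raw byte, tagged as binary

# Every one of these encodings accepts the byte, so valid_encoding?
# cannot tell them apart.
%w[Windows-1252 ISO-8859-1 ISO-8859-15].map do |name|
  byte_e4.dup.force_encoding(name).valid_encoding?
end
# => [true, true, true]

# Hypothetical helper: interpret +byte+ as encoding +from+ and transcode to UTF-8.
def decode(byte, from)
  byte.dup.force_encoding(from).encode('UTF-8')
end

# The same byte value can still stand for different characters:
decode("\xA4".b, 'ISO-8859-1')     # => "¤"
decode("\xA4".b, 'ISO-8859-15')    # => "€"
decode("\x80".b, 'Windows-1252')   # => "€"
```

So when the fallback encoding is guessed wrong, the conversion still succeeds but may silently produce the wrong characters.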

Andreas Rayo Kniep answered Sep 28 '22 02:09