 

Does anyone know what's going on with "weird" characters at the beginning of a string?

Tags:

ruby

I've just run into a strange issue while trying to detect a certain string among an array of them. Does anyone know what's going on here?

(rdb:1) p magic_string
"Time Period"
(rdb:1) p magic_string.class
String
(rdb:1) p magic_string == "Time Period"
false
(rdb:1) p "Time Period".length
11
(rdb:1) p magic_string.length
14
(rdb:1) p magic_string[0].chr
"\357"
(rdb:1) p magic_string[1].chr
"\273"
(rdb:1) p magic_string[2].chr
"\277"
(rdb:1) p magic_string[3].chr
"T"
asked Jul 13 '15 by Carlos Murdock

2 Answers

Your string starts with a 3-byte byte order mark (BOM), the bytes \357\273\277 (EF BB BF) you see in your debugger output, which indicates that the encoding is UTF-8.

Q: What is a BOM?

A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.

source
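A minimal sketch (not part of the original answer) of stripping a leading BOM so the comparison succeeds; "\uFEFF" is the BOM character, which UTF-8 encodes as the three bytes \357\273\277:

# Simulate the string from the question by prepending the BOM explicitly.
magic_string = "\uFEFF" + "Time Period"

magic_string == "Time Period"                      # => false
magic_string.sub(/\A\uFEFF/, "") == "Time Period"  # => true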

answered Sep 22 '22 by shivam


This might help you understand what's happening:

# encoding: UTF-8
RUBY_VERSION        # => "1.9.3"
magic_string = "Time Period"   # the literal begins with an invisible BOM (U+FEFF)
magic_string[0].chr # => "\uFEFF"

The output is the same with Ruby v2.2.2.

Older versions of Ruby didn't default to UTF-8 and treated strings as arrays of bytes. The encoding magic comment tells Ruby what encoding the script's string literals are in.

Ruby now correctly treats strings as sequences of characters, not bytes, which is why it reports the first character as "\uFEFF", a single character that occupies three bytes when encoded as UTF-8.
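To tie this back to the 14-versus-11 lengths in the question (a quick illustration, assuming the same BOM-prefixed string as above):

magic_string.length   # => 12 characters: the BOM plus "Time Period"
magic_string.bytesize # => 14 bytes: the BOM alone takes three bytes in UTF-8

The question's debugger, reporting a length of 14, was counting bytes, which is what older Rubies did.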

"\uFEFF" and "\uFFFE" are BOM markers showing which "endian" the characters are. Endianness is tied to the CPU's idea of what a most significant and least significant byte is in a word (two bytes typically). This is also tied to Unicode, both of which are something you need to understand, at least in a rudimentary way as we don't deal with only ASCII any more, and languages don't consist of only the Latin character set.

UTF-8 is a multibyte encoding that covers a huge number of characters from many languages. You can also run into UTF-16LE, UTF-16BE, and wider encodings; HTML and other documents on the internet can arrive in any of them depending on the originating system, and not being aware of that can drive you nuts and send you down the wrong path when you try to read their content. It's worth reading the "IO Encoding" section of the IO class documentation to learn the right way to deal with these kinds of files.
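For example, Ruby's IO layer can consume a BOM for you when the open mode asks for it; a small sketch ("data.txt" is just a placeholder file name):

# "r:bom|utf-8" strips a leading BOM, if present, and tags the data as UTF-8.
text = File.read("data.txt", mode: "r:bom|utf-8")
text.start_with?("\uFEFF")  # => false, the BOM has been consumed

# Plain "r:utf-8" keeps the BOM as the first character of the string.
raw = File.read("data.txt", mode: "r:utf-8")
raw.start_with?("\uFEFF")   # => true if the file began with a BOM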

answered Sep 19 '22 by the Tin Man