I've just run into a strange issue when trying to detect a certain string in an array of strings. Does anyone know what's going on here?
(rdb:1) p magic_string
"Time Period"
(rdb:1) p magic_string.class
String
(rdb:1) p magic_string == "Time Period"
false
(rdb:1) p "Time Period".length
11
(rdb:1) p magic_string.length
14
(rdb:1) p magic_string[0].chr
"\357"
(rdb:1) p magic_string[1].chr
"\273"
(rdb:1) p magic_string[2].chr
"\277"
(rdb:1) p magic_string[3].chr
"T"
Your string contains 3 extra bytes at the beginning, a byte order mark (BOM), indicating that the encoding is UTF-8.
Q: What is a BOM?
A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.
(source: the Unicode Consortium's UTF & BOM FAQ)
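The practical fix is to strip the BOM before comparing. A minimal sketch (magic_string is the variable from the question; the anchored sub works on Ruby 1.9+, where strings know their encoding):

# Remove a leading BOM, if present, then compare as usual.
clean = magic_string.sub(/\A\uFEFF/, '')
clean == "Time Period" # => true
# On Ruby 1.8, where strings are byte arrays, strip the raw bytes instead:
#   clean = magic_string.sub(/\A\xEF\xBB\xBF/, '')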
This might help you to understand what's happening:
# encoding: UTF-8
RUBY_VERSION # => "1.9.3"
magic_string = "\uFEFFTime Period" # the BOM was invisible in the original paste
magic_string[0].chr # => "\uFEFF"
The output is the same on Ruby 2.2.2.
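On 1.9 and later you can see the character/byte distinction directly (a sketch assuming the BOM-prefixed magic_string from above):

magic_string.length   # => 12 (characters: the BOM plus "Time Period")
magic_string.bytesize # => 14 (bytes: the 3-byte BOM plus 11 ASCII bytes)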
Older versions of Ruby didn't default to UTF-8 and treated strings as arrays of bytes. The "# encoding: UTF-8" magic comment tells the interpreter what encoding the script's string literals are in (Ruby 2.0 and later assume UTF-8 source by default; Ruby 1.9 assumes US-ASCII without the comment).
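Checking a literal's encoding makes the comment's effect visible (a sketch; the error text shown is what Ruby 1.9 emits):

# With "# encoding: UTF-8" at the top of the file:
"Time Period".encoding # => #<Encoding:UTF-8>
# Without it, Ruby 1.9 assumes US-ASCII source and fails to parse any
# literal containing raw non-ASCII bytes:
#   invalid multibyte char (US-ASCII)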
Ruby now correctly treats strings as sequences of characters, not bytes, which is why it reports the first character as "\uFEFF": a single character that occupies three bytes when encoded as UTF-8 (the "\357\273\277" octal bytes in the question's output). "\uFEFF" is the BOM character; read with the wrong byte order it comes out as "\uFFFE", which is how a decoder detects which "endianness" the text uses. Endianness is tied to the CPU's notion of the most significant and least significant byte in a word (typically two bytes). Both it and Unicode are things you need to understand, at least at a rudimentary level, because we no longer deal only with ASCII, and languages don't consist solely of the Latin character set.
UTF-8 is a multibyte character set that incorporates a huge number of characters from many languages. You can also run into UTF-16LE, UTF-16BE, or UTF-32; HTML and documents on the internet can be encoded with varying character widths depending on the originating system, and not being aware of that can drive you nuts and send you down the wrong paths when trying to read their content. It's important to read the "IO Encoding" section of the IO class documentation to learn the right way to deal with these kinds of files.
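Ruby's IO layer can do the BOM handling for you: an external encoding of "BOM|UTF-8" consumes a leading BOM if one is present (an open-mode feature described in the IO Encoding docs). A sketch, where "data.txt" is a placeholder path:

# Read a file, stripping a UTF-8 BOM if the file starts with one.
text = File.read("data.txt", mode: "r:BOM|UTF-8")
text.start_with?("\uFEFF") # => false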