I have a problem with UTF-8 encoding. I have read some posts here, but it still does not work properly.
This is my code:
#!/bin/env ruby
#encoding: utf-8

def determine
  file = File.open("/home/lala.txt")
  file.each do |line|
    puts(line)
    type = line.match(/DOG/)
    puts('aaaaa')
    if type != nil
      puts(type[0])
      break
    end
  end
end
These are the first 3 lines of my file:
;?lalalalal60000065535-1362490443-0000006334-0000018467-0000000041en-lalalalallalalalalalalalaln Cell Generation
text/lalalalala1.0.0.1515
text/lalalala�DOG
When I run this code, it shows an error exactly when reading the third line of the file (the line where the word DOG stands):
;?lalalalal60000065535-1362490443-0000006334-0000018467-0000000041en-lalalalallalalalalalalalaln Cell Generation
aaaaa
text/lalalalala1.0.0.1515
aaaaa
text/lalalala�DOG
/home/kik/Desktop/determine2.rb:16:in `match': invalid byte sequence in UTF-8 (ArgumentError)
BUT: if I run just the determine function with the following content:
#!/bin/env ruby
#encoding: utf-8

def determine
  type = "text/lalalala�DOG".match(/DOG/)
  puts(type)
end
it works perfectly.
What is going wrong there? Thanks in advance!
EDIT: The third line in the file is:
text/lalalal»DOG
BUT when I print the third line of the file in Ruby, it shows up as:
text/lalalala�DOG
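For anyone hitting the same thing, a quick diagnostic sketch (not part of the original question; the index 2 simply picks the third line) shows what Ruby actually read there and whether it is valid UTF-8:

# Diagnostic sketch: inspect what Ruby actually read on the third line.
line = File.open("/home/lala.txt", "r:UTF-8") { |f| f.readlines[2] }

puts line.encoding          # the encoding the string is tagged with (UTF-8 here)
puts line.valid_encoding?   # false if the byte shown as "�" is not valid UTF-8
p    line.bytes             # the raw byte values, so the offending byte is visible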
EDIT2:
This format was also developed to support localization. Strings stored within the file are stored as 2-byte UNICODE characters. The format of the file is a binary file with data stored in network byte order (big-endian format).
UTF-8 is an 8-bit variable-width encoding: it uses 1, 2, 3, or 4 bytes to represent a single Unicode code point. It is backward-compatible with ASCII: the first 128 Unicode code points (0-127) are encoded as the same single bytes as in ASCII, so existing ASCII text is already valid UTF-8. All other characters use two to four bytes.
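You can see the variable width directly in Ruby; the characters below are only illustrative examples, not taken from the file:

#encoding: utf-8
# bytesize shows how many bytes UTF-8 needs for each single character.
puts "A".bytesize    # => 1 (ASCII range, same byte as in ASCII)
puts "»".bytesize    # => 2
puts "€".bytesize    # => 3
puts "😀".bytesize   # => 4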
I believe @Amadan is close, but has it backwards. I'd do this:
File.open("/home/lala.txt", "r:ASCII-8BIT")
The character is not valid UTF-8, but for your purposes, it looks like 8-bit ASCII will work fine. My understanding is that Ruby is using that encoding by default when you just use the string, which is why that works.
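For completeness, here is the determine method from the question with that open mode plugged in. This is just a sketch and has not been run against the real file:

def determine
  # ASCII-8BIT (a.k.a. BINARY) treats every byte as an opaque 8-bit value,
  # so the stray byte before "DOG" no longer trips UTF-8 validation.
  File.open("/home/lala.txt", "r:ASCII-8BIT") do |file|
    file.each do |line|
      puts(line)
      type = line.match(/DOG/)   # /DOG/ is ASCII-only, so this match is safe
      if type != nil
        puts(type[0])
        break
      end
    end
  end
end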
Update: Based on your most recent comment, it sounds like this is what you need:
File.open("/home/lala.txt", "rb:UTF-16BE")
Try using this:
File.open("/home/lala.txt", "r:UTF-8")
There seems to be an issue with the wrong encoding being used at some stage. The #encoding: utf-8 magic comment specifies only the encoding of the source file, which affects how the literal string is interpreted; it has no effect on the encoding that File.open uses.
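A small demonstration of that distinction (the path is the one from the question; everything else is illustrative):

#encoding: utf-8

# The magic comment controls how literals in this source file are tagged:
puts "text/lalalala".encoding                  # => UTF-8

# File.open ignores the magic comment; without an explicit mode it falls
# back to Encoding.default_external, which is a separate setting:
puts Encoding.default_external
File.open("/home/lala.txt") { |f| puts f.external_encoding }

# Passing the encoding in the mode string makes the choice explicit:
File.open("/home/lala.txt", "r:UTF-8") { |f| puts f.external_encoding }  # => UTF-8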