
ruby, `match': invalid byte sequence in UTF-8

Tags: ruby, utf-8

I have a problem with UTF-8 encoding. I have read some posts here, but somehow it still does not work properly.

This is my code:

#!/bin/env ruby
#encoding: utf-8

def determine
  file=File.open("/home/lala.txt")          
  file.each do |line|           
    puts(line)
    type = line.match(/DOG/)
    puts('aaaaa')

    if type != nil 
      puts(type[0])
      break
    end        

  end
end

These are the first 3 lines of my file:

;?lalalalal60000065535-1362490443-0000006334-0000018467-0000000041en-lalalalallalalalalalalalaln Cell Generation
text/lalalalala1.0.0.1515
text/lalalala�DOG

When I run this code it shows me an error exactly when reading the third line of the file (where the word DOG appears):

;?lalalalal60000065535-1362490443-0000006334-0000018467-0000000041en-lalalalallalalalalalalalaln Cell Generation
aaaaa

text/lalalalala1.0.0.1515
aaaaa

text/lalalala�DOG
/home/kik/Desktop/determine2.rb:16:in `match': invalid byte sequence in UTF-8 (ArgumentError)

BUT: if I run just the determine function with the following content:

#!/bin/env ruby
#encoding: utf-8

def determine
  type = "text/lalalala�DOG".match(/DOG/)
  puts(type)
end

it works perfectly.

What is going wrong there? Thanks in advance!

EDIT: The third line in the file is:

text/lalalal»DOG

BUT when I print the third line of the file in Ruby it shows up like:

text/lalalala�DOG

EDIT2:

This format was also developed to support localization. Strings stored within the file are stored as 2-byte UNICODE characters. The format of the file is a binary file with data stored in network byte order (big-endian format).

Alina asked Mar 14 '13


2 Answers

I believe @Amadan is close, but has it backwards. I'd do this:

File.open("/home/lala.txt", "r:ASCII-8BIT")

The character is not valid UTF-8, but for your purposes, it looks like 8-bit ASCII will work fine. My understanding is that Ruby is using that encoding by default when you just use the string, which is why that works.
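A minimal sketch of the difference (not from the answer; `\xBB` stands in for the corrupt byte in the file):

```ruby
bad = "text/lalalala\xBBDOG"    # source is UTF-8, so this string claims
p bad.valid_encoding?           # UTF-8 but isn't valid => false

begin
  bad.match(/DOG/)              # raises, just as in the question
rescue ArgumentError => e
  puts e.message                # "invalid byte sequence in UTF-8"
end

# Reinterpreted as ASCII-8BIT (binary), the same bytes match fine:
bin = bad.dup.force_encoding("ASCII-8BIT")
p bin.match(/DOG/)[0]           # => "DOG"
```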

Update: Based on your most recent comment, it sounds like this is what you need:

File.open("/home/lala.txt", "rb:UTF-16BE")
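A hypothetical round-trip, with a temp file standing in for `/home/lala.txt`. Note the extra `:UTF-8` in the mode string: it transcodes each line to UTF-8 on read, which matters because Ruby regexps cannot operate on UTF-16 strings:

```ruby
require "tempfile"

Tempfile.create("lala") do |tmp|
  tmp.binmode
  tmp.write("text/lalalala DOG\n".encode("UTF-16BE"))
  tmp.flush

  # "rb:UTF-16BE:UTF-8" — read the raw bytes as UTF-16BE, hand them to
  # the program transcoded to UTF-8 so regexps keep working:
  File.open(tmp.path, "rb:UTF-16BE:UTF-8") do |f|
    f.each_line { |line| puts line[/DOG/] }    # => DOG
  end
end
```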
Darshan Rivka Whittle answered Nov 15 '22

Try using this:

File.open("/home/lala.txt", "r:UTF-8")

There seems to be an issue with the wrong encoding being used at some stage. The magic comment `#encoding: utf-8` specifies only the encoding of the source file itself, which affects how the literal string is interpreted; it has no effect on the encoding that File.open uses.
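If the file really is UTF-8 apart from a few corrupt bytes, one further option (not mentioned in either answer; String#scrub needs Ruby 2.1+) is to repair each line before matching:

```ruby
line  = "text/lalalala\xBBDOG"    # invalid UTF-8, as in the question
clean = line.scrub("?")           # replace each invalid byte with "?"
p clean                           # => "text/lalalala?DOG"
p clean.match(/DOG/)[0]           # => "DOG"
```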

Amadan answered Nov 15 '22