
Ruby Invalid Byte Sequence in UTF-8

I have the following code, which raises an "invalid byte sequence" error pointing to the scan call in initialize. Any ideas on how to fix this? For what it's worth, the error does not occur when I remove the (.*) between <h1 and the closing > from the regex.

#!/usr/bin/env ruby

class NewsParser

  def initialize
    Dir.glob("./**/index.htm") do |file|
      @file = IO.read file

      parsed = @file.scan(/<h1(.*)>(.*?)<\/h1>(.*)<!-- InstanceEndEditable -->/im)
      self.write(parsed)
    end
  end

  def write output
    @contents = output
    open('output.txt', 'a') do |f|
      f << @contents[0][0]+"\n\n"+@contents[0][1]+"\n\n\n\n"
    end
  end

end

p = NewsParser.new

Edit: Here is the error message:

news_parser.rb:10:in 'scan': invalid byte sequence in UTF-8 (ArgumentError)

SOLVED: The combination of using @file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil) and adding the magic comment # encoding: UTF-8 solved the issue.

Thanks!

asked Mar 07 '12 at 19:03 by redgem



1 Answer

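Two changes together solved the issue. First, read each file as ISO-8859-1 and transcode it to UTF-8 before scanning: @file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil). Second, add the magic comment # encoding: UTF-8 at the top of the script (directly below the shebang line) so the source file itself is treated as UTF-8.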
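
For anyone who wants to try this in isolation, here is a minimal sketch of the same idea. It assumes the input files really are ISO-8859-1; the helper name read_as_utf8 and the temporary test file are hypothetical and not part of my original script. I left out replace: nil here for brevity.

```ruby
# encoding: UTF-8
require "tempfile"

# Hypothetical helper (not in the original script): read raw bytes, label
# them with the assumed source encoding, then transcode to UTF-8.
def read_as_utf8(path, source_encoding = "ISO-8859-1")
  raw = File.binread(path)                            # bytes tagged ASCII-8BIT
  raw.force_encoding(source_encoding).encode("UTF-8") # relabel, then convert
end

# Demonstration: a temporary file holding the Latin-1 byte 0xE9 ("é"),
# which is not a valid UTF-8 sequence on its own.
Tempfile.create("latin1") do |tmp|
  tmp.binmode
  tmp.write("<h1 class=\"x\">Caf\xE9</h1>".b)
  tmp.flush

  text = read_as_utf8(tmp.path)
  puts text.valid_encoding?                        # prints "true"
  puts text.scan(/<h1(.*)>(.*?)<\/h1>/im).inspect  # scan runs without raising
end
```

Note that force_encoding only changes the label on the bytes, while encode actually converts them. If the files are in some other encoding (for example Windows-1252), pass that name as the second argument instead.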

answered Oct 21 '22 at 03:10 by redgem