I have the following code, which gives me an invalid byte sequence error pointing to the scan method in initialize. Any ideas on how to fix this? For what it's worth, the error does not occur when the (.*) between the h1 tag and the closing > is not there.
    #!/usr/bin/env ruby

    class NewsParser
      def initialize
        Dir.glob("./**/index.htm") do |file|
          @file = IO.read file
          parsed = @file.scan(/<h1(.*)>(.*?)<\/h1>(.*)<!-- InstanceEndEditable -->/im)
          self.write(parsed)
        end
      end

      def write output
        @contents = output
        open('output.txt', 'a') do |f|
          f << @contents[0][0]+"\n\n"+@contents[0][1]+"\n\n\n\n"
        end
      end
    end

    p = NewsParser.new
Edit: Here is the error message:
news_parser.rb:10:in 'scan': invalid byte sequence in UTF-8 (ArgumentError)
SOLVED: The combination of reading the file with
    @file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
and adding the magic comment
    # encoding: UTF-8
at the top of the script solved the issue.
Thanks!
UTF-8 uses one byte to represent code points from 0-127. These first 128 Unicode code points correspond one-to-one with ASCII character mappings, so ASCII characters are also valid UTF-8 characters.
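To make the ASCII compatibility concrete, here is a minimal Ruby sketch. The helper names ascii_only_bytes? and same_bytes_in_ascii_and_utf8? are made up for this illustration; they are not part of the original script.

```ruby
# Hypothetical helper: true when every byte of the string is in the 0-127 range.
def ascii_only_bytes?(text)
  text.bytes.all? { |byte| byte < 128 }
end

# Hypothetical helper: encodes the same ASCII-only text as US-ASCII and as
# UTF-8, then compares the raw bytes of both results.
def same_bytes_in_ascii_and_utf8?(text)
  text.encode("US-ASCII").bytes == text.encode("UTF-8").bytes
end

puts same_bytes_in_ascii_and_utf8?("<h1>News</h1>")  # prints true
puts ascii_only_bytes?("caf\u00E9")                  # prints false (U+00E9 needs two bytes)
```

This is why a file that only contains plain ASCII never triggers the error: its bytes are already valid UTF-8.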
Explanation: This error occurs when you send text data, but either the source encoding doesn't match that currently set on the database, or the text stream contains binary data like NUL bytes that are not allowed within a string.
UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.
UTF-8 is a multibyte encoding able to encode the whole Unicode charset. An encoded character takes between 1 and 4 bytes. The original UTF-8 design allowed longer sequences of up to 6 bytes, but RFC 3629 restricts the encoding to 4 bytes, which is enough for the largest Unicode code point (U+10FFFF).
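The 1 to 4 byte range can be checked from Ruby. This is a small sketch; utf8_length_of_codepoint is a hypothetical helper name, and it relies on Array#pack with the "U" directive, which produces the UTF-8 encoding of a code point.

```ruby
# Hypothetical helper: number of bytes UTF-8 needs for a given code point.
def utf8_length_of_codepoint(codepoint)
  [codepoint].pack("U").bytesize
end

# One sample code point for each sequence length, plus the largest code point.
[0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF].each do |codepoint|
  puts format("U+%04X takes %d byte(s)", codepoint, utf8_length_of_codepoint(codepoint))
end
```

The sample code points are U+0041 (A), U+00E9 (é), U+20AC (€), U+1F600 (an emoji) and U+10FFFF; they need 1, 2, 3, 4 and 4 bytes respectively.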
The combination of reading the file with
    @file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
and adding the magic comment
    # encoding: UTF-8
at the top of the script solved the issue.
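For anyone who wants to see why this works, here is a rough, self-contained sketch that does not need any files. It assumes the pages were saved as ISO-8859-1 (Latin-1); the sample string and the helper name latin1_to_utf8 are invented for the demonstration.

```ruby
# Hypothetical helper: reinterpret the raw bytes as ISO-8859-1, then transcode
# them to UTF-8. This mirrors the force_encoding + encode fix quoted above.
def latin1_to_utf8(raw)
  raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# "café" the way a Latin-1 file stores it: é is the single byte 0xE9.
raw = "<h1>caf\xE9</h1>".b

# Labelling those bytes as UTF-8 gives a string Ruby considers broken,
# because 0xE9 starts a multibyte sequence that never completes.
broken = raw.dup.force_encoding("UTF-8")
puts broken.valid_encoding?   # prints false

# Regex methods such as scan normally refuse to work on a broken string.
begin
  broken.scan(/<h1>(.*?)<\/h1>/i)
rescue ArgumentError => e
  puts e.message
end

fixed = latin1_to_utf8(raw)
puts fixed.valid_encoding?    # prints true
p fixed.scan(/<h1>(.*?)<\/h1>/i)
```

If the files really are Latin-1, File.read(path, encoding: "ISO-8859-1:UTF-8") should give the same result in one step, since the encoding option accepts an external:internal pair. When the source encoding is unknown, String#valid_encoding? is a cheap check to run before calling scan.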