
`scan': invalid byte sequence in UTF-8 (ArgumentError)

I'm trying to read a .txt file in Ruby and split the text line by line.

Here is my code:

def file_read(filename)
  File.open(filename, 'r').read
end

puts f = file_read('alice_in_wonderland.txt')

This works perfectly. But when I add the method line_cutter like this:

def file_read(filename)
  File.open(filename, 'r').read
end

def line_cutter(file)
  file.scan(/\w/)
end

puts f = line_cutter(file_read('alice_in_wonderland.txt'))

I get an error:

`scan': invalid byte sequence in UTF-8 (ArgumentError)

I found this fix online, written for input from untrusted websites, and tried to adapt it to my own code, but it's not working. How can I get rid of this error?

Link to the file: File

asked Mar 18 '16 14:03 by anonn023432



2 Answers

The linked text file contains the following line:

Character set encoding: ISO-8859-1

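If converting the file isn't desired or possible, you have to tell Ruby that it is ISO-8859-1 encoded. Otherwise the default external encoding is used (UTF-8 in your case). Reading still succeeds because the bytes are not validated at that point, but scan has to interpret them as UTF-8 and raises as soon as it meets a byte sequence that isn't valid. A possible way to declare the encoding is: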

s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding  # => #<Encoding:ISO-8859-1>

Or even like this if you prefer your string UTF-8 encoded (see utf8everywhere.org):

s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding  # => #<Encoding:UTF-8>
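
To see the effect end to end without the original file, here is a minimal sketch. It assumes a made-up sample ("café au lait" stored as ISO-8859-1 bytes in a temporary file) in place of alice_in_wonderland.txt; the variable names are illustrative only.

```ruby
require 'tempfile'

# Hypothetical sample: "café au lait" stored as ISO-8859-1,
# where é is the single byte 0xE9 (not valid UTF-8 on its own).
sample = Tempfile.new('latin1_sample')
sample.binmode
sample.write("caf\xE9 au lait\n".b)
sample.close

# Reading as UTF-8 succeeds, but the resulting string is not valid UTF-8.
raw = File.read(sample.path, encoding: 'UTF-8')
raw_is_valid = raw.valid_encoding?

# Matching a regexp against the invalid string is what raises.
error_message = nil
begin
  raw.scan(/\w/)
rescue ArgumentError => e
  error_message = e.message
end

# Declaring the real encoding and transcoding to UTF-8 avoids the error.
text = File.read(sample.path, encoding: 'ISO-8859-1:UTF-8')
chars = text.scan(/\w/)

sample.unlink
```

Note that \w only matches ASCII word characters in Ruby, so the é is skipped by scan even after transcoding; [[:alpha:]] would be the Unicode-aware alternative.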

answered Sep 22 '22 00:09 by cremno


It seems to work if you read the file directly from the page; maybe there's something funny about the local copy you have. Try this:

require 'net/http'

uri = 'http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt'
scanned = Net::HTTP.get_response(URI.parse(uri)).body.scan(/\w/)
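
A likely reason this doesn't raise (an assumption on my part, not something stated in the answer) is that Net::HTTP normally hands back the body as a binary (ASCII-8BIT) string rather than one tagged as UTF-8, so scan never tries to validate it as UTF-8. The sketch below imitates that locally with a made-up byte string instead of a real download:

```ruby
# Stand-in for a response body (assumption: binary-tagged bytes in ISO-8859-1).
body = "caf\xE9 au lait\n".b
body_is_binary = (body.encoding == Encoding::BINARY)

# A binary string is never "invalid", so an ASCII-only regexp scans it without error.
binary_chars = body.scan(/\w/)

# The same bytes tagged as UTF-8 reproduce the error from the question.
utf8_tagged = body.dup.force_encoding('UTF-8')
raised = false
begin
  utf8_tagged.scan(/\w/)
rescue ArgumentError
  raised = true
end

# To get real UTF-8 text, state the source encoding and transcode.
converted = body.dup.force_encoding('ISO-8859-1').encode('UTF-8')
```

So the downloaded copy is not necessarily different from the local one; the bytes are just tagged differently.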

answered Sep 21 '22 00:09 by JLB