I'm using Rails 5 with Ruby 2.4 and scanning a document that I parsed with Nokogiri, looking case-insensitively for a link with specific text:
a_elt = doc ? doc.xpath('//a').detect { |node| /link[[:space:]]+text/i === node.text } : nil
After getting the HTML of my web page in content, I parse it into a Nokogiri doc using:
doc = Nokogiri::HTML(content)
The problem is, I'm getting
ArgumentError invalid byte sequence in UTF-8
on certain web pages when using the above regular expression.
2.4.0 :002 > doc.encoding
=> "UTF-8"
2.4.0 :003 > doc.xpath('//a').detect { |node| /individual[[:space:]]+results/i === node.text }
ArgumentError: invalid byte sequence in UTF-8
from (irb):3:in `==='
from (irb):3:in `block in irb_binding'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `upto'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `each'
from (irb):3:in `detect'
from (irb):3
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
from bin/rails:4:in `require'
from bin/rails:4:in `<main>'
Is there a way I can rewrite the above to automatically account for the encoding or weird characters and not flip out?
Your question may already have been answered. Have you tried the methods from "Is there any way to clean a file of "invalid byte sequence in UTF-8" errors in Ruby?"?
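Specifically, scrub! and gsub! are String methods, so they cannot be called on the Nokogiri doc itself. Before parsing content into doc, try removing the invalid bytes and the control characters other than newlines from the string:
content.scrub!("")
content.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")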
Remember, scrub! is a Ruby 2.1+ method.
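If cleaning the whole page up front is not an option, the same cleanup can be applied to each node's text just before matching. The following is a minimal sketch, not a definitive fix: clean_text is a hypothetical helper name, and the raw string only simulates what node.text might return on a problem page (the \xFF byte is an assumption chosen because it is never valid in UTF-8).

```ruby
# Hypothetical helper (not part of Nokogiri or the original post):
# returns a cleaned copy of the text that is safe to match against a regex.
def clean_text(text)
  valid = text.scrub("")                     # drop invalid byte sequences (Ruby 2.1+)
  valid.gsub(/[[:cntrl:]&&[^\n\r]]/, "")     # drop control characters except \n and \r
end

# Simulated node text: tagged as UTF-8 but containing a stray invalid byte.
raw = "Individual\xFF  Results".dup.force_encoding(Encoding::UTF_8)

pattern = /individual[[:space:]]+results/i

cleaned = clean_text(raw)
matched = pattern === cleaned

puts raw.valid_encoding?      # the original string is not valid UTF-8
puts cleaned.valid_encoding?  # the cleaned copy is
puts matched

# With a parsed document, the helper would be used inside the block, e.g.:
#   a_elt = doc.xpath('//a').detect { |node| pattern === clean_text(node.text) }
```

Note that the control-character filter also removes tabs, so link text whose words are separated only by a tab would no longer match; dropping the gsub step and keeping only scrub avoids that.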