I’m using Rails 4.2.7. I’m currently using the following logic to parse a doc with Nokogiri:
content.xpath("//pre[@class='text-results']").xpath('text()').to_s
In my HTML document, this content appears within my “text-results” block:
<pre class="text-results"><html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta name=Title content="<p><a href=http://mychiptime">
<meta name=Keywords content="">
<meta http-equiv=Content-Type content="text/html; charset=macintosh”>…
I include this section because my parsing dies with the following error:
Error during processing: unknown encoding name - macintosh
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:627:in `find'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:627:in `serialize'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:786:in `to_format'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:642:in `to_html'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:512:in `to_s'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:187:in `block in each'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:186:in `upto'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:186:in `each'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:218:in `map'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:218:in `to_s'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_import_service.rb:77:in `process_my_object_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_my_object_finder_service.rb:82:in `process_my_object_link'
/Users/davea/Documents/workspace/myproject/app/services/abstract_my_object_finder_service.rb:29:in `block in process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_my_object_finder_service.rb:28:in `each'
/Users/davea/Documents/workspace/myproject/app/services/abstract_my_object_finder_service.rb:28:in `process_data'
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:18:in `block in run_all_crawlers'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/activerecord-4.2.7.1/lib/active_record/relation/delegation.rb:46:in `each'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/activerecord-4.2.7.1/lib/active_record/relation/delegation.rb:46:in `each'
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:5:in `run_all_crawlers'
Is there any way to make Nokogiri ignore this unknown encoding? I’m trying to get the content inside the <pre> tag as text, so I don’t need it parsed further.
I'm on Mac El Capitan. Per the comment, here are my locale settings:
davea$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
Your HTML is invalid. You have a <pre> tag outside the <body> and, as a result, Nokogiri has to do fixups, which usually produces questionable results.
This is what Nokogiri has to say about the document:
doc.errors # => [#<Nokogiri::XML::SyntaxError: htmlParseStartTag: misplaced <html> tag>, #<Nokogiri::XML::SyntaxError: htmlParseStartTag: misplaced <head> tag>, #<Nokogiri::XML::SyntaxError: AttValue: " expected>, #<Nokogiri::XML::SyntaxError: Couldn't find end of Start Tag meta>]
doc.to_html # => "<pre class=\"text-results\">\n\n\n<meta name=\"Title\" content=\"<p><a href=http://mychiptime\">\n<meta name=\"Keywords\" content=\"\">\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=macintosh”>\n</head>\n\"></pre>"
The line in question, taken on its own, also confuses Nokogiri:
doc = Nokogiri::HTML::DocumentFragment.parse('<meta http-equiv=Content-Type content="text/html; charset=macintosh”>')
doc.errors # => [#<Nokogiri::XML::SyntaxError: AttValue: " expected>, #<Nokogiri::XML::SyntaxError: Couldn't find end of Start Tag meta>]
doc.to_html # => "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=macintosh”>\">"
Notice that Nokogiri doesn't recognize a closing curly-quote as a terminator for the string content="text/html; charset=macintosh”.
You can't fix this within Nokogiri. You'll need to supply the appropriate structure yourself and run a search and replace to convert the curly quotes to straight quotes before parsing the document. Hopefully the document doesn't also contain curly quotes in the text inside the <body>, or the replacement will alter that text, which might be a problem for your use.
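As a rough illustration of that search-and-replace step, here is a minimal sketch. The helper name straighten_quotes and the CURLY_QUOTES table are made up for this example, and it assumes the HTML is already a UTF-8 string. It replaces every curly quote it finds, including any in body text, so it carries the caveat mentioned above.

```ruby
# Hypothetical helper (not part of Nokogiri): map typographic quotes to
# their ASCII equivalents so attribute values end with a character the
# HTML parser recognizes. Assumes the input string is UTF-8.
CURLY_QUOTES = {
  "\u201C" => '"', # left double quotation mark
  "\u201D" => '"', # right double quotation mark
  "\u2018" => "'", # left single quotation mark
  "\u2019" => "'"  # right single quotation mark
}.freeze

def straighten_quotes(html)
  # String#gsub with a Hash replaces each match with the value stored
  # under the matched text.
  html.gsub(/[\u201C\u201D\u2018\u2019]/, CURLY_QUOTES)
end

raw   = "<meta http-equiv=Content-Type content=\"text/html; charset=macintosh\u201D>"
clean = straighten_quotes(raw)
puts clean
# prints: <meta http-equiv=Content-Type content="text/html; charset=macintosh">
```

The cleaned string, rather than the raw one, is what you would then hand to Nokogiri for parsing.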
The fact that you have curly quotes in places they shouldn't exist is curious. If your editor is converting straight quotes to curly quotes, turn that feature off immediately, as it will cause real havoc with code. Good text editors for coding won't even offer curly quotes because of the problems they cause.
Nokogiri is complaining about the "macintosh" sequence as far as I can tell.
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse('<meta http-equiv=Content-Type content="text/html; charset=macintosh">')
doc.at('meta')['content'] # => "text/html; charset=macintosh"
If the HTML is clean, Nokogiri doesn't care.