
What is the best way to parse HTML in Google Apps Script?

var page = UrlFetchApp.fetch(contestURL);
var doc = XmlService.parse(page);

The above code throws a parse error. However, if I replace the XmlService class with the deprecated Xml class, with the lenient flag set, it parses the HTML properly.

var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);

The problem seems to be caused mostly by the lack of CDATA sections around the JavaScript in the HTML; the parser complains with the following error:

The entity name must immediately follow the '&' in the entity reference. 

Even if I remove all the <script>(.*?)</script> blocks using a regex, it still complains because the <br> tags aren't closed. Is there a clean way of parsing HTML into a DOM tree?

copperhead asked Oct 18 '13 at 17:10




2 Answers

I ran into this exact same problem. I was able to circumvent it by first using the deprecated Xml.parse, since it still works, then selecting the body XmlElement, and then passing its XML string to the new XmlService.parse method:

var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
var bodyHtml = doc.html.body.toXmlString();
doc = XmlService.parse(bodyHtml);
var root = doc.getRootElement();

Note: This solution may not work if the old Xml.parse is completely removed from Google Apps Script.
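If the deprecated Xml service is not available, one possible fallback is to pre-clean the markup so that XmlService.parse will accept it. The sketch below targets the two failures mentioned in the question (bare '&' characters and unclosed <br> tags). toXmlFriendly is a hypothetical helper name, not part of Apps Script, and regex cleanup like this is brittle: it will not cope with unquoted attributes, unclosed <p> or <li> tags, or other malformed HTML.

```javascript
// Sketch only: regex-based cleanup, not a real HTML parser.
// toXmlFriendly is a hypothetical helper, not an Apps Script API.
function toXmlFriendly(html) {
  return html
    // Drop the doctype, comments, and script/style blocks (these often hold raw '&' and '<').
    .replace(/<!DOCTYPE[^>]*>/gi, '')
    .replace(/<!--[\s\S]*?-->/g, '')
    .replace(/<(script|style)\b[\s\S]*?<\/\1\s*>/gi, '')
    // Escape every '&' that does not start a numeric or predefined XML entity.
    // Other named entities such as &nbsp; end up as literal text.
    .replace(/&(?!#\d+;|#x[0-9a-f]+;|(?:amp|lt|gt|quot|apos);)/gi, '&amp;')
    // Self-close common void elements such as <br> and <img ...>.
    .replace(/<(br|hr|img|input|meta|link)\b([^>]*?)\s*\/?>/gi, '<$1$2/>');
}

// Assumed usage inside Apps Script (not runnable outside it):
// var page = UrlFetchApp.fetch(contestURL);
// var doc = XmlService.parse(toXmlFriendly(page.getContentText()));
```

Whether the result parses still depends on how well-formed the rest of the page is, so treat this as a starting point rather than a general solution.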

Justin Bicknell answered Sep 22 '22 at 21:09



In 2021, the best way to parse HTML on the .gs side that I know of is...

  1. In the Apps Script editor, click + next to Libraries
  2. Enter the script ID 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
  3. Click "Look up"
  4. Click Add
  5. Sample usage:
const contentText = UrlFetchApp.fetch('https://www.somesite.com/').getContentText();
const $ = Cheerio.load(contentText);
$('.some-class').first().text();

That's it -- this is probably the closest we'll get to doing jQuery-like DOM selection in GAS. The .first() is important or else you may extract more content than you expected (think of it as using querySelector() instead of querySelectorAll()).
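To illustrate the point about .first(): without it, .text() returns the combined text of every matched element. If you want each match as a separate string instead, something like the following sketch may help. textOfEach is a hypothetical helper name, and $ is assumed to be the object returned by Cheerio.load as in the sample above.

```javascript
// Hypothetical helper: returns the trimmed text of each matched element separately.
// `$` is assumed to be the object returned by Cheerio.load(...).
function textOfEach($, selector) {
  return $(selector)
    .map(function (i, el) {
      // Wrap each raw element again so that .text() is available on it.
      return $(el).text().trim();
    })
    .get(); // Convert the Cheerio collection into a plain array.
}

// Assumed usage inside Apps Script:
// const titles = textOfEach($, '.some-class');
```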

Credit where credit is due: https://github.com/tani/cheeriogs

thdoan answered Sep 21 '22 at 21:09
