var page = UrlFetchApp.fetch(contestURL);
var doc = XmlService.parse(page);
The above code throws a parse error. However, if I replace the XmlService class with the deprecated Xml class, with the lenient flag set, it parses the HTML properly:
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
The problem is mostly caused by the JavaScript portion of the HTML not being wrapped in CDATA, and the parser complains with the following error:
The entity name must immediately follow the '&' in the entity reference.
Even if I remove all the <script>(.*?)</script> blocks using a regex, it still complains because the <br> tags aren't closed. Is there a clean way of parsing HTML into a DOM tree?
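One workaround along these lines is to sanitize the markup before handing it to the strict parser: strip the <script> blocks and self-close void tags such as <br>. This is a minimal sketch; the helper name and the list of void tags are my own, not from the original post, and regex-based cleanup like this is inherently fragile on messy HTML:

```javascript
// Hypothetical helper: make scraped HTML closer to well-formed XML
// before passing it to a strict parser such as XmlService.parse.
function sanitizeHtml(html) {
  return html
    // Drop <script>...</script> blocks, whose contents are rarely valid XML.
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    // Self-close common void tags so the XML parser sees them as closed.
    .replace(/<(br|hr|img|input|meta|link)\b([^>]*?)\/?>/gi, '<$1$2/>');
}
```

In Apps Script this would be applied to the fetched text, e.g. `XmlService.parse(sanitizeHtml(page.getContentText()))`, though it still fails on other malformations such as unescaped ampersands.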
If you just want to parse HTML, and your HTML is intended for the body of your document, you could do the following:

var div = document.createElement("DIV");
div.innerHTML = markup;
var result = div.childNodes;

This gives you a collection of child nodes and should work not just in IE8 but even in IE6-7.
html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers. html5lib is considered a good library for parsing HTML5, but a very slow one.
jsoup can parse HTML from files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS- and jQuery-like selectors. jsoup can also manipulate the content: the HTML elements themselves, their attributes, or their text.
I ran into this exact same problem. I was able to circumvent it by first using the deprecated Xml.parse, since it still works, then selecting the body XmlElement, then passing its XML string into the new XmlService.parse method:
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
var bodyHtml = doc.html.body.toXmlString();
doc = XmlService.parse(bodyHtml);
var root = doc.getRootElement();
Note: this solution may stop working if the old Xml.parse is completely removed from Google Apps Script.
In 2021, the best way to parse HTML on the .gs side that I know of is:
const contentText = UrlFetchApp.fetch('https://www.somesite.com/').getContentText();
const $ = Cheerio.load(contentText);
$('.some-class').first().text();
That's it -- this is probably the closest we'll get to doing jQuery-like DOM selection in GAS. The .first() is important, or else you may extract more content than you expected (think of it as using querySelector() instead of querySelectorAll()).
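The first-versus-all distinction matters whenever the page repeats the class. As a rough plain-JavaScript analogy (regex-based, no Cheerio required; the markup and class name here are invented for illustration, not a substitute for a real parser):

```javascript
const html = '<span class="price">$5</span><span class="price">$9</span>';

// All matches -- analogous to querySelectorAll(), or iterating a Cheerio
// selection with .each():
const all = [...html.matchAll(/<span class="price">(.*?)<\/span>/g)]
  .map(m => m[1]);

// First match only -- analogous to querySelector(), or Cheerio's .first():
const first = all[0];
```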
Credit where credit is due: https://github.com/tani/cheeriogs