var page = UrlFetchApp.fetch(contestURL);
var doc = XmlService.parse(page);
The code above throws a parse error. However, if I replace the XmlService class with the deprecated Xml class, with the lenient flag set, it parses the HTML properly.
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
The problem seems to be caused mostly by the JavaScript part of the HTML not being wrapped in CDATA. The parser complains with the following error:
The entity name must immediately follow the '&' in the entity reference.
Even if I remove all the <script>(.*?)</script> blocks using regex, it still complains because the <br> tags aren't closed.
Is there a clean way of parsing HTML into a DOM tree?
I ran into this exact same problem. I was able to circumvent it by first using the deprecated Xml.parse (since it still works), then selecting the body XmlElement, then passing its XML string into the new XmlService.parse method:
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
var bodyHtml = doc.html.body.toXmlString();
doc = XmlService.parse(bodyHtml);
var root = doc.getRootElement();
Note: This solution will stop working if the old Xml.parse is ever completely removed from Google Apps Script.
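If Xml.parse does go away, one possible fallback is to pre-clean the markup yourself before handing it to XmlService.parse. The sketch below targets only the failures described in the question (script blocks, bare '&' characters, unclosed void tags such as <br>). The function name toXmlFriendly and its regexes are my own illustration, not part of any Google API, and real-world HTML can still break it: HTML-only entities such as &nbsp; are left alone and are not defined in XML, and unclosed <p> or <li> tags are not handled.

```javascript
// Hypothetical helper: the name and regexes are illustrative assumptions,
// not part of Apps Script. It only covers the cases named in the question.
function toXmlFriendly(html) {
  return html
    // Drop <script> blocks, whose unescaped '&' and '<' trip up XML parsers.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '')
    // Escape any '&' that does not already start an entity like &amp; or &#160;.
    .replace(/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#x[0-9a-fA-F]+);)/g, '&amp;')
    // Self-close common void elements such as <br> and <img ...>.
    .replace(/<(br|hr|img|input|meta|link)\b([^>]*?)\s*\/?>/gi, '<$1$2/>');
}
```

In a script you would then try something like XmlService.parse(toXmlFriendly(page.getContentText())), keeping in mind that it may still throw on pages with other kinds of malformed markup.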
In 2021, the best way to parse HTML on the .gs side that I know of is...
const contentText = UrlFetchApp.fetch('https://www.somesite.com/').getContentText();
const $ = Cheerio.load(contentText);
$('.some-class').first().text();
That's it. This is probably the closest we'll get to doing jQuery-like DOM selection in GAS. The .first() call is important; without it you may extract more content than you expected (think of it as using querySelector() instead of querySelectorAll()).
Credit where credit is due: https://github.com/tani/cheeriogs