Just wondering if anyone knows of a web-scraping library that takes advantage of Scala's succinct syntax. So far, I've found Chafe, but this seems poorly-documented and maintained. I'm wondering if anyone out there has done scraping with Scala and has advice. (I'm trying to integrate into an existing Scala framework rather than use a scraper written in, say, Python.)
First there is a plethora of HTML scraping libs in JVM all you need to do is pimp one of them (pimp my library pattern).
The four I have used are:
I have used Selenium but never for scraping. Scala has a wrapper around selenium.
I would recommend pimping an existing Java library over some half baked Scala lib.
I don't have a Scala-specific recommendation, but for the JVM in general I've had good success with:
The Tagsoup route actually works quite well with Scala since Scala's built-in XML "dsl" is pretty concise (if you can forgive its perf issues and occasional API weirdness). Also, Tagsoup will handle nearly any garbage document you give it. It also has niceties like built-in understanding of many HTML entities that other SAXParsers will choke on as being undeclared.
tl;dr - JSoup + CSS selectors if possible, otherwise Tagsoup + scala XML. If slow is ok, tagsoup first, then jsoup the result.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With