I'm looking for the Clojure/Java equivalent to Python's lxml library.
I've used it a ton in the past for parsing all sorts of html (as a replacement for BeautifulSoup) and it's great to be able to use the same elementtree api for xml as well -- really a trusted friend! Can anyone recommend a similar Java/Clojure library?
About lxml
lxml is an xml and html processing library based off of libxml2. It handles broken html pages very well so it is excellent for screen scraping tasks. It also implements the ElementTree api, so the xml/html structure is represented as a tree object with full support for xpath and css selectors among other things.
It also has some really handy utility functions such as the "cleaner" module which will strip out unwanted tags from the "soup" (ie script tags, style tags, etc...).
So it is simple to use, robust, and VERY fast...!
lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers. This is when the lxml library comes to play.
lxml module of Python is an XML toolkit that is basically a Pythonic binding of the following two C libraries: libxlst and libxml2. lxml module is a very unique and special module of Python as it offers a combination of XML features and speed.
lxml has been downloaded from the Python Package Index millions of times and is also available directly in many package distributions, e.g. for Linux or macOS.
lxml aims to provide a Pythonic API by following as much as possible the ElementTree API. We're trying to avoid inventing too many new APIs, or you having to learn new things -- XML is complicated enough.
Enlive: http://github.com/cgrand/enlive
I've used it for screen-scraping and it works quite well for that. It uses a CSS selector like syntax for getting at elements in the document.
For Java (and thus usable from Clojure) is the tagsoup
-library, which, like lxml
, is a tolerant parser for faulty SGML-variants.
Clojure has a bundled namespace clojure.xml
, but this will only work with valid XML.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With