
Clojure equivalent to Python's lxml library?

I'm looking for the Clojure/Java equivalent to Python's lxml library.

I've used it a ton in the past for parsing all sorts of HTML (as a replacement for BeautifulSoup), and it's great to be able to use the same ElementTree API for XML as well. It's really a trusted friend! Can anyone recommend a similar Java/Clojure library?

About lxml

lxml is an XML and HTML processing library built on libxml2. It handles broken HTML pages very well, so it is excellent for screen-scraping tasks. It also implements the ElementTree API, so the XML/HTML structure is represented as a tree object with full support for XPath and CSS selectors, among other things.

It also has some really handy utility functions, such as the "cleaner" module, which strips unwanted tags out of the "soup" (i.e. script tags, style tags, etc.).

So it is simple to use, robust, and VERY fast!
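
For anyone who hasn't used the ElementTree API described above, here is a minimal sketch of the style of code in question. It uses Python's standard-library `xml.etree.ElementTree`, whose API lxml follows, so that it runs without installing anything. The sample markup and the `post_links` helper are made up for illustration. The standard-library module accepts only well-formed input and a limited XPath subset; lxml adds a tolerant HTML parser and full XPath on top of the same tree model.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
SAMPLE = (
    "<html><body>"
    "<div class='post'><a href='/a'>First</a></div>"
    "<div class='post'><a href='/b'>Second</a></div>"
    "<div class='ad'><a href='/c'>Sponsored</a></div>"
    "</body></html>"
)


def post_links(markup):
    """Return (href, text) pairs for links inside <div class='post'>."""
    root = ET.fromstring(markup)
    # Path expression: any <div> with class 'post', then its direct <a> children.
    links = root.findall(".//div[@class='post']/a")
    return [(a.get("href"), a.text) for a in links]


if __name__ == "__main__":
    for href, text in post_links(SAMPLE):
        print(href, text)
```

The tree-plus-path-expression workflow shown here is what I'd like to find an equivalent for on the JVM.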

asked Oct 14 '09 21:10 by erikcw



2 Answers

Enlive: http://github.com/cgrand/enlive

I've used it for screen scraping and it works quite well for that. It uses a CSS-selector-like syntax for getting at elements in the document.

answered Oct 26 '22 20:10 by dnolen


For Java (and thus usable from Clojure), there is the TagSoup library, which, like lxml, is a tolerant parser for faulty SGML variants.

Clojure has a bundled namespace, clojure.xml, but it only works with well-formed XML.
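
To illustrate why that distinction matters for scraping, here is a rough sketch in Python (the asker's language), using only the standard library. It is an analogy and does not show TagSoup's or clojure.xml's actual API: `xml.etree.ElementTree` stands in for a strict XML parser and `html.parser.HTMLParser` for a tolerant one. The `BROKEN` fragment and the helper names are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical scraped fragment: <p> and <br> are never closed.
BROKEN = '<div><p>Hello<br>world <a href="/next">next</a></div>'


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class LinkCollector(HTMLParser):
    """Tolerant, event-based parser that records href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


def tolerant_links(markup):
    """Collect link targets without requiring well-formed input."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    print("strict parser accepts it:", parses_as_xml(BROKEN))
    print("links found by tolerant parser:", tolerant_links(BROKEN))
```

The strict parser rejects the fragment because of the unclosed tags, while the tolerant one still reaches the link. Real-world pages look like `BROKEN` more often than not, which is why a tolerant parser is the better fit for scraping.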

answered Oct 26 '22 19:10 by pmf