I'm looking for a library to parse HTML files in OCaml. Basically the equivalent of Jsoup/Beautiful Soup. The main requirement is being able to query the DOM with CSS selectors. Something in the form of
page.fetch("http://www.url.com")
page.find("#tag")
I had a need for something like this recently, so after seeing this question and reading the recommendations in the comments, I wrote a library "Lambda Soup" over the weekend for fun.
You will want to use a library like ocurl or Cohttp to retrieve the actual HTML. After you have it, you can do
html |> parse $ "#tag"
to do what is asked in the question. For other possibilities and the full signature, see the documentation. You may want to look at the documentation postprocessor or tests for a fairly thorough demonstration of usage and capabilities, including CSS support and extensions.
Per comments, Lambda Soup uses Ocamlnet's HTML parser. Lambda Soup uses Markup.ml. Otherwise, it has no dependencies, except OUnit if you wish to run the tests. I'm happy for any feedback, including about modifying the interface (it is at an early stage) or discussions of adding an HTTP downloader to the library (which seems iffy because it greatly alters the scope of the library as it now is, but I am happy to hear arguments).
The license is BSD.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With