
Parsing HTML with OCaml

Tags: html, ocaml

I'm looking for a library to parse HTML files in OCaml, basically the equivalent of Jsoup or Beautiful Soup. The main requirement is being able to query the DOM with CSS selectors, something of the form:

page.fetch("http://www.url.com")
page.find("#tag")

asked Nov 03 '15 00:11 by gidim


1 Answer

I needed something like this recently, so after seeing this question and reading the recommendations in the comments, I wrote a library, Lambda Soup, over the weekend for fun.

You will want to use a library like ocurl or Cohttp to retrieve the actual HTML. After you have it, you can do

html |> parse $ "#tag"

to do what is asked in the question. For other possibilities and the full signature, see the documentation. You may want to look at the documentation postprocessor or tests for a fairly thorough demonstration of usage and capabilities, including CSS support and extensions.
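As a slightly fuller sketch of the one-liner above, the following shows the three selection operators on an inline document. The sample HTML and the binding names are made up for illustration, and it assumes the lambdasoup package is installed (for example via opam):

```ocaml
(* Assumes the lambdasoup opam package; the HTML below is invented sample data. *)
open Soup

let html =
  {|<html><body>
      <p id="tag">Hello</p>
      <ul><li class="item">one</li><li class="item">two</li></ul>
    </body></html>|}

(* Parse the string into a document node. *)
let soup = parse html

(* [$] selects the first element matching a CSS selector and raises an
   exception when nothing matches. [leaf_text] returns the text of a
   node that contains only text, as an option. *)
let first_text : string option = soup $ "#tag" |> leaf_text

(* [$?] is the non-raising variant: it returns an option instead. *)
let missing = soup $? "#absent"

(* [$$] selects every matching element; [to_list] makes a plain list. *)
let items : string list =
  soup $$ "li.item" |> to_list |> List.filter_map leaf_text
```

Here `$?` is probably the safer default when the selector may not match, since `$` fails with an exception in that case.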

Per comments, Lambda Soup at first used Ocamlnet's HTML parser. It now uses Markup.ml instead. Otherwise, it has no dependencies, except OUnit if you wish to run the tests. I'm happy to receive any feedback, including about changing the interface (it is at an early stage) or about adding an HTTP downloader to the library. The latter seems iffy because it would greatly widen the scope of the library as it now stands, but I am happy to hear arguments.
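Since downloading is left to another library, here is a minimal sketch of combining Cohttp with Lambda Soup. It assumes the cohttp-lwt-unix, lwt, and lambdasoup packages; the names `fetch`, `select_text`, and `main` are hypothetical helpers, not part of either library, and the URL is the placeholder from the question. Redirects, HTTP status codes, and errors are not handled:

```ocaml
(* Assumes opam packages: lambdasoup, cohttp-lwt-unix, lwt. *)
open Lwt.Infix

(* Download the body at [url] as a string. No redirect or status handling. *)
let fetch (url : string) : string Lwt.t =
  Cohttp_lwt_unix.Client.get (Uri.of_string url) >>= fun (_response, body) ->
  Cohttp_lwt.Body.to_string body

(* Pure step: parse [html] and return the text of the first element
   matching [selector], or None if nothing matches. *)
let select_text (selector : string) (html : string) : string option =
  let open Soup in
  match parse html $? selector with
  | Some node -> leaf_text node
  | None -> None

(* Hypothetical entry point; call [main ()] to perform the request. *)
let main () =
  let text =
    Lwt_main.run (fetch "http://www.url.com" >|= select_text "#tag")
  in
  print_endline (Option.value text ~default:"(no match)")
```

Keeping the network step separate from the parsing step means `select_text` can be exercised on plain strings without any HTTP access.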

The license is BSD.

answered Nov 14 '22 21:11 by antron