I'm using Jsoup to parse an HTML string in Clojure. The source is as follows.
Dependencies
:dependencies [[org.clojure/clojure "1.10.1"]
               [org.jsoup/jsoup "1.13.1"]]
Source code
(require '[clojure.string :as str])
(import '(org.jsoup Jsoup))
(def HTML (str "<html><head><title>Website title</title></head>
<body><p>Sample paragraph number 1 </p>
<p>Sample paragraph number 2</p>
</body></html>"))
(defn fetch_html [html]
  (let [soup (Jsoup/parse html)
        titles (.title soup)
        paragraphs (.getElementsByTag soup "p")]
    {:title titles :paragraph paragraphs}))
(fetch_html HTML)
Expected result
{:title "Website title",
:paragraph ["Sample paragraph number 1"
"Sample paragraph number 2"]}
Unfortunately, the actual result is different:
user ==> (fetch_html HTML)
{:title "Website title", :paragraph []}
(.getElementsByTag ...) returns a collection of Element objects, not strings, so you need to call the .text method on each element to get its text value. I'm using Jsoup 1.13.1.
(ns core
  (:import (org.jsoup Jsoup))
  (:require [clojure.string :as str]))

(def HTML (str "<html><head><title>Website title</title></head>
<body><p>Sample paragraph number 1 </p>
<p>Sample paragraph number 2</p>
</body></html>"))

(defn fetch_html [html]
  (let [soup (Jsoup/parse html)
        titles (.title soup)
        paragraphs (.getElementsByTag soup "p")]
    {:title titles :paragraph (mapv #(.text %) paragraphs)}))

(fetch_html HTML)
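As a variation on the same idea, here is a minimal sketch that uses a CSS selector instead of getElementsByTag and lets Jsoup collect the text itself. The function name fetch-html-select is made up for this example, and it assumes Jsoup 1.13.1 is on the classpath as in the question. Elements has an .eachText method that returns the text of each matched element, so no explicit mapv is needed.

```clojure
(import '(org.jsoup Jsoup))

(defn fetch-html-select
  "Hypothetical variant of fetch_html that uses a CSS selector."
  [html]
  (let [soup (Jsoup/parse html)]
    {:title     (.title soup)
     ;; .select takes a CSS query and returns an Elements collection;
     ;; .eachText returns a java.util.List with the text of each matched
     ;; element that has text, which vec turns into a Clojure vector.
     :paragraph (vec (.eachText (.select soup "p")))}))

(fetch-html-select
  (str "<html><head><title>Website title</title></head>"
       "<body><p>Sample paragraph number 1 </p>"
       "<p>Sample paragraph number 2</p></body></html>"))
```

Note that .eachText skips elements with no text at all, so an empty <p></p> would not show up in the vector. If you need one entry per element regardless, stick with mapv and .text.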
Also consider using Reaver, a Clojure library that wraps Jsoup, or one of the other wrappers suggested in the other answers.
I have a Clojure wrapper for TagSoup that might be useful. Try running it in this template project. To use it in your project, add the line:
[tupelo "21.01.05"]
to your :dependencies in project.clj.
The code example:
(ns tst.demo.core
  (:use demo.core tupelo.core tupelo.test)
  (:require
    [tupelo.parse.tagsoup :as tagsoup]))

(dotest
  (let [html "<html>
<head><title>Website title</title></head>
<body><p>Sample paragraph number 1 </p>
<p>Sample paragraph number 2</p>
</body></html>"]
    (is= (tagsoup/parse html)
      {:tag     :html,
       :attrs   {},
       :content [{:tag     :head,
                  :attrs   {},
                  :content [{:tag :title, :attrs {}, :content ["Website title"]}]}
                 {:tag     :body,
                  :attrs   {},
                  :content [{:tag :p, :attrs {}, :content ["Sample paragraph number 1 "]}
                            {:tag :p, :attrs {}, :content ["Sample paragraph number 2"]}]}]})))
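To get from that Enlive-style tree to the map the question asks for, something like the following sketch could work. It uses only clojure.core and clojure.string; the helper names nodes-with-tag, node-text and title-and-paragraphs are invented for this example and are not part of tupelo. The tree is pasted in as a literal so the snippet runs on its own.

```clojure
(require '[clojure.string :as str])

;; The parsed tree from the test above, pasted in as a literal.
(def parsed
  {:tag     :html,
   :attrs   {},
   :content [{:tag     :head,
              :attrs   {},
              :content [{:tag :title, :attrs {}, :content ["Website title"]}]}
             {:tag     :body,
              :attrs   {},
              :content [{:tag :p, :attrs {}, :content ["Sample paragraph number 1 "]}
                        {:tag :p, :attrs {}, :content ["Sample paragraph number 2"]}]}]})

(defn nodes-with-tag
  "Returns every node in the tree whose :tag equals tag (hypothetical helper)."
  [tree tag]
  ;; tree-seq walks maps depth-first via :content; string children are leaves.
  (filter #(= tag (:tag %))
          (tree-seq map? :content tree)))

(defn node-text
  "Joins the direct string children of a node and trims the result."
  [node]
  (str/trim (apply str (filter string? (:content node)))))

(defn title-and-paragraphs
  [tree]
  {:title     (node-text (first (nodes-with-tag tree :title)))
   :paragraph (mapv node-text (nodes-with-tag tree :p))})

(title-and-paragraphs parsed)
;; => {:title "Website title",
;;     :paragraph ["Sample paragraph number 1" "Sample paragraph number 2"]}
```

node-text only looks at direct string children, so text inside nested tags such as <p>some <b>bold</b> text</p> would need a recursive version.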
Details
If you look at the source code, you can readily see why you want to use a wrapper function!
(ns tupelo.parse.tagsoup
  (:use tupelo.core)
  (:require
    [schema.core :as s]
    [tupelo.parse.xml :as xml]
    [tupelo.string :as ts]
    [tupelo.schema :as tsk]))

(s/defn ^:private tagsoup-parse-fn
  [input-source :- org.xml.sax.InputSource
   content-handler]
  (doto (org.ccil.cowan.tagsoup.Parser.)
    (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/default-attributes" false)
    (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/cdata-elements" true)
    (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace" true)
    (.setContentHandler content-handler)
    (.setProperty "http://www.ccil.org/~cowan/tagsoup/properties/auto-detector"
      (proxy [org.ccil.cowan.tagsoup.AutoDetector] []
        (autoDetectingReader [^java.io.InputStream is]
          (java.io.InputStreamReader. is "UTF-8"))))
    (.setProperty "http://xml.org/sax/properties/lexical-handler" content-handler)
    (.parse input-source)))

; #todo make use string input: (ts/string->stream html-str)
(s/defn parse-raw :- tsk/KeyMap
  "Loads and parse an HTML resource and closes the input-stream."
  [html-str :- s/Str]
  (xml/parse-raw-streaming
    (org.xml.sax.InputSource.
      (ts/string->stream html-str))
    tagsoup-parse-fn))

; #todo make use string input: (ts/string->stream html-str)
(s/defn parse :- tsk/KeyMap
  "Loads and parse an HTML resource and closes the input-stream."
  [html-str :- s/Str]
  (xml/enlive-remove-whitespace
    (xml/enlive-normalize
      (parse-raw
        html-str))))