I'm new to Clojure, and my first project has to deal with a huge (250+ GB) XML file. I want to load it into PostgreSQL for later processing, but I have no idea how to approach a file that big.
I used the new clojure.data.xml to process a 31 GB Wikipedia dump on a modest laptop. The old lazy-xml contrib library did not work for me (it ran out of memory).

https://github.com/clojure/data.xml
Simplified example code:
(require '[clojure.data.xml :as data.xml])

(defn process-page [page]
  ;; ...
  )

(defn page-seq [rdr]
  ;; data.xml/parse returns the root element with lazily parsed
  ;; :content, so the dump is never fully realized in memory.
  (->> (:content (data.xml/parse rdr))
       (filter #(= :page (:tag %)))
       (map process-page)))
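For this to stay memory-bounded, the reader has to remain open for the whole traversal, and the head of the lazy seq must not be retained. A minimal driver sketch — `load-dump!`, `xml->rows`, and `insert-row!` are hypothetical names for illustration; you would pass the `page-seq` above as `xml->rows` and your PostgreSQL insert as `insert-row!`:

```clojure
(require '[clojure.java.io :as io])

(defn load-dump!
  "Open `path`, turn it into a lazy seq of rows with `xml->rows`
   (e.g. the `page-seq` above), and call `insert-row!` on each one.
   `doseq` consumes the seq one element at a time without holding
   its head, so memory use stays bounded."
  [path xml->rows insert-row!]
  (with-open [rdr (io/reader path)]
    (doseq [row (xml->rows rdr)]
      (insert-row! row))))
```

Note the insert happens *inside* `with-open`: if you returned the lazy seq out of `with-open` instead, the reader would be closed before the seq was consumed.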
Processing huge XML files is usually done with SAX; in Clojure's case this is http://richhickey.github.com/clojure-contrib/lazy-xml-api.html

See (parse-seq File/InputStream/URI).
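The SAX style can also be used directly from Clojure with only the JDK, with no contrib dependency. A sketch — `count-pages` is a hypothetical example, not part of any library; it counts `<page>` elements via callbacks without ever building a tree:

```clojure
(import '[javax.xml.parsers SAXParserFactory]
        '[org.xml.sax.helpers DefaultHandler]
        '[org.xml.sax Attributes]
        '[java.io ByteArrayInputStream])

(defn count-pages
  "Push-parse `in` with the JDK's SAX parser, incrementing a counter
   on every <page> start tag. Only one event is in memory at a time."
  [^java.io.InputStream in]
  (let [n (atom 0)
        handler (proxy [DefaultHandler] []
                  (startElement [uri local-name qname ^Attributes attrs]
                    (when (= "page" qname)
                      (swap! n inc))))]
    (.parse (.newSAXParser (SAXParserFactory/newInstance)) in handler)
    @n))

(count-pages (ByteArrayInputStream. (.getBytes "<m><page/><page/></m>")))
;; => 2
```

The trade-off versus data.xml's lazy tree is that SAX inverts control: your code receives events instead of walking a seq, which is more awkward but gives the same bounded-memory behavior.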