Huge XML in Clojure

Tags:

xml

clojure

I'm new to Clojure, and my first project has to deal with a huge (250+ GB) XML file. I want to load it into PostgreSQL to process it later, but I have no idea how to approach such a big file.

trzewiczek asked Mar 30 '12 08:03

2 Answers

I used the new clojure.data.xml to process a 31GB Wikipedia dump on a modest laptop. The old lazy-xml contrib library did not work for me (ran out of memory).

https://github.com/clojure/data.xml

Simplified example code:

(require '[clojure.data.xml :as data.xml])

(defn process-page [page]
  ;; ...
  )

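(defn page-seq
  "Returns a lazy seq of processed :page elements read from rdr.
  Consume it while rdr is still open."
  [rdr]
  (->> (:content (data.xml/parse rdr))
       (filter #(= :page (:tag %)))
       (map process-page)))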
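
Since the sequence is lazy, it has to be consumed while the reader is still open, and without holding on to its head. A minimal sketch of one way to drive it in batches, for instance to feed database inserts, could look like the following. The names `page-elements`, `process-in-batches!` and `handle-batch!` are hypothetical and not part of clojure.data.xml; `handle-batch!` stands in for whatever side-effecting function writes a batch out.

```clojure
(require '[clojure.data.xml :as data.xml]
         '[clojure.java.io :as io])

(defn page-elements
  "Lazy seq of the :page children of the root element read from rdr."
  [rdr]
  (->> (:content (data.xml/parse rdr))
       (filter #(= :page (:tag %)))))

(defn process-in-batches!
  "Opens source (anything clojure.java.io/reader accepts) and calls
  handle-batch!, a hypothetical side-effecting function, on vectors of
  up to batch-size :page elements. doseq walks the sequence without
  retaining its head, and all work happens before the reader closes."
  [source batch-size handle-batch!]
  (with-open [rdr (io/reader source)]
    (doseq [batch (partition-all batch-size (page-elements rdr))]
      (handle-batch! (vec batch)))))

;; Usage with a tiny in-memory document instead of a real dump:
(process-in-batches!
  (java.io.StringReader. "<wiki><page>a</page><page>b</page><page>c</page></wiki>")
  2
  (fn [batch] (println "got" (count batch) "pages")))
```

For a real dump, `source` would be a file path or an input stream, and the batch size is something to tune against the database.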
Justin Kramer answered Sep 20 '22 21:09

Processing huge XML files is usually done with SAX. In Clojure, that is the lazy-xml contrib library: http://richhickey.github.com/clojure-contrib/lazy-xml-api.html

See (parse-seq File/InputStream/URI).
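
To illustrate the SAX style itself, here is a minimal sketch that uses the JDK's built-in SAX parser through Java interop rather than lazy-xml. The function name `count-elements` is made up for this example. The parser pushes events to a handler as it reads, so the whole document is never held in memory.

```clojure
(import '(javax.xml.parsers SAXParserFactory)
        '(org.xml.sax InputSource)
        '(org.xml.sax.helpers DefaultHandler)
        '(java.io StringReader))

(defn count-elements
  "Streams the XML in input-source and counts start tags named tag-name.
  Hypothetical helper for illustration only."
  [^InputSource input-source tag-name]
  (let [n       (atom 0)
        ;; The handler receives one callback per start tag.
        handler (proxy [DefaultHandler] []
                  (startElement [uri local-name q-name attrs]
                    (when (= q-name tag-name)
                      (swap! n inc))))]
    (.parse (.newSAXParser (SAXParserFactory/newInstance))
            input-source
            handler)
    @n))

;; Usage with a small in-memory document:
(count-elements
  (InputSource. (StringReader. "<wiki><page/><page/><other/></wiki>"))
  "page")
```

A real handler would also override `characters` and `endElement` to collect each record's text before handing it on.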

zmila answered Sep 23 '22 21:09