I am trying to parse a fairly small (< 100 MB) XML file with:
(require '[clojure.data.xml :as xml]
         '[clojure.java.io :as io])
(xml/parse (io/reader "data/small-sample.xml"))
and I am getting an error:
OutOfMemoryError Java heap space
clojure.lang.Numbers.byte_array (Numbers.java:1216)
clojure.tools.nrepl.bencode/read-bytes (bencode.clj:101)
clojure.tools.nrepl.bencode/read-netstring* (bencode.clj:153)
clojure.tools.nrepl.bencode/read-token (bencode.clj:244)
clojure.tools.nrepl.bencode/read-bencode (bencode.clj:254)
clojure.tools.nrepl.bencode/token-seq/fn--3178 (bencode.clj:295)
clojure.core/repeatedly/fn--4705 (core.clj:4642)
clojure.lang.LazySeq.sval (LazySeq.java:42)
clojure.lang.LazySeq.seq (LazySeq.java:60)
clojure.lang.RT.seq (RT.java:484)
clojure.core/seq (core.clj:133)
clojure.core/take-while/fn--4236 (core.clj:2564)
Here is my project.clj:
(defproject dats "0.1.0-SNAPSHOT"
  ...
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.xml "0.0.7"]
                 [criterium "0.4.1"]]
  :jvm-opts ["-Xmx1g"])
I tried setting LEIN_JVM_OPTS and JVM_OPTS in my .bash_profile, without success.
When I tried the following project.clj:
(defproject barber "0.1.0-SNAPSHOT"
  ...
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.xml "0.0.7"]
                 [criterium "0.4.1"]]
  :jvm-opts ["-Xms128m"])
I get the following error:
Error occurred during initialization of VM
Incompatible minimum and maximum heap sizes specified
Exception in thread "Thread-5" clojure.lang.ExceptionInfo: Subprocess failed {:exit-code 1}
Any idea how I could increase the heap size for my Leiningen REPL?
Thanks.
Any form evaluated at the top level of the REPL is realized in full by the print step of the Read-Eval-Print Loop. The result is also retained on the heap, so that you can later access it via *1.
If you instead store the return value in a var:
(def parsed (xml/parse (io/reader "data/small-sample.xml")))
this returns immediately, even for a file hundreds of megabytes in size (I have verified this locally), because the parse is lazy. You can then walk the clojure.data.xml.Element tree that is returned; nodes are realized from the input stream only as you reach them.
If you do not hold on to the elements you have already visited (by binding them somewhere that keeps them reachable), you can iterate over the entire structure without using much more RAM than it takes to hold a single node of the XML tree, as sketched below.
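For instance, here is a minimal sketch of that pattern (count-children is a hypothetical helper, and the path is the one from your question): because the parse result is only referenced inside the function, each child element becomes garbage as soon as the reduce moves past it.
(require '[clojure.data.xml :as xml]
         '[clojure.java.io :as io])

;; Hypothetical helper: counts the top-level child elements of an XML file.
;; The parse result never escapes this scope, so visited nodes can be
;; garbage-collected while the reduce is still walking the content seq.
(defn count-children [path]
  (with-open [rdr (io/reader path)]
    (reduce (fn [n _] (inc n))
            0
            (:content (xml/parse rdr)))))

(count-children "data/small-sample.xml")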
user> (time (def n (xml/parse (clojure.java.io/reader "/home/justin/clojure/ok/data.xml"))))
"Elapsed time: 0.739795 msecs"
#'user/n
user> (time (keys n))
"Elapsed time: 0.025683 msecs"
(:tag :attrs :content)
user> (time (-> n :tag))
"Elapsed time: 0.031224 msecs"
:catalog
user> (time (-> n :attrs))
"Elapsed time: 0.136522 msecs"
{}
user> (time (-> n :content first))
"Elapsed time: 0.095145 msecs"
#clojure.data.xml.Element{:tag :book, :attrs {:id "bk101"}, :content (#clojure.data.xml.Element{:tag :author, :attrs {}, :content ("Gambardella, Matthew")} #clojure.data.xml.Element{:tag :title, :attrs {}, :content ("XML Developer's Guide")} #clojure.data.xml.Element{:tag :genre, :attrs {}, :content ("Computer")} #clojure.data.xml.Element{:tag :price, :attrs {}, :content ("44.95")} #clojure.data.xml.Element{:tag :publish_date, :attrs {}, :content ("2000-10-01")} #clojure.data.xml.Element{:tag :description, :attrs {}, :content ("An in-depth look at creating applications \n with XML.")})}
user> (time (-> n :content count))
"Elapsed time: 48178.512106 msecs"
459000
user> (time (-> n :content count))
"Elapsed time: 86.931114 msecs"
459000
;; redefining n so that we can test performance without the parsing already forced when we counted
user> (time (def n (xml/parse (clojure.java.io/reader "/home/justin/clojure/ok/data.xml"))))
"Elapsed time: 0.702885 msecs"
#'user/n
user> (time (doseq [el (take 100 (drop 100 (-> n :content)))] (println (:tag el))))
:book
:book
.... ;; output truncated
"Elapsed time: 26.019374 msecs"
nil
user>
Notice that the huge delay occurs only the first time I ask for the count of n's content, because that forces the whole file to be parsed; the second count is fast because the realized tree is by then cached in memory. Doseq-ing across a subsection of the structure completes very quickly, since only those nodes need to be realized.
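Putting this together, here is a hedged sketch of a full streaming pass (assuming the catalog/book/title structure visible in the output above): it prints every book's title, and because nothing outside the doseq holds a reference to the tree, each book element can be collected as soon as the loop moves past it.
(require '[clojure.data.xml :as xml]
         '[clojure.java.io :as io])

;; Assumes the <catalog> -> <book> -> <title> layout shown above.
(with-open [rdr (io/reader "data/small-sample.xml")]
  (doseq [book (:content (xml/parse rdr))
          el   (:content book)
          :when (= :title (:tag el))]
    (println (first (:content el)))))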