I'm writing a Clojure library to parse Mac OS X's XML-based property list files. The code works fine unless you give it a large input file, at which point you get java.lang.OutOfMemoryError: Java heap space
.
Here's an example input file (small enough to work fine):
<plist version="1.0">
<dict>
<key>Integer example</key>
<integer>5</integer>
<key>Array example</key>
<array>
<integer>2</integer>
<real>3.14159</real>
</array>
<key>Dictionary example</key>
<dict>
<key>Number</key>
<integer>8675309</integer>
</dict>
</dict>
</plist>
clojure.xml/parse
turns this into:
{:tag :plist, :attrs {:version "1.0"}, :content [
{:tag :dict, :attrs nil, :content [
{:tag :key, :attrs nil, :content ["Integer example"]}
{:tag :integer, :attrs nil, :content ["5"]}
{:tag :key, :attrs nil, :content ["Array example"]}
{:tag :array, :attrs nil, :content [
{:tag :integer, :attrs nil, :content ["2"]}
{:tag :real, :attrs nil, :content ["3.14159"]}
]}
{:tag :key, :attrs nil, :content ["Dictionary example"]}
{:tag :dict, :attrs nil, :content [
{:tag :key, :attrs nil, :content ["Number"]}
{:tag :integer, :attrs nil, :content ["8675309"]}
]}
]}
]}
My code turns this into the Clojure data structure
{"Dictionary example" {"Number" 8675309},
"Array example" [2 3.14159],
"Integer example" 5}
The relevant part of my code looks like
; extract the content contained within e.g. <integer>...</integer>
(defn- first-content
[c]
(first (c :content)))
; return a parsed version of the given tag
(defmulti content (fn [c] (c :tag)))
(defmethod content :array
[c]
(apply vector (for [item (c :content)] (content item))))
(defmethod content :dict
[c]
(apply hash-map (for [item (c :content)] (content item))))
(defmethod content :integer
[c]
(Long. (first-content c)))
(defmethod content :key
[c]
(first-content c))
(defmethod content :real
[c]
(Double. (first-content c)))
; take a java.io.File (or similar) and return the parsed version
(defn parse-plist
[source]
(content (first-content (clojure.xml/parse source))))
The meat of the code is the content
function, a multimethod that dispatches on the :tag (the name of the XML tag). I'm wondering whether there is something different I should be doing in order to make this recursion work better. I tried replacing all three calls to content
with trampoline content
, but that didn't work. Is there anything fancy I should do to get this mutual recursion to work more efficiently? Or am I taking a fundamentally wrong approach?
Edit: By the way, this code is available on GitHub, in which form it might be easier to play around with.
You have multiple (one per child) recursive calls from a single method so your code isn't (and can't be without a heavy reorg)tail-recursive. trampoline
is intended for mutual tail-recursive functions.
How deep, how long is your large XML file? I'm asking because you are getting an OoM not a SO.
Anyway, to solve your recursion problem (which is unlikely to be the one causing the exception) you have to walk down your XML datastructure (eg with xml-zip
) while maintaining a stack (vector or list) representing your result tree under construction.
It's ironic that the traversal of the XML datastructure is somewhat equivalent to the sax events which were used to build the structure.
Heavy recursion will cause a StackOverflowException
, not an OutOfMemoryError
. Also the recursion does not seem to be very deep here (only 3 levels as per the XML file in your example).
My guess is, the OutOfMemoryError
is being thrown because the data structure your large XML files are being parsed into are too large to fit in the JVM heap. You can try increasing the heap size using -Xms
and -Xmx
options. However, the correct way to parse huge XML files is to use SAX events rather than building a tree (DOM or Clojure data structure).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With