
Huge file in Clojure and Java heap space error

I posted before about a huge XML file - it's a 287GB XML Wikipedia dump that I want to put into a CSV file (revision authors and timestamps). I managed to do that up to a point. Before, I got a StackOverflow error, but now, after solving the first problem, I get a java.lang.OutOfMemoryError: Java heap space error.

My code (partly taken from Justin Kramer's answer) looks like this:

(defn process-pages
  [page]
  (let [title     (article-title page)
        revisions (filter #(= :revision (:tag %)) (:content page))]
    (for [revision revisions]
      (let [user (revision-user revision)
            time (revision-timestamp revision)]
        (spit "files/data.csv"
              (str "\"" time "\";\"" user "\";\"" title "\"\n" )
              :append true)))))

(defn open-file
  [file-name]
  (let [rdr (BufferedReader. (FileReader. file-name))]
    (->> (:content (data.xml/parse rdr :coalescing false))
         (filter #(= :page (:tag %)))
         (map process-pages))))

I don't show the article-title, revision-user and revision-timestamp functions, because they simply take data from a specific place in the page or revision hash. Could anyone help me with this? I'm really new to Clojure and don't understand the problem.

asked Apr 02 '12 by trzewiczek


2 Answers

Just to be clear, (:content (data.xml/parse rdr :coalescing false)) IS lazy. Check its class or pull the first item (it will return instantly) if you're not convinced.
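
For example, a quick REPL sketch (file name hypothetical) that does both checks without realizing the whole sequence:

;; Neither expression walks the whole file; only the beginning of the stream
;; is consumed, so both return almost immediately even on a 287GB dump.
(with-open [rdr (java.io.BufferedReader. (java.io.FileReader. "dump.xml"))]
  (let [content (:content (data.xml/parse rdr :coalescing false))]
    [(class content) (class (first content))]))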

That said, a couple things to watch out for when processing large sequences: holding onto the head, and unrealized/nested laziness. I think your code suffers from the latter.
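
In miniature, the nested-laziness trap looks like this (hypothetical out.txt file; the same shape as the code above, a lazy for returned from a function that is mapped over a lazy seq):

;; Nothing is written to out.txt here: the outer map and the inner for are
;; both lazy, so the spit calls are never executed.
(def pending
  (map (fn [n]
         (for [i (range n)]
           (spit "out.txt" i :append true)))
       [1 2 3]))

;; Even (dorun pending) only forces the outer seq; the inner for-seqs stay
;; unrealized, which is why doseq (eager, made for side effects) is a better fit.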

Here's what I recommend:

1) Add (dorun) to the end of the ->> chain of calls. This will force the sequence to be fully realized without holding onto the head.

2) Change the for in process-pages to doseq. You're spitting to a file, which is a side effect, and you don't want to do that lazily here.

As Arthur recommends, you may want to open an output file once and keep writing to it, rather than opening & writing (spit) for every Wikipedia entry.
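
Putting all three suggestions together, a minimal sketch of the original code with just those changes (the article-title, revision-user and revision-timestamp helpers are assumed to exist as in the question) might look like:

(defn process-pages
  [w page]
  (let [title     (article-title page)
        revisions (filter #(= :revision (:tag %)) (:content page))]
    ;; doseq is eager, so the writes happen as the revisions are walked
    (doseq [revision revisions]
      (let [user (revision-user revision)
            time (revision-timestamp revision)]
        (.write w (str "\"" time "\";\"" user "\";\"" title "\"\n"))))))

(defn open-file
  [file-name]
  (with-open [rdr (BufferedReader. (FileReader. file-name))
              w   (java.io.BufferedWriter. (java.io.FileWriter. "files/data.csv"))]
    (->> (:content (data.xml/parse rdr :coalescing false))
         (filter #(= :page (:tag %)))
         (map #(process-pages w %))
         (dorun))))   ; realize everything without holding onto the head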

UPDATE:

Here's a rewrite which attempts to separate concerns more clearly:

(defn filter-tag [tag xml]
  (filter #(= tag (:tag %)) xml))

;; lazy
(defn revision-seq [xml]
  (for [page (filter-tag :page (:content xml))
        :let [title (article-title page)]
        revision (filter-tag :revision (:content page))
        :let [user (revision-user revision)
              time (revision-timestamp revision)]]
    [time user title]))

;; eager
(defn transform [in out]
  ;; assumes clojure.java.io is required as io and clojure.data.xml as data.xml
  (with-open [r (io/input-stream in)
              w (io/writer out)]
    ;; bind *out* to the writer w (not the filename) so println writes to the file
    (binding [*out* w]
      (let [xml (data.xml/parse r :coalescing false)]
        (doseq [[time user title] (revision-seq xml)]
          ;; println adds the newline, so no trailing "\n" in the string
          (println (str "\"" time "\";\"" user "\";\"" title "\"")))))))

(transform "dump.xml" "data.csv")

I don't see anything here that would cause excessive memory use.

answered by Justin Kramer


Unfortunately data.xml/parse is not lazy; it attempts to read the whole file into memory and then parse it.

Instead, use this (lazy) XML library, which holds only the part it is currently working on in RAM. You will then need to restructure your code to write the output as it reads the input, instead of gathering all the XML and then outputting it.

Your line

(:content (data.xml/parse rdr :coalescing false))

will load all the XML into memory and then request the content key from it, which will blow the heap.

A rough outline of a lazy answer would look something like this:

(with-open [input  (java.io.FileInputStream. "/tmp/foo.xml")
            output (java.io.FileWriter. "/tmp/foo.csv")]
  ;; doseq is eager, so everything is written before with-open closes the streams
  (doseq [element (filter is-the-tag-i-want? (parse input))]
    (write-to-file output element)))

Have patience, working with (> data ram) always takes time :)

answered by Arthur Ulfeldt