I'm doing what I thought was a fairly straightforward task: run a sql query (over about 65K rows of data) using sqlkorma library (http://sqlkorma.com), and for each row transforming it in some way, and then writing to CSV file. I don't really think that 65K rows is all that large given that I have a 8GB laptop, but I also assumed that a sql result set would be lazily fetched and so the whole thing would never get held in memory at the same time. So I was really really surprised when I ended up with this stack trace:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at clojure.lang.PersistentHashMap$BitmapIndexedNode.assoc(PersistentHashMap.java:777)
at clojure.lang.PersistentHashMap.createNode(PersistentHashMap.java:1101)
at clojure.lang.PersistentHashMap.access$600(PersistentHashMap.java:28)
at clojure.lang.PersistentHashMap$BitmapIndexedNode.assoc(PersistentHashMap.java:749)
at clojure.lang.PersistentHashMap$TransientHashMap.doAssoc(PersistentHashMap.java:269)
at clojure.lang.ATransientMap.assoc(ATransientMap.java:64)
at clojure.lang.PersistentHashMap.create(PersistentHashMap.java:56)
at clojure.lang.PersistentHashMap.create(PersistentHashMap.java:100)
at clojure.lang.PersistentArrayMap.createHT(PersistentArrayMap.java:61)
at clojure.lang.PersistentArrayMap.assoc(PersistentArrayMap.java:201)
at clojure.lang.PersistentArrayMap.assoc(PersistentArrayMap.java:29)
at clojure.lang.RT.assoc(RT.java:702)
at clojure.core$assoc.invoke(core.clj:187)
at clojure.core$zipmap.invoke(core.clj:2715)
at clojure.java.jdbc$resultset_seq$thisfn__204.invoke(jdbc.clj:243)
at clojure.java.jdbc$resultset_seq$thisfn__204$fn__205.invoke(jdbc.clj:243)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.Cons.next(Cons.java:39)
at clojure.lang.PersistentVector.create(PersistentVector.java:51)
at clojure.lang.LazilyPersistentVector.create(LazilyPersistentVector.java:31)
at clojure.core$vec.invoke(core.clj:354)
at korma.db$exec_sql$fn__343.invoke(db.clj:203)
at clojure.java.jdbc$with_query_results_STAR_.invoke(jdbc.clj:669)
at korma.db$exec_sql.invoke(db.clj:202)
at korma.db$do_query$fn__351.invoke(db.clj:225)
at clojure.java.jdbc$with_connection_STAR_.invoke(jdbc.clj:309)
at korma.db$do_query.invoke(db.clj:224)
at korma.core$exec.invoke(core.clj:474)
at db$query_db.invoke(db.clj:23)
at main$_main.doInvoke(main.clj:32)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
As far as I can tell from the stack, it has not made it outside the query code (meaning it hasn't reached my transformation/write to CSV code at all). If it matters, my sql is fairly straightforward, basically SELECT * FROM my_table WHERE SOME_ID IS NOT NULL AND ROWNUM < 65000 ORDER BY some_id ASC. This is oracle (to explain rownum above), but I don' think that matters.
EDIT:
Code sample:
(defmacro query-and-print [q] `(do (dry-run ~q) ~q))
(defn query-db []
(query-and-print
(select my_table
(where (and (not= :MY_ID "BAD DATA")
(not= :MY_ID nil)
(raw (str "rownum < " rows))))
(order :MY_ID :asc))))
; args contains rows 65000, and configure-app sets up the jdbc
; connection string, and sets a var with rows value
(defn -main [& args]
(when (configure-app args)
(let [results (query-db)
dedup (dedup-with-merge results)]
(println "Result size: " (count results))
(println "Dedup size: " (count dedup))
(to-csv "target/out.csv" (transform-data dedup)))))
clojure.java.sql creates lazy sequences:
(defn resultset-seq
"Creates and returns a lazy sequence of maps corresponding to
the rows in the java.sql.ResultSet rs. Based on clojure.core/resultset-seq
but it respects the current naming strategy. Duplicate column names are
made unique by appending _N before applying the naming strategy (where
N is a unique integer)."
[^ResultSet rs]
(let [rsmeta (.getMetaData rs)
idxs (range 1 (inc (.getColumnCount rsmeta)))
keys (->> idxs
(map (fn [^Integer i] (.getColumnLabel rsmeta i)))
make-cols-unique
(map (comp keyword *as-key*)))
row-values (fn [] (map (fn [^Integer i] (.getObject rs i)) idxs))
rows (fn thisfn []
(when (.next rs)
(cons (zipmap keys (row-values)) (lazy-seq (thisfn)))))]
(rows)))
Korma fully realizes the sequence by dropping each row to a vector:
(defn- exec-sql [{:keys [results sql-str params]}]
(try
(case results
:results (jdbc/with-query-results rs (apply vector sql-str params)
(vec rs))
:keys (jdbc/do-prepared-return-keys sql-str params)
(jdbc/do-prepared sql-str params))
(catch Exception e
(handle-exception e sql-str params))))
Besides the with-lazy-results route in https://github.com/korma/Korma/pull/66, as a completely different way to resovle the problem, you can simply increase the heap size available to your JVM by setting the appropriate flag. JVMs are not allowed to use all the free memory on your machine; they are strictly limited to the amount you tell them they're allowed to use.0
One way to do this is to set :jvm-opts ["-Xmx4g"] in your project.clj file. (Adjust the exact heap size as necessary.) Another way is to do something like:
export JAVA_OPTS=-Xmx:4g
lein repl # or whatever lanuches your Clojure process
The with-lazy-results route is better in the sense that you can operate on any sized result set, but it's not merged into mainline Korma and requires some updating to work with recent versions. It's good to know how to adjust the JVM's allowed heap size anyway.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With