Clojure - Speed up large file processing

I need to read a large file (~1 GB), process it, and save it to a database. My solution looks like this:

data.txt

format: [id],[title]\n

1,Foo
2,Bar
...

code

(ns test.core
  (:require [clojure.java.io :as io]
            [clojure.string :refer [split]]))

(defn parse-line
  [line]
  ;; limit the split to 2 fields so commas inside the title survive
  (let [values (split line #"," 2)]
    (zipmap [:id :title] values)))

(defn run
  []
  ;; "~" is not expanded by the JVM, so build the home path explicitly
  (with-open [reader (io/reader (str (System/getProperty "user.home")
                                     "/data.txt"))]
    (insert-batch (map parse-line (line-seq reader)))))

; insert-batch just saves a vector of records into the database
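
For reference, a minimal sketch of such an insert-batch using clojure.java.jdbc (the db-spec and the :items table name are assumptions for illustration):

(require '[clojure.java.jdbc :as jdbc])

(def db-spec {:dbtype "mysql" :dbname "test"
              :user "user" :password "pass"}) ; placeholder credentials

(defn insert-batch
  [records]
  ;; the cols+rows arity of insert-multi! issues a single multi-row INSERT
  (jdbc/insert-multi! db-spec :items
                      [:id :title]
                      (map (juxt :id :title) records)))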

But this code does not work well, because it first parses all the lines and then sends them to the database in a single huge batch.

I think the ideal solution would be: read a line -> parse it -> collect 1000 parsed lines -> batch-insert them into the database -> repeat until no lines are left. Unfortunately, I have no idea how to implement this.

asked Apr 22 '15 by user1518183

1 Answer

One suggestion:

  • Use line-seq to get a lazy sequence of lines,

  • use map to parse each line,

(so far this matches what you are doing)

  • use partition-all to partition your lazy sequence of parsed lines into batches, and then

  • use insert-batch with doseq to write each batch to the database.

And an example:

(->> (line-seq reader)
     (map parse-line)
     (partition-all 1000)
     (#(doseq [batch %]
         (insert-batch batch))))
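
Putting it together with the question's run function, the whole pipeline might look like this (a sketch; insert-batch is the question's own function):

(defn run
  []
  (with-open [reader (io/reader (str (System/getProperty "user.home")
                                     "/data.txt"))]
    (doseq [batch (->> (line-seq reader)
                       (map parse-line)
                       (partition-all 1000))]
      ;; every batch is consumed inside with-open, because line-seq is
      ;; lazy and the reader is closed as soon as the body exits
      (insert-batch batch))))

Because doseq realizes one batch at a time, only about 1000 parsed lines are held in memory at once. On Clojure 1.7+, (run! insert-batch batches) is an equivalent, slightly terser alternative to the doseq.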
answered Sep 28 '22 by bsvingen