Clojure - process huge files with low memory

I am processing text files of 60 GB or larger. The files are separated into a header section of variable length and a data section. I have three functions:

  • head?, a predicate to distinguish header lines from data lines
  • process-header, which processes one header line string
  • process-data, which processes one data line string

The processing functions asynchronously access and modify an in-memory database.

I adapted a file-reading method from another SO thread, which should build a lazy sequence of lines. The idea was to process some lines with one function, then switch functions once and keep processing with the other.

(defn lazy-file
  [file-name]
  (letfn [(helper [rdr]
            (lazy-seq
             (if-let [line (.readLine rdr)]
               (cons line (helper rdr))
               ;; end of file: close the reader and end the seq
               (do (.close rdr) nil))))]
    (try
      (helper (clojure.java.io/reader file-name))
      (catch Exception e
        ;; note: returns nil after printing the message
        (println "Exception while trying to open file" file-name)))))

I use it with something like

(let [lfile (lazy-file "my-file.txt")]
  (doseq [line lfile :while (head? line)]
    (process-header line))
  (doseq [line (drop-while head? lfile)]
    (process-data line)))

Although that works, it's rather inefficient for a couple of reasons:

  • Instead of simply calling process-header until I reach the data and then continuing with process-data, I have to filter the header lines and process them, then restart parsing the whole file and drop all header lines in order to process the data. This is the exact opposite of what lazy-file was intended to do.
  • Watching memory consumption shows me that the program, though seemingly lazy, builds up to use as much RAM as it would take to keep the whole file in memory.

So what is a more efficient, idiomatic way to work with my database?

One idea might be using a multimethod to process header and data lines depending on the value of the head? predicate, but I suppose this would have some serious speed impact, especially as there is only one occurrence where the predicate outcome changes from always true to always false. I didn't benchmark that yet.
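A minimal, hypothetical sketch of that idea, reusing the head?, process-header and process-data functions described above:

(defmulti process-line
  ;; dispatch on the outcome of head?, coerced so the dispatch
  ;; value is exactly true or false
  (fn [line] (boolean (head? line))))

(defmethod process-line true  [line] (process-header line))
(defmethod process-line false [line] (process-data line))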

Would it be better to build the line seq another way and consume it with iterate? I guess this would still leave me needing :while and drop-while.

In my research, NIO file access was mentioned a couple of times as something that should improve memory usage. I have not yet found out how to use it in an idiomatic way in Clojure.
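For reference, one possible entry point I came across (a sketch only, assuming Java 8+ so that java.nio.file.Files/lines is available):

(require '[clojure.java.io :as io])

;; Files/lines returns a java.util.stream.Stream<String>;
;; the stream must be closed, hence the with-open
(with-open [stream (java.nio.file.Files/lines (.toPath (io/file "my-file.txt")))]
  (doseq [line (iterator-seq (.iterator stream))]
    ;; placeholder processing
    (println line)))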

Maybe I still have a bad grasp of the general idea of how the file should be treated?

As always, any help, ideas or pointers to tuts are greatly appreciated.

asked Dec 17 '15 by waechtertroll

1 Answer

You should use the standard library functions: line-seq, with-open and doseq will easily do the job.

The memory growth you see comes from holding on to the head of the lazy sequence: the lfile binding in your let retains every realized line, so by the end of the second pass the whole file sits in RAM. A single pass that never keeps a reference to the head runs in constant memory.

Something along the lines of:

(with-open [rdr (clojure.java.io/reader file-path)]
  (doseq [line (line-seq rdr)]
    (if (head? line)
      (process-header line)
      (process-data line))))
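If the per-line head? check worries you (it only flips once, as you noted), you can stop calling it after the first data line. A sketch with the same assumed functions:

(with-open [rdr (clojure.java.io/reader file-path)]
  (loop [lines (line-seq rdr)]
    (when-let [line (first lines)]
      (if (head? line)
        (do (process-header line)
            (recur (rest lines)))
        ;; first data line reached: consume the rest eagerly with
        ;; run!, which never calls head? again and does not retain
        ;; the head of the sequence
        (run! process-data lines)))))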
answered Oct 04 '22 by kawas44