
Read a very large text file into a list in Clojure

Tags: file, text, clojure

What is the best way to read a very large file (e.g., a text file with 100,000 names, one per line) into a list in Clojure, loading it lazily as needed?

Basically I need to do all sorts of string searches on these items (currently I do that with grep and regexes in shell scripts).

I tried adding '( at the beginning and ) at the end, but apparently this method (loading a static/constant list) has a size limitation for some reason.

Asked Nov 07 '10 by Ali

3 Answers

There are various ways of doing this, depending on exactly what you want.

If you have a function that you want to apply to each line in a file, you can use code similar to Abhinav's answer:

(with-open [rdr ...]
  (doall (map function (line-seq rdr))))

This has the advantage that the file is opened, processed, and closed as quickly as possible, but forces the entire file to be consumed at once.
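
For concreteness, a complete version of this pattern might look like the following (count-line-lengths is a made-up name, and the path is just an example):

(defn count-line-lengths [file]
  ;; open, fully process, and close the file before returning
  (with-open [rdr (clojure.java.io/reader file)]
    (doall (map count (line-seq rdr)))))

(count-line-lengths "/etc/passwd")
;=> a fully realized seq of line lengths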

If you want to delay processing of the file you might be tempted to return the lines, but this won't work:

(map function ; broken!!!
    (with-open [rdr ...]
        (line-seq rdr)))

because the file is closed when with-open returns, which is before you lazily process the file.

One way around this is to pull the entire file into memory with slurp. Note that slurp returns a single string, so you need to split it into lines before mapping:

(map function (clojure.string/split-lines (slurp filename)))

That has an obvious disadvantage - memory use - but guarantees that you don't leave the file open.

An alternative is to leave the file open until you get to the end of the read, while generating a lazy sequence:

(ns ...
  (:use clojure.test))

(defn stream-consumer [stream]
  (println "read" (count stream) "lines"))

(defn broken-open [file]
  (with-open [rdr (clojure.java.io/reader file)]
    (line-seq rdr)))

(defn lazy-open [file]
  (defn helper [rdr]
    (lazy-seq
      (if-let [line (.readLine rdr)]
        (cons line (helper rdr))
        ;; at EOF: close the reader and end the sequence
        (do (.close rdr) (println "closed") nil))))
  (lazy-seq
    ;; opening is itself deferred until the first element is requested
    (do (println "opening")
      (helper (clojure.java.io/reader file)))))

(deftest test-open
  (try
    (stream-consumer (broken-open "/etc/passwd"))
    (catch RuntimeException e
      (println "caught " e)))
  (let [stream (lazy-open "/etc/passwd")]
    (println "have stream")
    (stream-consumer stream)))

(run-tests)

Which prints:

caught  #<RuntimeException java.lang.RuntimeException: java.io.IOException: Stream closed>
have stream
opening
closed
read 29 lines

Showing that the file wasn't even opened until it was needed.

This last approach has the advantage that you can process the stream of data "elsewhere" without keeping everything in memory, but it also has an important disadvantage - the file is not closed until the end of the stream is read. If you are not careful you may open many files in parallel, or even forget to close them (by not reading the stream completely).
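
One way to get the safety of with-open without realizing every line in memory is to do the whole computation inside the with-open scope, so no lazy sequence escapes it. A sketch (count-matching-lines and the regex are illustrative, not from a library):

(defn count-matching-lines [file re]
  ;; the reduction runs while the reader is open; only the count escapes
  (with-open [rdr (clojure.java.io/reader file)]
    (count (filter #(re-find re %) (line-seq rdr)))))

(count-matching-lines "/etc/passwd" #"bash")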

The best choice depends on the circumstances - it's a trade-off between lazy evaluation and limited system resources.

PS: Is lazy-open defined somewhere in the libraries? I arrived at this question trying to find such a function and ended up writing my own, as above.

Answered by andrew cooke

Andrew's solution worked well for me, but nested defns are not idiomatic, and you don't need to do lazy-seq twice. Here is an updated version that drops the extra prints and uses letfn:

(defn lazy-file-lines [file]
  (letfn [(helper [rdr]
            (lazy-seq
              (if-let [line (.readLine rdr)]
                (cons line (helper rdr))
                (do (.close rdr) nil))))]
    (helper (clojure.java.io/reader file))))

(count (lazy-file-lines "/tmp/massive-file.txt"))
;=> <a large integer>
Answered by JohnJ


You need to use line-seq. An example from clojuredocs:

;; Count lines of a file (loses head):
user=> (with-open [rdr (clojure.java.io/reader "/etc/passwd")]
         (count (line-seq rdr)))

But with a lazy list of strings, you cannot efficiently perform operations that require the whole list to be present, such as sorting. If you can implement your operations as a filter or map, you can consume the list lazily; otherwise an embedded database may be a better fit.
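
For example, the grep-style searches from the question can be written as a filter and realized while the file is still open (the path and pattern are placeholders):

(with-open [rdr (clojure.java.io/reader "/tmp/names.txt")]
  (doall (filter #(re-find #"^Ali" %) (line-seq rdr))))
;=> all lines starting with "Ali", realized before the file closes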

Also note that you should not hold on to the head of the list; otherwise the whole list will be retained in memory.
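
A sketch of the difference, reusing lazy-file-lines from the previous answer:

;; fine: nothing holds the head, so lines can be garbage-collected
;; as count walks the sequence
(count (lazy-file-lines "/tmp/massive-file.txt"))

;; risky: lines is used again after count, so it cannot be cleared
;; while count realizes the sequence, and every line stays in memory
(let [lines (lazy-file-lines "/tmp/massive-file.txt")]
  [(count lines) (first lines)])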

Furthermore, if you need to perform more than one operation, you'll have to read the file again each time. Be warned: laziness can make things difficult sometimes.

Answered by Abhinav Sarkar