
Lazily extract lines from large file

I'm trying to grab 5 lines by their line numbers from a large (> 1GB) file with Clojure. I'm almost there but am seeing some strange things, and I want to understand what's going on.

So far I've got:

(defn multi-nth [values indices]
  (map (partial nth values) indices))

(defn read-lines [file indices]
  (with-open [rdr (clojure.java.io/reader file)]
    (let [lines (line-seq rdr)]
      (multi-nth lines indices))))

Now, (read-lines "my-file" [0]) works without a problem. However, passing in [0 1] gives me the following stacktrace:

java.lang.RuntimeException: java.io.IOException: Stream closed
        Util.java:165 clojure.lang.Util.runtimeException
      LazySeq.java:51 clojure.lang.LazySeq.sval
      LazySeq.java:60 clojure.lang.LazySeq.seq
         Cons.java:39 clojure.lang.Cons.next
          RT.java:769 clojure.lang.RT.nthFrom
          RT.java:742 clojure.lang.RT.nth
         core.clj:832 clojure.core/nth
         AFn.java:163 clojure.lang.AFn.applyToHelper
         AFn.java:151 clojure.lang.AFn.applyTo
         core.clj:602 clojure.core/apply
        core.clj:2341 clojure.core/partial[fn]
      RestFn.java:408 clojure.lang.RestFn.invoke
        core.clj:2430 clojure.core/map[fn]

It seems that the stream is being closed before I can read the second line from the file. Interestingly, if I manually pull out a line from the file with something like (nth lines 200), the multi-nth call works for all values <= 200.

Any idea what's going on?

David J. asked Aug 16 '12 21:08




2 Answers

Both map and line-seq return lazy sequences, so the lines are not necessarily read before with-open returns and closes the file.

Basically, you need to realize the whole return value before with-open returns, which you can do with doall:

(defn multi-nth [values indices]
  (map (partial nth values) indices))

(defn read-lines [file indices]
  (with-open [rdr (clojure.java.io/reader file)]
    (let [lines (line-seq rdr)]
      (doall (multi-nth lines indices)))))

Or something like that. Keep in mind that your multi-nth holds on to the head of the line seq while it searches for the specified lines, so every line up to the last requested one stays in memory. Using nth like that also means you step through the line seq from the start once for each index. You'll want to fix that.

Update:

Something like this will work. It's a little uglier than I'd like, but I think it shows the principle. Note that indices needs to be a set here.

(defn multi-nth [values indices]
  (keep
    (fn [[number line]]
      (when (contains? indices number)
        line))
    (map-indexed vector values)))

(multi-nth '(a b c d e) #{2 3})
;; => (c d)
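
Putting the pieces together, here is a rough sketch of how read-lines could look with the single-pass approach. The early stop via take and the name last-index are my own additions, not part of the original question; the idea is only to avoid reading the rest of a large file once the highest requested line has been seen. Note that lines come back in file order and duplicate indices are collapsed, because the indices are turned into a set.

```clojure
(require '[clojure.java.io :as io])

(defn read-lines
  "Returns the lines of file at the given zero-based indices, in file order.
  Indices past the end of the file are silently ignored."
  [file indices]
  (let [wanted (set indices)]
    (if (empty? wanted)
      []
      ;; last-index is a hypothetical helper name: the highest requested line.
      (let [last-index (apply max wanted)]
        (with-open [rdr (io/reader file)]
          (->> (line-seq rdr)
               (map-indexed vector)
               ;; Stop reading once the last requested line has been reached.
               (take (inc last-index))
               (keep (fn [[n line]] (when (contains? wanted n) line)))
               ;; Realize the result while the reader is still open.
               doall))))))
```

Called as (read-lines "my-file" [0 1 200]), this reads at most 201 lines and only keeps the selected ones around, since nothing holds the head of the line seq.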
Joost Diepenmaat answered Oct 19 '22 03:10


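with-open closes the file once its body has been executed. multi-nth only returns an unrealized lazy sequence, so by the time anything consumes it the file is already closed, and you end up with a lazy sequence backed by a closed reader.

(read-lines "my-file" [0]) works because line-seq reads the first line as soon as it is called, so index 0 is already realized before the reader closes. Any later index forces a read from the closed stream. That is also why manually realizing (nth lines 200) inside the body makes every index <= 200 work.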
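
To see this without a 1GB file, here is a small sketch using an in-memory reader. It relies on line-seq reading its first line when it is called, which is how it is implemented in the Clojure versions I know of; the names leaked-lines, first-line and second-line-result are just for illustration.

```clojure
(import '(java.io BufferedReader StringReader))

;; Let the unrealized line seq escape from with-open, as in the question.
(def leaked-lines
  (with-open [rdr (BufferedReader. (StringReader. "first\nsecond\nthird"))]
    (line-seq rdr)))

;; The first line was read when line-seq was called, so this still works.
(def first-line (first leaked-lines))

;; The rest is lazy and has to read from the reader, which is closed by now.
(def second-line-result
  (try
    (second leaked-lines)
    (catch Exception e
      :stream-closed)))
```

Here first-line is "first", while asking for the second line ends up in the catch branch, which mirrors the "Stream closed" stacktrace from the question.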

To fix the issue, you need to force the sequence to be realized with doall:

(defn multi-nth [values indices]
  (doall (map (partial nth values) indices)))

For a very detailed explanation see https://stackoverflow.com/a/10462159/151650

DanLebrero answered Oct 19 '22 02:10