
How to use with-open properly without the stream getting closed before being consumed?

Tags:

clojure

I am writing my first Clojure program.

I am using clojure.data.csv to process a CSV file. The file is potentially large, so I want to exploit laziness. A minimal working example (MWE) that demonstrates my issue is shown below.

When I execute the load-data function, I get "IOException Stream closed", so it is clear to me that the lazy stream is being closed before the point of consumption.

I have looked over the documentation for data.csv (https://github.com/clojure/data.csv) and can see that one way to prevent the stream from being closed before consumption is to move the stream opening to the call stack where the stream is consumed. As far as I understand it, that is what I have done below, since (take 5) is within the confines of with-open. Clearly, I have a conceptual gap. I'd deeply appreciate any help!

(ns data-load.core
  (:gen-class)
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]))

(defn load-data [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (->> (csv/read-csv reader)
         (take 5))))
Asked Feb 10 '18 by Viswa V


2 Answers

As you said, what you're returning from load-data is a lazy sequence; by the time it's consumed, you've already left the scope of with-open. You just need to force the realization of the lazy sequence before returning it.

As far as I understand it, this is what I have done below since (take 5) is within the confines of with-open.

It is within the scope, but take also returns a lazy sequence! It has only wrapped one lazy sequence in another, and neither will be realized until after the with-open scope has exited.
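
Here is a quick REPL sketch of that pitfall (using iterate, which is unchunked, so elements are realized one at a time):

(def xs (take 3 (map #(do (println "realizing" %) %) (iterate inc 0))))
;; nothing is printed yet: take has only wrapped the lazy seq
(doall xs)
;; realizing 0
;; realizing 1
;; realizing 2
;;=> (0 1 2)

From the clojure.data.csv examples: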

(defn sum-second-column [filename]
  (with-open [reader (io/reader filename)]
    (->> (read-column reader 1)
         (drop 1)
         (map #(Double/parseDouble %))
         (reduce + 0)))) ;; this is the only non-lazy operation
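
For reference, the read-column helper used above is defined in the same data.csv examples, roughly like this:

(defn read-column [reader column-index]
  (let [data (csv/read-csv reader)]
    ;; pick out a single column from each parsed row
    (map #(nth % column-index) data)))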

The important observation here is that the final operation is reduce, which consumes the lazy sequence while the reader is still open. If you took reduce out and tried to consume the produced sequence from outside the function, you'd get the same "stream closed" exception.

One way to do this is to turn the sequence into a vector with vec, or to use doall, which will also force it to be realized:

(defn load-data [from]
  (with-open [reader (io/reader from)]
    (->> (csv/read-csv reader)
         (take 5)
         ;; other intermediate steps go here
         (doall))))

My file is potentially large and so I do want to exploit laziness.

You'll need a way to do all your work before the stream is closed, so you could pass load-data a function to apply to each row of the CSV:

(defn load-data [from f]
  (with-open [reader (io/reader from)]
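    ;; map f over each parsed row, realizing all results before the reader closes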
    (doall (map f (csv/read-csv reader)))))

For example, concatenating each row's values into a string (assuming an input.txt on the classpath whose rows are a,b,c and e,f,g):

(load-data (io/resource "input.txt")
           (partial apply str))
=> ("abc" "efg")
Answered Nov 13 '22 by Taylor Wood


If you want a lazy solution, then check out https://stackoverflow.com/a/13312151/954570 (all credit goes to the original authors https://stackoverflow.com/users/181772/andrew-cooke and https://stackoverflow.com/users/611752/johnj).

The idea is to manage the reader's open/close manually and keep the reader open until the sequence is exhausted. It comes with its own quirks, but it worked well for me (I needed to merge and process data from multiple large files that wouldn't fit in memory). A sketch of the idea follows.
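
A minimal sketch of that idea, adapted for CSV (the lazy-csv-rows name is mine; note that the reader leaks if the consumer never exhausts the sequence):

(defn lazy-csv-rows [filename]
  (let [reader (io/reader filename)
        ;; walk the parsed rows lazily, closing the reader only
        ;; once the sequence runs out
        step (fn step [rows]
               (lazy-seq
                 (if-let [s (seq rows)]
                   (cons (first s) (step (rest s)))
                   (do (.close reader) nil))))]
    (step (csv/read-csv reader))))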

Answered Nov 13 '22 by vitaly