Newbie transforming CSV files in Clojure

Tags: perl, clojure

I'm both new and old to programming -- mostly I just write a lot of small Perl scripts at work. Clojure came out just when I wanted to learn Lisp, so I'm trying to learn Clojure without knowing Java either. It's tough, but it's been fun so far.

I've seen several examples of similar problems to mine, but nothing that quite maps to my problem space. Is there a canonical way to extract lists of values for each line of a CSV file in Clojure?

Here's some actual working Perl code; comments included for non-Perlers:

# convert_survey_to_cartography.pl
open INFILE, "< coords.csv";       # Input format "Northing,Easting,Elevation,PointID"
open OUTFILE, "> coords.txt";      # Output format "PointID X Y Z".
while (<INFILE>) {                 # Read line by line; line bound to $_ as a string.
    chomp $_;                      # Strips out each line's <CR><LF> chars.
    @fields = split /,/, $_;       # Extract the line's field values into a list.
    $y = $fields[0];               # y = Northing
    $x = $fields[1];               # x = Easting
    $z = $fields[2];               # z = Elevation
    $p = $fields[3];               # p = PointID
    print OUTFILE "$p $x $y $z\n"  # New file, changed field order, different delimiter.
}

I've puzzled out a little bit in Clojure and tried to cobble it together in an imperative style:

; convert-survey-to-cartography.clj
(use 'clojure.contrib.duck-streams)
(let
   [infile "coords.csv" outfile "coords.txt"]
   (with-open [rdr (reader infile)]
     (def coord (line-seq rdr))
     ( ...then a miracle occurs... )
     (write-lines outfile ":x :y :z :p")))

I don't expect the last line to actually work, but it gets the point across. I'm looking for something along the lines of:

(def values (interleave (:p :y :x :z) (re-split #"," coord)))

Thanks, Bill

Bill_B asked Nov 17 '09 04:11


2 Answers

Please don't use nested defs. They don't do what you think they do: def is always global! For locals, use let instead. While the library functions are nice to know, here is a version showcasing some features of functional programming in general and Clojure in particular.

(import 'java.io.FileWriter 'java.io.FileReader 'java.io.BufferedReader)

(defn translate-coords

Docstrings can be queried in the REPL via (doc translate-coords). This works, e.g., for all core functions, so supplying one is a good idea.

  "Reads coordinates from infile, translates them with the given
  translator and writes the result to outfile."

translator is a (possibly anonymous) function that separates the translation from the surrounding boilerplate, so we can reuse this function with different transformation rules. The type hints here avoid reflection for the constructors.

  [translator #^String infile #^String outfile]

Open the files. with-open ensures that the files are closed when its body is left, whether by normally dropping off the bottom or via a thrown exception.

  (with-open [in  (BufferedReader. (FileReader. infile))
              out (FileWriter. outfile)]

We temporarily bind the *out* stream to the output file, so any print inside the binding will print to the file.

    (binding [*out* out]

map means: take the seq, apply the given function to every element, and return the seq of the results. #() is shorthand notation for an anonymous function. It takes one argument, which is filled in at the %. doseq is basically a loop over the input. Since we do this for the side effects (namely printing to a file), doseq is the right construct. Rule of thumb: map is lazy, so use it for results; doseq is eager, so use it for side effects.

      (doseq [coords (map #(.split % ",") (line-seq in))]

println takes care of the \n at the end of the line. interpose takes the seq and inserts its first argument (in our case " ") between the elements. (apply str [1 2 3]) is equivalent to (str 1 2 3) and is useful for constructing function calls dynamically. ->> is a relatively new macro in Clojure that helps a bit with readability. It means "take the first argument and add it as the last item of the next function call". The ->> form here is equivalent to: (println (apply str (interpose " " (translator coords)))). (Edit: Another note: since the separator is \space, we could just as well write (apply println (translator coords)) here, but the interpose version also lets us parametrize the separator, as we did with the translator function, while the short version would hardwire \space.)

        (->> (translator coords)
          (interpose " ")
          (apply str)
          println)))))

(defn survey->cartography-format
  "Translate coords in survey format to cartography format."

Here we use destructuring (note the double [[]]). It means the function's argument is something that can be turned into a seq, e.g. a vector or a list. The first element is bound to y, the second to x, and so on.

  [[y x z p]]
  [p x y z])

(translate-coords survey->cartography-format "survey_coords.txt" "cartography_coords.txt")

Here it is again, without the interruptions:

(import 'java.io.FileWriter 'java.io.FileReader 'java.io.BufferedReader)

(defn translate-coords
  "Reads coordinates from infile, translates them with the given
  translator and writes the result to outfile."
  [translator #^String infile #^String outfile]
  (with-open [in  (BufferedReader. (FileReader. infile))
              out (FileWriter. outfile)]
    (binding [*out* out]
      (doseq [coords (map #(.split % ",") (line-seq in))]
        (->> (translator coords)
          (interpose " ")
          (apply str)
          println)))))

(defn survey->cartography-format
  "Translate coords in survey format to cartography format."
  [[y x z p]]
  [p x y z])

(translate-coords survey->cartography-format "survey_coords.txt" "cartography_coords.txt")
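
To try the ->> pipeline and the destructuring without touching any files, the per-line work can be pulled out into a pure function. This is only a sketch: format-line is a hypothetical helper that is not part of the answer above, it uses clojure.string/split instead of the .split interop call, and the sample coordinates are made up.

```clojure
(require '[clojure.string :as string])

(defn survey->cartography-format
  "Reorders [northing easting elevation point-id] into [point-id x y z]."
  [[y x z p]]
  [p x y z])

(defn format-line
  "Hypothetical helper: splits one CSV line on commas, reorders the
  fields with translator, and joins them with separator."
  [translator separator line]
  (->> (string/split line #",")
       translator
       (interpose separator)
       (apply str)))

(format-line survey->cartography-format " " "100.5,200.25,12.0,P1")
;; => "P1 200.25 100.5 12.0"
```

Because the separator is an argument, the same function can emit tab-separated output by passing "\t" instead of " ".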

Hope this helps.

Edit: For CSV reading you probably want something like OpenCSV.
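
The reason a real CSV parser is worth considering shows up as soon as a field contains a quoted comma. The sketch below uses only clojure.string; naive-fields and both sample lines are hypothetical and exist just to illustrate the failure mode.

```clojure
(require '[clojure.string :as string])

(defn naive-fields
  "Splits a line on every comma, ignoring CSV quoting rules."
  [line]
  (string/split line #","))

;; Plain fields split as expected.
(def plain-line "100.5,200.25,12.0,P1")
(count (naive-fields plain-line))
;; => 4

;; A quoted field containing a comma is cut in two.
(def quoted-line "100.5,200.25,12.0,\"P1, north corner\"")
(count (naive-fields quoted-line))
;; => 5
```

If the survey files never contain quoted fields, splitting on commas is fine; otherwise the destructuring silently binds the wrong values.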

kotarak answered Oct 10 '22 13:10


Here's one way:

(use '(clojure.contrib duck-streams str-utils))
(with-out-writer "coords.txt"
  (doseq [line (read-lines "coords.csv")]
    ;; Input columns are Northing (y), Easting (x), Elevation (z), PointID (p).
    (let [[y x z p] (re-split #"," line)]
      (println (str-join \space [p x y z])))))

with-out-writer binds *out* so that everything you print goes to the filename or stream you specify, rather than to standard output.
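
The clojure.contrib libraries were later retired, and the duck-streams functionality moved into clojure.java.io (Clojure 1.2 and later). A rough equivalent of the same idea, assuming a newer Clojure, binds *out* by hand. convert-coords is a hypothetical name, and the field order assumes the question's input format of Northing, Easting, Elevation, PointID.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

(defn convert-coords
  "Hypothetical helper: reads CSV lines from in and prints the
  reordered, space-separated lines to out."
  [in out]
  (with-open [rdr (io/reader in)
              wtr (io/writer out)]
    ;; Everything printed inside the binding goes to wtr.
    (binding [*out* wtr]
      (doseq [line (line-seq rdr)]
        (let [[y x z p] (string/split line #",")]
          (println (string/join \space [p x y z])))))))

;; In-memory check, no files needed:
(let [out (java.io.StringWriter.)]
  (convert-coords (java.io.StringReader. "100.5,200.25,12.0,P1\n") out)
  (string/split-lines (str out)))
;; => ["P1 200.25 100.5 12.0"]
```

For real files, the call would be (convert-coords "coords.csv" "coords.txt"), since io/reader and io/writer also accept file names.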

Using def as you're using it isn't idiomatic. A better way is to use let. I'm using destructuring to assign the 4 fields of each line to 4 let-bound names; then you can do what you want with those.
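
A small illustration of the difference, with made-up names (leaky, tidy, leaked-total, local-total): a def inside a function body still creates a namespace-level var, while a let binding stays local.

```clojure
(defn leaky []
  ;; def inside a function still creates a namespace-level var.
  (def leaked-total 42)
  leaked-total)

(defn tidy []
  ;; let creates a local that exists only inside this form.
  (let [local-total 42]
    local-total))

(leaky)                   ; => 42
(resolve 'leaked-total)   ; a var: the name is now visible namespace-wide
(tidy)                    ; => 42
(resolve 'local-total)    ; => nil, the local never leaked
```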

If you're iterating over something for the purpose of side-effects (e.g. I/O) you should usually go for doseq. If you wanted to collect up each line into a hash-map and do something with them later, you could use for:

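;; for is lazy and prints nothing, so no output writer is needed here.
(for [line (read-lines "coords.csv")]
  (let [fields (re-split #"," line)]
    ;; Keys follow the input column order: Northing, Easting, Elevation, PointID.
    (zipmap [:y :x :z :p] fields)))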
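
To make "do something with them later" concrete, here is a sketch that works on in-memory lines instead of a file. line->record and the sample data are hypothetical, the keys are listed in the input's column order (Northing, Easting, Elevation, PointID), and doall forces the lazy for so the records exist before anything else happens.

```clojure
(require '[clojure.string :as string])

(defn line->record
  "Turns one CSV line into a map keyed by :y :x :z :p."
  [line]
  (zipmap [:y :x :z :p] (string/split line #",")))

(def records
  ;; doall realizes the lazy seq that for returns.
  (doall (for [line ["100.5,200.25,12.0,P1"
                     "101.0,201.0,15.5,P2"]]
           (line->record line))))

;; Later: point IDs whose elevation is above 13.0.
(map :p (filter #(> (Double/parseDouble (:z %)) 13.0) records))
;; => ("P2")
```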

Brian Carper answered Oct 10 '22 14:10