
Clojure: read a large text file and count occurrences

Tags: clojure

I'm trying to read a large text file and count occurrences of specific errors. For example, given the following sample text:

something
bla
error123
foo
test
error123
line
junk
error55
more
stuff

I want to end up with the following (I don't really care what data structure, although I am thinking of a map):

error123 - 2
error55 - 1

Here is what I have tried so far:

(require '[clojure.java.io :as io])

(defn find-error [line]
  (if (re-find #"error" line)
    line))


(defn read-big-file [func, filename]
  (with-open [rdr (io/reader filename)]
    (doall (map func (line-seq rdr)))))

Calling it like this:

 (read-big-file find-error "sample.txt")

returns:

(nil nil "error123" nil nil "error123" nil nil "error55" nil nil)

Next I tried to remove the nil values and group like items:

(group-by identity (remove #(= nil %) (read-big-file find-error "sample.txt")))

which returns

{"error123" ["error123" "error123"], "error55" ["error55"]}

This is getting close to the desired output, although it may not be efficient. How can I get the counts now? Also, as someone new to Clojure and functional programming, I would appreciate any suggestions on how I might improve this. Thanks!

asked Mar 23 '23 11:03 by Rob Buhler


1 Answer

I think you might be looking for the frequencies function:

user=> (doc frequencies)
-------------------------
clojure.core/frequencies
([coll])
  Returns a map from distinct items in coll to the number of times
  they appear.
nil

So, this should give you what you want:

(frequencies (remove nil? (read-big-file find-error "sample.txt")))
;;=> {"error123" 2, "error55" 1}
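
For completeness, the counts can also be derived from the group-by result shown in the question by replacing each group with its size. This is only a minimal sketch: the names grouped and counts are made up for illustration, and the map is written out literally instead of being read from sample.txt.

```clojure
;; The map that group-by returned in the question, written out literally.
;; `grouped` and `counts` are illustrative names, not part of the original code.
(def grouped {"error123" ["error123" "error123"], "error55" ["error55"]})

;; Walk the [key group] entries and keep the key with the size of its group.
(def counts
  (into {} (map (fn [[k v]] [k (count v)])) grouped))

counts
;;=> {"error123" 2, "error55" 1}
```

This does the same job as frequencies, only in two steps, so frequencies is the shorter choice when the groups themselves are not needed.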

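If your text file is really large, however, I would recommend calling frequencies directly on the line-seq, inside with-open, to ensure you don't run out of memory. read-big-file uses doall, which keeps one result per line of the file in memory, whereas frequencies consumes the lazy sequence as it is read and only holds on to the counts. This way you can also use a single filter rather than map followed by remove: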

(defn count-lines [pred, filename]
  (with-open [rdr (io/reader filename)]
    (frequencies (filter pred (line-seq rdr)))))

(defn is-error-line? [line]
  (re-find #"error" line))

(count-lines is-error-line? "sample.txt")
;; => {"error123" 2, "error55" 1}
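
In the sample every error sits alone on its line, so counting whole lines is enough. If the lines carried extra text around the error code, the same pattern could count the matched code instead of the line. The following is a sketch under assumptions: count-error-codes, the #"error\d+" pattern, and the log-like sample string are all hypothetical, and a StringReader stands in for the file so the example runs without sample.txt.

```clojure
(require '[clojure.java.io :as io])

(defn count-error-codes
  "Counts each distinct match of pattern over the lines of source.
  Hypothetical helper; source is anything io/reader accepts."
  [pattern source]
  (with-open [rdr (io/reader source)]
    ;; keep drops the nils that re-find returns for non-matching lines;
    ;; frequencies consumes the lazy seq before the reader is closed.
    (frequencies (keep #(re-find pattern %) (line-seq rdr)))))

;; Made-up log lines, used here in place of a real file.
(def sample
  "something\nerror123: disk full\nfoo\nerror123: disk full again\nerror55 timeout\nmore stuff")

(count-error-codes #"error\d+" (java.io.StringReader. sample))
;;=> {"error123" 2, "error55" 1}
```

Note that re-find returns only the first match on each line, so a line containing two error codes would be counted once, under the first code.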

answered Apr 28 '23 05:04 by mange