I'm learning Clojure and would like some advice on idiomatic usage. As part of a small statistics package, I have a function to calculate the mode of a set of data. (Background: The mode is the most common value in a set of data. There are almost a dozen published algorithms to calculate it. The one used here is from "Fundamentals of Biostatistics" 6th Ed by Bernard Rosner.)
(defn tally-map
" Create a map where the keys are all of the unique elements in the input
sequence and the values represent the number of times those elements
occur. Note that the keys may not be formatted as conventional Clojure
keys, i.e. a colon preceding a symbol."
[aseq]
(apply merge-with + (map (fn [x] {x 1}) aseq)))
(defn mode
" Calculate the mode. Rosner p. 13. The mode is problematic in that it may
not be unique and may not exist at all for a particular group of data.
If there is a single unique mode, it is returned. If there are multiple
modes, they are returned as a list. If there is no mode, that is all
elements are present in equal frequency, nil is returned."
[aseq]
(let [amap (tally-map aseq)
mx (apply max (vals amap))
k (keys amap)
f (fn [x] (not (nil? x)))
modes (filter f (map #(if (= mx (get amap %)) %) k))
]
(cond (= 1 (count modes)) (first modes)
(every? #(= mx %) (vals amap)) nil
:else modes)
)
)
There are a couple of things I have questions about:
Thanks in advance for the help.
In my opinion, mapping some function over a collection and then immediately condensing the list down to one item is a sign to use reduce
.
(defn tally-map [coll]
(reduce (fn [h n]
(assoc h n (inc (h n 0))))
{} coll))
In this case I'd write the mode
fn to take a single collection as an argument, as you did. The only reason I can think of to use multiple arguments for a function like this is if you plan to have to type literal arguments a lot.
So if e.g. this is for an interactive REPL script and you're often going to be typing (mode [1 2 1 2 3])
literally, then you should have the function take multiple arguments, to save you from typing the extra []
in the function call all the time. If you plan to read lots of numbers from a file and then take the mode of those numbers, then have the function take a single argument that is a collection so you can save yourself from using apply
all the time. I'm guessing your most common use case is the latter. I believe apply
also adds overhead that you avoid when you have a function call that takes a collection argument.
I agree with others that you should have mode
return a list of results even if there's only one; it'll make your life easier. Maybe rename it modes
while you're at it.
Here's a nice concise implementation of mode
:
(defn mode [data]
(first (last (sort-by second (frequencies data)))))
This exploits the following facts:
frequencies
function returns a map of values -> frequenciessecond
item in each pair), then the last item in the sequence will represent the modeEDIT
If you want to handle the multiple mode case then you can insert an extra partition-by
to keep all the values with the maximum frequency:
(defn modes [data]
(->> data
frequencies
(sort-by second)
(partition-by second)
last
(map first)))
Here's my take:
There are many core clojure functions that take sequences as arguments while others take multiple arguments, so there is no real idiomatic way in my opinion. If you already have your data in a sequence, I would use a seq as argument, since it will save you a call to apply.
I wouldn't write a function that returns a value in some cases and a list of values in others, because the calling code will always have to check the return value before using it. Instead I would return a single mode as a seq with just one item in it. But you may have your reasons, depending on the code that calls this function.
Apart from that I would rewrite the mode function like this:
(defn mode [aseq]
(let [amap (tally-map aseq)
mx (apply max (vals amap))
modes (map key (filter #(= mx (val %)) amap))
c (count modes)]
(cond
(= c 1) (first modes)
(= c (count amap)) nil
:default modes)))
Instead of defining a function f you could use the identity function (unless your data contains values that are logically false). But you don't even need that. I find the modes in a different way, which is more readable to me: The map amap acts as a sequence of map entries (key-value pairs). First I filter only those entries that have the value mx. Then I map the key function on these, giving me a sequence of keys.
To check whether there are any modes I don't loop over the map again. Instead I just compare the number of modes to the number of map entries. If they are equal, all elements have the same frequency!
Here's the function that always returns a seq:
(defn modes [aseq]
(let [amap (tally-map aseq)
mx (apply max (vals amap))
modes (map key (filter #(= mx (val %)) amap))]
(when (< (count modes) (count amap)) modes)))
Looks fine to me. I'd replace the
f (fn [x] (not (nil? x)))
mode (filter f (map #(if (= mx (get amap %)) %) k))
with
mode (remove nil? (map #(if (= mx (get amap %)) %) k))
(I don't know why something like not-nil?
isn't in clojure.core
; it's something one needs every day.)
If there is a single unique mode, it is returned. If there are multiple modes, they are returned as a list. If there is no mode, that is all elements are present in equal frequency, nil is returned."
You could think about simply returning a seq every time (one element or empty is fine); otherwise, the cases have to be differentiated by the calling code. By always returning a seq, your result will magically work as an argument to other functions that expect a seq.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With