Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to lazily collapse multiple contiguous items of a sequence into a single item

Tags:

clojure

[Note: Title and text were heavily edited to make it more clear that I'm not particularly after strings, but after general sequences, and lazy processing of same]

Using character sequences / strings as an example, say I want to turn a string like

"\t a\r      s\td \t \r \n          f \r\n"

into

" a s d f "

In more general terms, I want to turn all contiguous whitespace (or any other arbitray set of items) in a sequence into a single item, and that lazily.

I've come up with the following partition-by/mapcat combo, but wonder if there are easier or otherwise better ways (readability, performance, anything) to accomplish the same thing.

(defn is-wsp?
  [c]
  (if (#{\space \tab \newline \return} c) true))

(defn collapse-wsp
  [coll]
  (mapcat
   (fn [[first-elem :as s]]
     (if (is-wsp? first-elem) [\space] s))
   (partition-by is-wsp? coll)))

In action:

=> (apply str (collapse-wsp "\t    a\r          s\td  \t \r \n         f \r\n"))
" a s d f "

Update: I used strings / character sequences / wsp, as an example, but what I actually want is a generic function on sequences of any type that collapses arbitrary numbers of contiguous items, which are part of a predefined set of items, by some single predefined item. I'm particularly interested to know if there are better alternatives to partition-by/mapcat, not so much if this can be optimized for the 'string' special case.

Update 2:

Here's a fully lazy version - the one above isn't fully lazy, I fear, apart from that it is doing redundant is-wsp? checks. I generalized the parameter names etc. so it doesn't just look like something you could easily replace by a String.whatever() call - it's about arbitrary sequences.

(defn lazy-collapse
  ([coll is-collapsable-item? collapsed-item-representation] (lazy-collapse coll is-collapsable-item? collapsed-item-representation false))
  ([coll is-collapsable-item? collapsed-item-representation in-collapsable-segment?]
  (let [step (fn [coll in-collapsable-segment?]
               (when-let [item (first coll)]
                 (if (is-collapsable-item? item)
                   (if in-collapsable-segment?
                     (recur (rest coll) true)
                     (cons collapsed-item-representation (lazy-collapse (rest coll) is-collapsable-item? collapsed-item-representation true)))
                   (cons item (lazy-collapse (rest coll) is-collapsable-item? collapsed-item-representation false)))))]
    (lazy-seq (step coll in-collapsable-segment?)))))

This is fast, fully lazy, but I'd like to be able to express that more concisely, as I am quite lazy myself.

Benchmarks of the lazy collapsers so far: Whether the code is readable or not is easy to judge by looking at the code, but in order to see how they compare in terms of performance, here are my benchmarks. I first check if the function does what it is supposed to do, and then I spit out how long it takes to

  1. create the lazy seq 1M times
  2. create the lazy seq and take the first item 1M times
  3. create the lazy seq and take the second item 1M times
  4. create the lazy seq and take the last item (i.e. fully realize the lazy seq) 1M times

Tests 1 through 3 are meant to gauge the laziness at least a little bit. I ran the test a couple of times, and there was no significant variation in the execution times.

user=> (map
   (fn [collapse]
     (println (class collapse) (str "|" (apply str (collapse test-str is-wsp? \space)) "|"))
     (time (dotimes [_ 1000000] (collapse test-str is-wsp? \space)))
     (time (dotimes [_ 1000000] (first (collapse test-str is-wsp? \space))))
     (time (dotimes [_ 1000000] (second (collapse test-str is-wsp? \space))))
     (time (dotimes [_ 1000000] (last (collapse test-str is-wsp? \space)))))
   [collapse-overthink collapse-smith collapse-normand lazy-collapse])

user$collapse_overthink | a s d f |
"Elapsed time: 153.490591 msecs"
"Elapsed time: 3064.721629 msecs"
"Elapsed time: 4337.932487 msecs"
"Elapsed time: 24797.222682 msecs"

user$collapse_smith | a s d f |
"Elapsed time: 141.474904 msecs"
"Elapsed time: 812.998848 msecs"
"Elapsed time: 2112.331739 msecs"
"Elapsed time: 10750.224816 msecs"

user$collapse_normand | a s d f |
"Elapsed time: 314.978309 msecs"
"Elapsed time: 1423.779761 msecs"
"Elapsed time: 1669.660257 msecs"
"Elapsed time: 8074.759077 msecs"

user$lazy_collapse | a s d f |
"Elapsed time: 169.906088 msecs"
"Elapsed time: 638.030401 msecs"
"Elapsed time: 1195.445016 msecs"
"Elapsed time: 6050.945856 msecs"

Bottom line so far: The nicest code is slowest, the ugliest code is fastest. I'm pretty sure it doesn't have to be this way...

like image 605
SuperHorst Avatar asked Jul 18 '11 04:07

SuperHorst


4 Answers

Here's my fastest solution so far: (basically the same as M Smith's, but no destructuring)

(defn collapse [xs pred rep]
  (when-let [x (first xs)]
    (lazy-seq 
      (if (pred x)
        (cons rep (collapse (drop-while pred (rest xs)) pred rep))
        (cons x (collapse (rest xs) pred rep))))))

Here's a prettier solution, but 3x (!) slower: (effectively the same as SuperHorst's initial version...)

(defn collapse [col pred rep]
  (let [f (fn [[x & more :as xs]] (if (pred x) [rep] xs))]
    (mapcat f (partition-by #(if (pred %) true) col))))

Mini-benchmark (full code) output:

$ clj collapse.clj 
     SuperHorst: "Elapsed time: 58535.737037 msecs"
      Overthink: "Elapsed time: 70154.744605 msecs"
        M Smith: "Elapsed time: 89484.984606 msecs"
   Eric Normand: "Elapsed time: 83121.309838 msecs"

Example:

(def test-str "\t a\r      s\td \t \r \n          f \r\n")
(def is-ws? #{\space \tab \newline \return})

user=> (apply str (collapse test-str is-ws? \space))
" a s d f "

Used on a different type of seq:

user=> (collapse (range 1 110) #(= 2 (count (str %))) \X)
(1 2 3 4 5 6 7 8 9 \X 100 101 102 103 104 105 106 107 108 109)

It's fully lazy:

user=> (type (collapse test-str is-ws? \space))
clojure.lang.LazySeq
user=> (type (collapse (range 1 110) #(= 2 (count (str %))) \X))
clojure.lang.LazySeq

Old buggy version:

(defn collapse-bug [col pred rep]
  (let [f (fn [x] (if (pred x) rep x))]
    (map (comp f first) (partition-by f col))))

The bug is that it eats consecutive items not matching pred.

like image 133
overthink Avatar answered Nov 06 '22 13:11

overthink


Why not simply use Java's String function replaceAll ?

user=> (.replaceAll "\t    a\r          s\td  \t \r \n         f \r\n" "\\s+" " ")
" a s d f "

I think, that it's heavily optimized for such operations...

Update: updated version of answer after clarification:

(defn is-blank? [^Character c] (not (nil? (#{\space \tab \newline \return} c))))

(defn collapse [coll fun replacement]
  (first (reduce (fn [[res replaced?] el]
                   (if (fun el)
                     (if replaced?
                       [res replaced?]
                       [(conj res replacement) true])
                     [(conj res el) false]))
                 [[] false] coll)))

(def aa-str "\t    a\r          s\td  \t \r \n         f \r\n")

user> (apply str (collapse aa-str is-blank? \space))
" a s d f "

It also works with other seqs:

user> (collapse [1 1 2 3 1 1 1 4 5 6 1 1] #(= % 1) 9)
[9 2 3 9 4 5 6 9]

Another performance optimization could be use of transients instead of standard sequences...

like image 26
Alex Ott Avatar answered Nov 06 '22 13:11

Alex Ott


This is a fully-lazy implementation that is a bit cleaner:

<!-- language: lang-clojure -->
(defn collapse [ss pred replacement]  
    (lazy-seq    
          (if-let [s (seq ss)]   
              (let [[f & rr] s]  
               (if (pred f)   
                   (cons replacement (collapse (drop-while pred rr) pred replacement))  
                   (cons f (collapse rr pred replacement)))))))
like image 35
M Smith Avatar answered Nov 06 '22 14:11

M Smith


A fully lazy solution, written in a recursive style:

(defn collapse [l p v]
  (cond
    (nil? (seq l))
    nil
    (p (first l))
    (lazy-seq (cons v (collapse (drop-while p l) p v)))
    :otherwise
    (lazy-seq (cons (first l) (collapse (rest l) p v)))))

l is the list, p is the predicate, and v is the value to replace subsequences that match the predicate with.

If you're after pure speed at the expense of readability, you can do this:

(defn collapse-normand2
  ([l p v]
    (lazy-seq (collapse-normand2 (seq l) p v nil)))
  ([l p v _]
    (when l
      (lazy-seq
        (let [f (first l)
              r (next l)]
          (if (p f)
            (cons v (collapse-normand2 r p v nil nil))
            (cons f (collapse-normand2 r p v nil)))))))
  ([l p v _ _]
     (when l
       (lazy-seq
         (let [f (first l)
               r (next l)]
           (if (p f)
             (collapse-normand2 r p v nil nil)
             (collapse-normand2 r p v nil)))))))

There's probably a way to make this more readable. It does very well on all 4 tests.

like image 33
Eric Normand Avatar answered Nov 06 '22 13:11

Eric Normand