Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

speeding up deleting duplicates when they're adjacent

I'm looking for something like #'delete-duplicates, but I know that all elements of the list are already sorted, or inversely sorted, or at least arranged so that duplicates will already be adjacent to each other. I wish to use that knowledge to ensure that execution speed is not proporational to the square of the number of elements in the list. It's trivial to use #'maplist to grow my own solution, but is there something already in the language? It would be embarrassing to reinvent the wheel.

To be clear, for largish lengths of lists, I would like the running time of the deletion to be proportional to the length of the list, not proportional to the square of that length. This is the behavior I wish to avoid:

 1 (defun one-shot (cardinality)
 2   (labels ((generate-list (the-count)
 3              (let* ((the-list (make-list the-count)))
 4                (do ((iterator 0 (1+ iterator)))
 5                  ((>= iterator the-count))
 6                  (setf (nth iterator the-list) iterator))
 7                the-list)))
 8     (let* ((given-list (generate-list cardinality))
 9            (stripped-list)
10            (start-time)
11            (end-time))
12       (setf start-time (get-universal-time))
13       (setf stripped-list (delete-duplicates given-list :test #'eql))
14       (setf end-time (get-universal-time))
15       (princ "for n = ")
16       (princ cardinality)
17       (princ ", #'delete-duplicates took ")
18       (princ (- end-time start-time))
19       (princ " seconds")
20       (terpri))))
21 (one-shot 20000)
22 (one-shot 40000)
23 (one-shot 80000)
for n = 20000, #'delete-duplicates took 6 seconds
for n = 40000, #'delete-duplicates took 24 seconds
for n = 80000, #'delete-duplicates took 95 seconds
like image 994
Bill Evans at Mariposa Avatar asked Dec 08 '22 11:12

Bill Evans at Mariposa


2 Answers

There's nothing like this in the language, but something like this makes just one pass through the list:

(defun delete-adjacent-duplicates (list &key key (test 'eql))
  (loop
     for head = list then (cdr head)
     until (endp head)
     finally (return list)
     do (setf (cdr head)
              (member (if (null key) (car head)
                          (funcall key (car head)))
                      (cdr head)
                      :key key :test-not test))))

As, @wvxvw pointed out, it might be possible to simplify this iteration using (loop for head on list finally (return list) do ...). However, 3.6 Traversal Rules and Side Effects says that modifying the cdr chain of a list during an object-traversal leads to undefined behavior. However, it's not clear whether loop for head on list is technically an object-traversal operation or not. The documentation about loop says in 6.1.2.1.3 The for-as-on-list subclause that

In the for-as-on-list subclause, the for or as construct iterates over a list. … The variable var is bound to the successive tails of the list in form1. At the end of each iteration, the function step-fun is applied to the list; the default value for step-fun is cdr. … The for or as construct causes termination when the end of the list is reached.

This says that the step function is always applied at the end of the iteration, so it sounds like loop for head on list should be OK. At any rate, any possible issues could be avoided by using do loop instead:

(defun delete-adjacent-duplicates (list &key key (test 'eql))
  (do ((head list (cdr head)))
      ((endp head) list)
    (setf (cdr head)
          (member (if (null key) (car head)
                      (funcall key (car head)))
                  (cdr head)
                  :key key :test-not test))))

The idea is to start with head being the list, then setting its cdr to the first tail that starts with a different element, then advancing the head, and continuing until there's nothing left. This should be linear in the length of the list, assuming that member is implemented in a sensible way. The use of member means that you don't have to do any extra work to get :key and :test working in the appropriate way. (Do note that :test for del-dups is going to be the :test-not of member.) Note: there's actually a slight issue with this, in that the key function will called twice for each element in the final list: once when it's the first element of a tail, and once when it's the car of head.

CL-USER> (delete-adjacent-duplicates (list 1 1 1 1 2 2 3 3 3))
(1 2 3)
CL-USER> (delete-adjacent-duplicates (list 1 2 2))
(1 2)
CL-USER> (delete-adjacent-duplicates (list 1 3 5 6 4 2 3 5) :key 'evenp)
(1 6 3)

I expect that any linear time solution is going to take a similar approach; hold a reference to the current head, find the next tail that begins with a different element, and then make that tail the cdr of the head.

like image 175
Joshua Taylor Avatar answered Jan 02 '23 05:01

Joshua Taylor


I would expect REMOVE-DUPLICATES to have a linear time implementation. (And indeed it does* on my local SBCL install.)

Note that REMOVE-DUPLICATES and DELETE-DUPLICATES are specified to have the same return value, and that the side effects of DELETE-DUPLICATES are not guaranteed.

* The linear time code path is only taken when the :test is #'eq,#'eql, #'equal, or #'equalp (it relies on a hash table) and there is no :key or :test-not argument supplied.

like image 23
m-n Avatar answered Jan 02 '23 05:01

m-n