Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

clojure: Efficiently determining if a string begins with any prefix in a collection

Tags:

clojure

I'm have a collection of prefix/value pairs, and wish to find any value in this connection associated with a prefix that my current target string begins with. (It is not important that behavior be defined in the case where more than one prefix matches, as the nature of my use case is such that this should never occur).

A naive (working) implementation follows:

(defn prefix-match [target-str pairs]
  (some
    (fn [[k v]]
        (if (.startsWith target-str k)
            v
            false))
    pairs))

Such that:

user=> (prefix-match "foobar" {"meh" :qux, "foo" :baz})
:baz

This works as intended, but is O(n) with the length of the pairs sequence. (Fast insertion into pairs is also desirable, but not as important as fast lookup).

The first thing that comes to mind is bisecting a sorted collection with efficient random access, but I'm not sure which data structures in Clojure are most appropriate to the task. Suggestions?

like image 424
Charles Duffy Avatar asked Mar 21 '12 15:03

Charles Duffy


4 Answers

An efficient, terse approach is to take advantage of rsubseq, which works on any type implementing clojure.lang.Sorted -- which includes sorted-map.

(defn prefix-match [sorted-map target]
  (let [[closest-match value] (first (rsubseq sorted-map <= target))]
    (if closest-match
      (if (.startsWith target closest-match)
        value
        nil)
      nil)))

This passes the relevant tests in my suite:

(deftest prefix-match-success
  (testing "prefix-match returns a successful match"
    (is (prefix-match (sorted-map "foo" :one "bar" :two) "foobar") :one)
    (is (prefix-match (sorted-map "foo" :one "bar" :two) "foo") :one)))

(deftest prefix-match-fail
  (testing "prefix-match returns nil on no match"
    (is (= nil (prefix-match (sorted-map "foo" :one, "bar" :two) "bazqux")))
    (is (= nil (prefix-match (sorted-map "foo" :one, "bar" :two) "zzz")))
    (is (= nil (prefix-match (sorted-map "foo" :one, "bar" :two) "aaa")))))
like image 113
Charles Duffy Avatar answered Nov 17 '22 16:11

Charles Duffy


How about a trie?

(defn build-trie [seed & kvs]
  (reduce
   (fn [trie [k v]]
     (assoc-in trie (concat k [:val]) v))
   seed
   (partition 2 kvs)))

(defn prefix-match [target trie]
  (when (seq target)
    (when-let [node (trie (first target))]
      (or (:val node)
          (recur (rest target) node)))))

Usage:

user> (def trie (build-trie {} "foo" :baz "meh" :qux))
#'user/trie
user> trie
{\m {\e {\h {:val :qux}}}, \f {\o {\o {:val :baz}}}}
user> (prefix-match "foobar" trie)
:baz
user> (prefix-match "foo" trie)
:baz
user> (prefix-match "f" trie)
nil
user> (prefix-match "abcd" trie)
nil
like image 45
Justin Kramer Avatar answered Nov 17 '22 17:11

Justin Kramer


It seems simplest to just turn the list of prefixes into a regular expression, and feed those into a regex matcher, which is optimized for exactly this sort of task. Something like

(java.util.regex.Pattern/compile (str "^"
                                      "(?:"
                                      (clojure.string/join "|"
                                                           (map #(java.util.regex.Pattern/quote %)
                                                                prefixes))
                                      ")"))

Should get you a regex suitable for testing against a string (but I haven't tested it at all, so maybe I got some method names wrong or something).

like image 2
amalloy Avatar answered Nov 17 '22 16:11

amalloy


The following solution finds the longest matching prefix and works surprisingly well when the map is huge and strings are relatively short. It tries to match e.g. "foobar", "fooba", "foob", "foo", "fo", "f" in order and returns the first match.

(defn prefix-match
  [s m]
  (->> (for [end (range (count s) 0 -1)] (.subSequence s 0 end)) ; "foo", "fo", "f"
       (map m)           ; match "foo", match "fo", ...
       (remove nil?)     ; ignore unmatched
       (first)))         ; Take first and longest match
like image 2
jkj Avatar answered Nov 17 '22 17:11

jkj