I'm have a collection of prefix/value pairs, and wish to find any value in this connection associated with a prefix that my current target string begins with. (It is not important that behavior be defined in the case where more than one prefix matches, as the nature of my use case is such that this should never occur).
A naive (working) implementation follows:
(defn prefix-match [target-str pairs]
(some
(fn [[k v]]
(if (.startsWith target-str k)
v
false))
pairs))
Such that:
user=> (prefix-match "foobar" {"meh" :qux, "foo" :baz})
:baz
This works as intended, but is O(n) with the length of the pairs
sequence. (Fast insertion into pairs
is also desirable, but not as important as fast lookup).
The first thing that comes to mind is bisecting a sorted collection with efficient random access, but I'm not sure which data structures in Clojure are most appropriate to the task. Suggestions?
An efficient, terse approach is to take advantage of rsubseq
, which works on any type implementing clojure.lang.Sorted
-- which includes sorted-map
.
(defn prefix-match [sorted-map target]
(let [[closest-match value] (first (rsubseq sorted-map <= target))]
(if closest-match
(if (.startsWith target closest-match)
value
nil)
nil)))
This passes the relevant tests in my suite:
(deftest prefix-match-success
(testing "prefix-match returns a successful match"
(is (prefix-match (sorted-map "foo" :one "bar" :two) "foobar") :one)
(is (prefix-match (sorted-map "foo" :one "bar" :two) "foo") :one)))
(deftest prefix-match-fail
(testing "prefix-match returns nil on no match"
(is (= nil (prefix-match (sorted-map "foo" :one, "bar" :two) "bazqux")))
(is (= nil (prefix-match (sorted-map "foo" :one, "bar" :two) "zzz")))
(is (= nil (prefix-match (sorted-map "foo" :one, "bar" :two) "aaa")))))
How about a trie?
(defn build-trie [seed & kvs]
(reduce
(fn [trie [k v]]
(assoc-in trie (concat k [:val]) v))
seed
(partition 2 kvs)))
(defn prefix-match [target trie]
(when (seq target)
(when-let [node (trie (first target))]
(or (:val node)
(recur (rest target) node)))))
Usage:
user> (def trie (build-trie {} "foo" :baz "meh" :qux))
#'user/trie
user> trie
{\m {\e {\h {:val :qux}}}, \f {\o {\o {:val :baz}}}}
user> (prefix-match "foobar" trie)
:baz
user> (prefix-match "foo" trie)
:baz
user> (prefix-match "f" trie)
nil
user> (prefix-match "abcd" trie)
nil
It seems simplest to just turn the list of prefixes into a regular expression, and feed those into a regex matcher, which is optimized for exactly this sort of task. Something like
(java.util.regex.Pattern/compile (str "^"
"(?:"
(clojure.string/join "|"
(map #(java.util.regex.Pattern/quote %)
prefixes))
")"))
Should get you a regex suitable for testing against a string (but I haven't tested it at all, so maybe I got some method names wrong or something).
The following solution finds the longest matching prefix and works surprisingly well when the map is huge and strings are relatively short. It tries to match e.g. "foobar", "fooba", "foob", "foo", "fo", "f" in order and returns the first match.
(defn prefix-match
[s m]
(->> (for [end (range (count s) 0 -1)] (.subSequence s 0 end)) ; "foo", "fo", "f"
(map m) ; match "foo", match "fo", ...
(remove nil?) ; ignore unmatched
(first))) ; Take first and longest match
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With