I need information about any standard Python package which can be used for "longest prefix match" on URLs. I have gone through two packages, PyTrie (http://packages.python.org/PyTrie/#pytrie.StringTrie) and trie 0.1.1 (http://pypi.python.org/pypi/trie/0.1.1), but they don't seem to be useful for the longest-prefix-match task on URLs.
Example: if my set has these URLs: 1 -> http://www.google.com/mail, 2 -> http://www.google.com/document, 3 -> http://www.facebook.com, etc.
Now if I search for 'http://www.google.com/doc' then it should return 2, and a search for 'http://www.face' should return 3.
I wanted to confirm whether there is any standard Python package which can help me do this, or whether I should implement a trie for prefix matching.
I am not looking for a regular-expression kind of solution, since it does not scale as the number of URLs increases.
Thanks a lot.
The Longest Match Routing Rule, also called longest prefix match or maximum prefix length match, is an algorithm used by routers in Internet Protocol (IP) networking to select an entry from a routing table. Because each entry in a forwarding table may specify a sub-network, one destination address may match more than one entry. The router therefore prefers the longest, i.e. most specific, matching prefix: when it receives an IP packet, it compares the destination IP address bit by bit with the prefixes in the routing table, and the longest match determines the egress (outbound) interface and the address of the next device to which to send the packet.
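As a concrete illustration, here is a minimal sketch of such a lookup using Python's standard ipaddress module (the table entries and interface names are invented for illustration):

import ipaddress

# Hypothetical routing table: prefix -> egress interface.
routes = {
    ipaddress.ip_network('192.168.0.0/16'): 'eth0',
    ipaddress.ip_network('192.168.20.0/24'): 'eth1',
}

def route(destination):
    addr = ipaddress.ip_address(destination)
    # All table prefixes that contain the destination address...
    matches = [net for net in routes if addr in net]
    if not matches:
        raise KeyError('no route to %s' % destination)
    # ...of which the longest (most specific) prefix wins.
    return routes[max(matches, key=lambda net: net.prefixlen)]

print(route('192.168.20.19'))  # -> 'eth1': the /24 beats the /16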
A benchmark of suffixtree vs. pytrie vs. trie vs. datrie vs. startswith functions.

The recorded time is the minimum time among 3 repetitions of 1000 searches. Trie construction time is included and spread among all searches. The search is performed on collections of hostnames from 1 to 1,000,000 items.
Three types of search string:

- non_existent_key: there is no match for the string
- rare_key: around 20 in a million
- frequent_key: the number of occurrences is comparable to the collection size

Memory consumption:

| function    | memory, GiB | ratio |
|-------------+-------------+-------|
| suffix_tree |       0.853 |   1.0 |
| pytrie      |       3.383 |   4.0 |
| trie        |       3.803 |   4.5 |
| datrie      |       0.194 |   0.2 |
| startswith  |       0.069 |   0.1 |
To reproduce the results, run the trie benchmark code.
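The link to the benchmark code is not preserved here, but the methodology described above amounts to something like the following sketch (build and search are placeholder callables, not part of the original benchmark):

import timeit

def bench(build, search, urls, keys):
    # Time the searches, take the minimum of 3 repetitions, and spread
    # the one-off construction cost across all searches.
    def run():
        structure = build(urls)    # construction time is included
        for key in keys:           # e.g. 1000 search keys per repetition
            search(structure, key)
    return min(timeit.repeat(run, repeat=3, number=1)) / len(keys)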
rare_key / non_existent_key case

If the number of URLs is less than 10,000, datrie is the fastest; for N > 10,000, suffixtree is faster, and startswith is significantly slower on average.
(Figures omitted: search time vs. collection size, on a log/log scale.)
frequent_key case

Up to N = 100,000, datrie is the fastest (for a million URLs the time is dominated by the trie construction time).
Most of the time is taken by finding the longest match among the found matches, so all functions behave similarly, as expected.
In summary:

- startswith: time performance is independent of the type of key.
- trie and pytrie behave similarly to each other.
- datrie: the fastest, with decent memory consumption (see the sketch after this list).
- startswith is at an even greater disadvantage here, because the other approaches are not penalized by the time it takes to build a trie.
- datrie, pytrie, trie: almost O(1) (constant time) for a rare/non-existent key.
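For reference, a minimal sketch of the datrie approach applied to the question's task (assuming the datrie package; using string.printable as the key alphabet is an assumption, not part of the benchmark):

import string
import datrie

# datrie requires declaring the key alphabet up front; string.printable
# covers plain-ASCII URLs (an assumption).
url_trie = datrie.Trie(string.printable)
for url in ['http://www.google.com/mail',
            'http://www.google.com/document',
            'http://www.facebook.com']:
    url_trie[url] = url

# All stored URLs that start with the search string, then the longest one.
matches = url_trie.keys('http://www.google.com/doc')
print(max(matches, key=len) if matches else None)
# -> http://www.google.com/document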
Fitting (approximating) polynomials of known functions, for comparison (same log/log scale as in the figures); on a log/log scale a dependence t ∝ N**k shows up as a straight line with slope k:

| Fitting polynomial       | Function        |
|--------------------------+-----------------|
| 0.15 log2(N) + 1.583     | log2(N)         |
| 0.30 log2(N) + 3.167     | log2(N)*log2(N) |
| 0.50 log2(N) + 1.111e-15 | sqrt(N)         |
| 0.80 log2(N) + 7.943e-16 | N**0.8          |
| 1.00 log2(N) + 2.223e-15 | N               |
| 2.00 log2(N) + 4.446e-15 | N*N             |
This example is good for small URL lists but does not scale well.
def longest_prefix_match(search, urllist):
    # Keep every URL that starts with the search string and return the longest.
    matches = [url for url in urllist if url.startswith(search)]
    if matches:
        return max(matches, key=len)
    raise ValueError("no prefix match for %r" % search)
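For the URL list from the question this returns the expected entry:

urllist = ['http://www.google.com/mail',
           'http://www.google.com/document',
           'http://www.facebook.com']
print(longest_prefix_match('http://www.google.com/doc', urllist))
# -> http://www.google.com/document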
An implementation using the trie module.
import trie

def longest_prefix_match(prefix_trie, search):
    # There may well be a more elegant way to do this without using
    # the "hidden" method _getnode.
    try:
        # Walk the subtree rooted at the search string and collect the
        # values of all keys that extend it.
        return [node.value for node in prefix_trie._getnode(search).walk()]
    except KeyError:
        return []
url_list = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = trie.Trie()
for url in url_list:
    url_trie[url] = url

searches = ("http", "http://www.go", "http://www.fa", "http://fail")
for search in searches:
    print("'%s' ->" % search, longest_prefix_match(url_trie, search))
Result:
'http' -> ['http://www.facebook.com', 'http://www.google.com/document', 'http://www.google.com/mail']
'http://www.go' -> ['http://www.google.com/document', 'http://www.google.com/mail']
'http://www.fa' -> ['http://www.facebook.com']
'http://fail' -> []
Or using PyTrie, which gives the same result but with the lists ordered differently:
from pytrie import StringTrie

url_list = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = StringTrie()
for url in url_list:
    url_trie[url] = url

searches = ("http", "http://www.go", "http://www.fa", "http://fail")
for search in searches:
    print("'%s' ->" % search, url_trie.values(prefix=search))
I'm beginning to think a radix tree / Patricia tree would be better from a memory usage point of view: a radix tree collapses chains of single-child nodes into one edge, whereas a plain character trie keeps one node per character. (The original answer illustrated both trees with diagrams; a rough quantitative sketch follows.)
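The difference is easy to quantify for the example URL list, since a plain character trie needs one node per distinct non-empty prefix of the stored keys, while a radix tree needs nodes only at branching points and key ends:

urls = ['http://www.google.com/mail',
        'http://www.google.com/document',
        'http://www.facebook.com']

# One trie node per distinct non-empty prefix of the stored keys.
prefixes = {url[:i] for url in urls for i in range(1, len(url) + 1)}
print(len(prefixes))  # -> 46 character-trie nodes for these three URLs

# The radix tree for the same data has just a root, a branch node after
# 'http://www.', a branch node after 'google.com/', and one leaf per
# URL: 6 nodes in total.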
The function below returns the index of the list entry that shares the longest common prefix with the search string; other useful information (the prefix itself, its length) can easily be extracted as well.
from os.path import commonprefix as oscp

def longest_prefix(s, slist):
    # Pair each URL's common prefix with s together with its index...
    pfx_idx = ((oscp([s, url]), i) for i, url in enumerate(slist))
    # ...then pick the entry whose common prefix is longest.
    length, pfx, idx = max((len(pfx), pfx, i) for pfx, i in pfx_idx)
    return idx
slist = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

print(longest_prefix('http://www.google.com/doc', slist))  # -> 1 (the question's entry 2)
print(longest_prefix('http://www.face', slist))            # -> 2 (the question's entry 3)