Longest Prefix Matches for URLs

Tags: python, url, trie, longest-prefix

I need information about any standard Python package that can be used for "longest prefix match" on URLs. I have gone through two packages, http://packages.python.org/PyTrie/#pytrie.StringTrie and http://pypi.python.org/pypi/trie/0.1.1, but they don't seem to be useful for the longest prefix match task on URLs.

For example, if my set has these URLs: 1 -> http://www.google.com/mail, 2 -> http://www.google.com/document, 3 -> http://www.facebook.com, etc.

Now if I search for 'http://www.google.com/doc' it should return 2, and a search for 'http://www.face' should return 3.

I wanted to confirm whether there is any standard Python package that can help me do this, or should I implement a trie for prefix matching?

I am not looking for a regular-expression kind of solution, since it does not scale as the number of URLs increases.

Thanks a lot.

937

asked Mar 25 '11 15:03

Amit

3 Answers

Performance comparison

`suffixtree` vs. `pytrie` vs. `trie` vs. `datrie` vs. `startswith` functions

Setup

The recorded time is the minimum over 3 repetitions of 1000 searches. Trie construction time is included and spread across all searches. The search is performed on collections of hostnames containing from 1 to 1000000 items.

Three types of search string:

- non_existent_key - there is no match for the string
- rare_key - around 20 matches in a million
- frequent_key - number of occurrences is comparable to the collection size
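
The measurement scheme described above (minimum over 3 repetitions of 1000 searches) can be sketched with the standard `timeit` module. This is only an illustration of the scheme, not the benchmark code itself: the `search` function and the three-item `urls` list are placeholders assumed for the example.

```python
import timeit

# Placeholder data and search function (assumptions for illustration only).
urls = [
    "http://www.google.com/mail",
    "http://www.google.com/document",
    "http://www.facebook.com",
]

def search(prefix):
    # Linear scan: every URL that starts with the prefix.
    return [u for u in urls if u.startswith(prefix)]

# 3 repetitions of 1000 searches each; keep the fastest repetition.
timings = timeit.repeat(lambda: search("http://www.go"), number=1000, repeat=3)
best = min(timings)                  # seconds for 1000 searches
per_search_us = best / 1000 * 1e6    # microseconds per single search
print("%.3f us per search" % per_search_us)
```

Taking the minimum rather than the mean reduces the influence of unrelated system load on the reported time.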

Results

Maximum memory consumption for a million urls (ratio is relative to suffix_tree):

| function    | memory, GiB | ratio |
|-------------+-------------+-------|
| suffix_tree |       0.853 |   1.0 |
| pytrie      |       3.383 |   4.0 |
| trie        |       3.803 |   4.5 |
| datrie      |       0.194 |   0.2 |
| startswith  |       0.069 |   0.1 |

To reproduce the results, run the trie benchmark code.

rare_key/nonexistent_key case

If the number of urls is less than 10000, datrie is the fastest; for N>10000, suffixtree is faster. startswith is significantly slower on average.

[Figure: rare_key]

axes:
- vertical (time) scale is ~1 second (2**20 microseconds)
- horizontal axis shows total number of urls in each case: N= 1, 10, 100, 1000, 10000, 100000, and 1000000 (a million).

[Figure: nonexistent_key]

frequent_key case

Up to N=100000, datrie is the fastest (for a million urls the time is dominated by the trie construction time).

Most of the time is spent finding the longest match among the found matches, so all functions behave similarly, as expected.

[Figure: frequent_key]

The time performance of startswith is independent of the type of key.

trie and pytrie behave similarly to each other.

Performance without trie construction time

- datrie is the fastest and has decent memory consumption
- startswith is at an even greater disadvantage here, because the other approaches are no longer penalized by the time it takes to build a trie
- datrie, pytrie, trie are almost O(1) (constant time) for rare/non_existent keys

[Figures: rare_key_no_trie_build_time, nonexistent_key_no_trie_build_time, frequent_key_no_trie_build_time]

Fitting (approximating) polynomials of known functions for comparison (same log/log scale as in the figures, so a power N**k appears as a line with slope k):

| Fitting polynomial           | Function          |
|------------------------------+-------------------|
| 0.15  log2(N)   +      1.583 | log2(N)           |
| 0.30  log2(N)   +      3.167 | log2(N)*log2(N)   |
| 0.50  log2(N)   +  1.111e-15 | sqrt(N)           |
| 0.80  log2(N)   +  7.943e-16 | N**0.8            |
| 1.00  log2(N)   +  2.223e-15 | N                 |
| 2.00  log2(N)   +  4.446e-15 | N*N               |
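
To see why a power of N shows up as a straight line in this table, here is a small sketch that fits a line to log2(f(N)) against log2(N) with ordinary least squares. It assumes the fit is taken over the benchmark's collection sizes (1, 10, ..., 1000000); the exact points used for the table above are not stated, and `fit_line` is a helper made up for this example.

```python
import math

def fit_line(xs, ys):
    # Ordinary least-squares fit of y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Assumed sample points: the collection sizes used in the benchmark.
sizes = [10 ** k for k in range(7)]
log_n = [math.log2(n) for n in sizes]

slope_sqrt, _ = fit_line(log_n, [math.log2(math.sqrt(n)) for n in sizes])
slope_sq, _ = fit_line(log_n, [math.log2(n * n) for n in sizes])
print(round(slope_sqrt, 2), round(slope_sq, 2))
```

The fitted slopes come out at about 0.5 for sqrt(N) and 2.0 for N*N, matching the corresponding rows of the table, so the slope of a measured curve in the figures can be read as an approximate exponent.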

178

answered Oct 22 '22 15:10

jfs

This example is good for small url lists but does not scale well, because every search scans the whole list.

def longest_prefix_match(search, urllist):
    matches = [url for url in urllist if url.startswith(search)]
    if matches:
        return max(matches, key=len)
    else:
        raise Exception("Not found")
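
If the list can be kept sorted, one alternative that avoids the full scan is a binary search. This is a different technique from the snippet above, sketched here under the assumption that the URLs are plain strings sorted once up front; `longest_prefix_match_sorted` is a name made up for this example. It relies on the fact that all strings starting with `search` sit next to each other in sorted order.

```python
from bisect import bisect_left

def longest_prefix_match_sorted(search, sorted_urls):
    # Every string that starts with `search` is >= `search`, and such
    # strings are contiguous in a sorted list, so start at the
    # insertion point and walk forward while the prefix still matches.
    i = bisect_left(sorted_urls, search)
    best = None
    while i < len(sorted_urls) and sorted_urls[i].startswith(search):
        if best is None or len(sorted_urls[i]) > len(best):
            best = sorted_urls[i]
        i += 1
    return best  # None when no URL starts with `search`

urls = sorted([
    "http://www.google.com/mail",
    "http://www.google.com/document",
    "http://www.facebook.com",
])

print(longest_prefix_match_sorted("http://www.google.com/doc", urls))
print(longest_prefix_match_sorted("http://www.face", urls))
print(longest_prefix_match_sorted("http://fail", urls))
```

The lookup costs O(log N) to find the start plus the number of matching URLs, so it still degrades for very frequent prefixes, much like the frequent_key case in the benchmark.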

An implementation using the trie module.

import trie


def longest_prefix_match(prefix_trie, search):
    # There may well be a more elegant way to do this without using
    # "hidden" method _getnode.
    try:
        return list(node.value for node in prefix_trie._getnode(search).walk())
    except KeyError:
        return list()

url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = trie.Trie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print("'%s' -> %s" % (search, longest_prefix_match(url_trie, search)))

Result:

'http' -> ['http://www.facebook.com', 'http://www.google.com/document', 'http://www.google.com/mail']
'http://www.go' -> ['http://www.google.com/document', 'http://www.google.com/mail']
'http://www.fa' -> ['http://www.facebook.com']
'http://fail' -> []

Or using PyTrie, which gives the same result, although the lists are ordered differently.

from pytrie import StringTrie


url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = StringTrie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print("'%s' -> %s" % (search, url_trie.values(prefix=search)))

I'm beginning to think a radix tree / Patricia tree would be better from a memory usage point of view. This is what a radix tree would look like:

[Figure: radix tree of example URLs]

Whereas the trie looks more like:

[Figure: trie of example URLs]
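
For comparison with the library versions, here is a minimal pure-Python sketch of the plain trie (one node per character). It is not the implementation used by either package; `TrieNode`, `insert` and `values_with_prefix` are names made up for this example. A radix tree would merge each chain of single-child nodes into one node, which is where the memory saving comes from.

```python
class TrieNode(object):
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.value = None    # the stored URL if a key ends at this node

def insert(root, key):
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.value = key

def values_with_prefix(root, prefix):
    # Walk down to the node for the prefix; a missing child means no match.
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    # Collect every stored value in the subtree below that node.
    found, stack = [], [node]
    while stack:
        node = stack.pop()
        if node.value is not None:
            found.append(node.value)
        stack.extend(node.children.values())
    return sorted(found)

root = TrieNode()
for url in ["http://www.google.com/mail",
            "http://www.google.com/document",
            "http://www.facebook.com"]:
    insert(root, url)

print(values_with_prefix(root, "http://www.go"))
print(values_with_prefix(root, "http://fail"))
```

Reaching the prefix node takes time proportional to the length of the search string, independent of the number of stored URLs; the remaining cost is collecting the matches below it.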

answered Oct 22 '22 15:10

Stephen Paulger

The function below will return the index of the longest match. Other useful information can easily be extracted as well.

from os.path import commonprefix as oscp

def longest_prefix(s, slist):
    pfx_idx = ((oscp([s, url]), i) for i, url in enumerate(slist))
    len_pfx_idx = map(lambda t: (len(t[0]), t[0], t[1]), pfx_idx)
    length, pfx, idx = max(len_pfx_idx)
    return idx

slist = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

print(longest_prefix('http://www.google.com/doc', slist))
print(longest_prefix('http://www.face', slist))
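
Note that `longest_prefix` always returns an index, even when no URL starts with the whole search string (for 'http://fail' every URL shares only 'http://' with it). If a miss should be reported instead, a variant along these lines could be used. This is a sketch, not part of the answer above: `longest_prefix_or_none` is a made-up name, and among full matches it picks the longest URL, which is a tie-breaking choice assumed for the example.

```python
from os.path import commonprefix

def longest_prefix_or_none(s, slist):
    # Accept a URL only if the common prefix is the whole search string,
    # i.e. the URL really starts with s.
    best_idx = None
    for i, url in enumerate(slist):
        if commonprefix([s, url]) == s:
            if best_idx is None or len(url) > len(slist[best_idx]):
                best_idx = i
    return best_idx  # None when nothing starts with s

slist = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

print(longest_prefix_or_none('http://www.google.com/doc', slist))
print(longest_prefix_or_none('http://fail', slist))
```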

answered Oct 22 '22 16:10

Tom Zych
