What's a faster way to look up a value in a list of tuples?

Tags:

I'm looking up country by ip range for tens of millions rows. I'm looking for a faster way to do the lookup.

I have 180K tuples in this form:

>>> data = ((0, 16777215, 'ZZ'),
...         (1000013824, 1000079359, 'CN'),
...         (1000079360, 1000210431, 'JP'),
...         (1000210432, 1000341503, 'JP'),
...         (1000341504, 1000603647, 'IN'))

(The integers are ip addresses converted into plain numbers.)

This does the job right, but just takes too long:

>>> ip_to_lookup = 999
>>> country_result = [country
...                   for (from, to, country) in data
...                   if (ip_to_lookup >= from) and
...                      (ip_to_lookup <= to)][0]
>>> print country_result
ZZ

Can anyone point me in the right direction to a faster way of doing this lookup? Using the method above, 100 lookups take 3 seconds. Meaning, I think, 10M rows will take several days.

631

asked Mar 17 '12 17:03

2 Answers

You can use the bisect module to perform a binary search after you sorted the dataset:

from operator import itemgetter
import bisect

data = ((0, 16777215, 'ZZ'), (1000013824, 1000079359, 'CN'), (1000079360, 1000210431, 'JP'), (1000210432, 1000341503, 'JP'), (1000341504, 1000603647, 'IN'))
sorted_data = sorted(data, key=itemgetter(0))
lower_bounds = [lower for lower,_,_ in data]

def lookup_ip(ip):
    index = bisect.bisect(lower_bounds, ip) - 1
    if index < 0:
        return None
    _, upper, country = sorted_data[index]
    return country if ip <= upper else None

print(lookup_ip(-1))          # => None
print(lookup_ip(999))         # => 'ZZ'
print(lookup_ip(16777216))    # => None
print(lookup_ip(1000013824))  # => 'CN'
print(lookup_ip(1000400000))  # => 'IN'

The algorithmic complexity of the lookup is O(log n) here, instead of O(n) for a complete list walk.

178

answered Oct 23 '22 19:10

Niklas B.

Assuming your situation meets some requirements, there is a way to get the runtime complexity to O(1) on average, but space complexity suffers.

The data must be static; all data must be processed before any lookups.
Given an arbitrary IP, there must be a way to determine its significant octets.
There must be enough space to add a key for every significant value for each country.

Below is a very naive implementation. It selects the first two octets of the IP as significant no matter what, then concatenates the significant octets as integers and incrementally adds a key for each value between the mininum and maximum. As you can probably tell, there is much room for improvement.

from socket import inet_ntoa
from struct import pack

data = ((0,             16777215,   'ZZ'),
        (1000013824,    1000079359, 'CN'),
        (1000079360,    1000210431, 'JP'),
        (1000210432,    1000341503, 'JP'),
        (1000341504,    1000603647, 'IN'))

def makedict(minip, maxip, country):
    d = {}
    for i in xrange(key(minip), key(maxip)+1):
        d[i] = country
    return d

def key(ip):
    octets = inet_ntoa(pack('>L', ip)).split('.')
    return int("".join(octets[0:2]));

lookup = {}
for lo, hi, cnt in data:
    lookup.update(makedict(lo, hi, cnt))

print lookup[key(999)]          # ZZ
print lookup[key(1000215555)]   # JP

answered Oct 23 '22 20:10

Matt Eckert

Related questions
                            
                                Python - Multiprocessing.processes become copies of the main process when run from executable
                            
                                Parallel recursive function in Python
                            
                                Python: Conditional variables based on whether nosetest is running
                            
                                Mark block based on indentation level in Vim
                            
                                How can I change the resolution of a raster using GDAL?
                            
                                How can I print and display subprocess stdout and stderr output without distortion?
                            
                                Are there any working examples of Zolera SOAP Infrastructure (ZSI)? [closed]
                            
                                couchdb-python change notifications
                            
                                Which Python to use on Windows for Numpy and friends?
                            
                                PySerial not talking to Arduino
                            
                                How do I autosize text in matplotlib python?
                            
                                pip install requirement fails
                            
                                share data using Manager() in python multiprocessing module
                            
                                Intercept operator lookup on metaclass
                            
                                Python digest/hash for string similarity
                            
                                Highlighting python code blocks in vim
                            
                                Why does this url raise BadStatusLine with httplib2 and urllib2?
                            
                                __add__ all elements of a list
                            
                                is there any kind of performance gain while using .pyc files in python?
                            
                                Invoking Jython from Python (or Vice Versa)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What's a faster way to look up a value in a list of tuples?

Tags:

python

sorting

search

tuples

exzackley

People also ask

2 Answers

Niklas B.

Matt Eckert

Recent Activity

Donate For Us