Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's a faster way to look up a value in a list of tuples?

I'm looking up country by ip range for tens of millions rows. I'm looking for a faster way to do the lookup.

I have 180K tuples in this form:

>>> data = ((0, 16777215, 'ZZ'),
...         (1000013824, 1000079359, 'CN'),
...         (1000079360, 1000210431, 'JP'),
...         (1000210432, 1000341503, 'JP'),
...         (1000341504, 1000603647, 'IN'))

(The integers are ip addresses converted into plain numbers.)

This does the job right, but just takes too long:

>>> ip_to_lookup = 999
>>> country_result = [country
...                   for (from, to, country) in data
...                   if (ip_to_lookup >= from) and
...                      (ip_to_lookup <= to)][0]
>>> print country_result
ZZ

Can anyone point me in the right direction to a faster way of doing this lookup? Using the method above, 100 lookups take 3 seconds. Meaning, I think, 10M rows will take several days.

like image 631
exzackley Avatar asked Mar 17 '12 17:03

exzackley


People also ask

How do you search a list of tuples?

Supposing the list may be long and the numbers may repeat, consider using the SortedList type from the Python sortedcontainers module. The SortedList type will automatically maintain the tuples in order by number and allow for fast searching.

How do you find the value of tuples?

Find the index of an element in tuple using index() Tuple provides a member function index() i.e. It returns the index for first occurrence of x in the tuple. Also, if element is not found in tuple then it will throw an exception ValueError.

Which is fast between list and tuple?

Creating a tuple is faster than creating a list. Creating a list is slower because two memory blocks need to be accessed. An element in a tuple cannot be removed or replaced. An element in a list can be removed or replaced.

How do you access items in a list of tuples?

In the majority of programming languages when you need to access a nested data type (such as arrays, lists, or tuples), you append the brackets to get to the innermost item. The first bracket gives you the location of the tuple in your list. The second bracket gives you the location of the item in the tuple.


2 Answers

You can use the bisect module to perform a binary search after you sorted the dataset:

from operator import itemgetter
import bisect

data = ((0, 16777215, 'ZZ'), (1000013824, 1000079359, 'CN'), (1000079360, 1000210431, 'JP'), (1000210432, 1000341503, 'JP'), (1000341504, 1000603647, 'IN'))
sorted_data = sorted(data, key=itemgetter(0))
lower_bounds = [lower for lower,_,_ in data]

def lookup_ip(ip):
    index = bisect.bisect(lower_bounds, ip) - 1
    if index < 0:
        return None
    _, upper, country = sorted_data[index]
    return country if ip <= upper else None

print(lookup_ip(-1))          # => None
print(lookup_ip(999))         # => 'ZZ'
print(lookup_ip(16777216))    # => None
print(lookup_ip(1000013824))  # => 'CN'
print(lookup_ip(1000400000))  # => 'IN'

The algorithmic complexity of the lookup is O(log n) here, instead of O(n) for a complete list walk.

like image 178
Niklas B. Avatar answered Oct 23 '22 19:10

Niklas B.


Assuming your situation meets some requirements, there is a way to get the runtime complexity to O(1) on average, but space complexity suffers.

  1. The data must be static; all data must be processed before any lookups.
  2. Given an arbitrary IP, there must be a way to determine its significant octets.
  3. There must be enough space to add a key for every significant value for each country.

Below is a very naive implementation. It selects the first two octets of the IP as significant no matter what, then concatenates the significant octets as integers and incrementally adds a key for each value between the mininum and maximum. As you can probably tell, there is much room for improvement.

from socket import inet_ntoa
from struct import pack

data = ((0,             16777215,   'ZZ'),
        (1000013824,    1000079359, 'CN'),
        (1000079360,    1000210431, 'JP'),
        (1000210432,    1000341503, 'JP'),
        (1000341504,    1000603647, 'IN'))

def makedict(minip, maxip, country):
    d = {}
    for i in xrange(key(minip), key(maxip)+1):
        d[i] = country
    return d

def key(ip):
    octets = inet_ntoa(pack('>L', ip)).split('.')
    return int("".join(octets[0:2]));

lookup = {}
for lo, hi, cnt in data:
    lookup.update(makedict(lo, hi, cnt))

print lookup[key(999)]          # ZZ
print lookup[key(1000215555)]   # JP
like image 27
Matt Eckert Avatar answered Oct 23 '22 20:10

Matt Eckert