I'm looking up country by ip range for tens of millions rows. I'm looking for a faster way to do the lookup.
I have 180K tuples in this form:
>>> data = ((0, 16777215, 'ZZ'),
... (1000013824, 1000079359, 'CN'),
... (1000079360, 1000210431, 'JP'),
... (1000210432, 1000341503, 'JP'),
... (1000341504, 1000603647, 'IN'))
(The integers are ip addresses converted into plain numbers.)
This does the job right, but just takes too long:
>>> ip_to_lookup = 999
>>> country_result = [country
... for (from, to, country) in data
... if (ip_to_lookup >= from) and
... (ip_to_lookup <= to)][0]
>>> print country_result
ZZ
Can anyone point me in the right direction to a faster way of doing this lookup? Using the method above, 100 lookups take 3 seconds. Meaning, I think, 10M rows will take several days.
Supposing the list may be long and the numbers may repeat, consider using the SortedList type from the Python sortedcontainers module. The SortedList type will automatically maintain the tuples in order by number and allow for fast searching.
Find the index of an element in tuple using index() Tuple provides a member function index() i.e. It returns the index for first occurrence of x in the tuple. Also, if element is not found in tuple then it will throw an exception ValueError.
Creating a tuple is faster than creating a list. Creating a list is slower because two memory blocks need to be accessed. An element in a tuple cannot be removed or replaced. An element in a list can be removed or replaced.
In the majority of programming languages when you need to access a nested data type (such as arrays, lists, or tuples), you append the brackets to get to the innermost item. The first bracket gives you the location of the tuple in your list. The second bracket gives you the location of the item in the tuple.
You can use the bisect
module to perform a binary search after you sorted the dataset:
from operator import itemgetter
import bisect
data = ((0, 16777215, 'ZZ'), (1000013824, 1000079359, 'CN'), (1000079360, 1000210431, 'JP'), (1000210432, 1000341503, 'JP'), (1000341504, 1000603647, 'IN'))
sorted_data = sorted(data, key=itemgetter(0))
lower_bounds = [lower for lower,_,_ in data]
def lookup_ip(ip):
index = bisect.bisect(lower_bounds, ip) - 1
if index < 0:
return None
_, upper, country = sorted_data[index]
return country if ip <= upper else None
print(lookup_ip(-1)) # => None
print(lookup_ip(999)) # => 'ZZ'
print(lookup_ip(16777216)) # => None
print(lookup_ip(1000013824)) # => 'CN'
print(lookup_ip(1000400000)) # => 'IN'
The algorithmic complexity of the lookup is O(log n)
here, instead of O(n)
for a complete list walk.
Assuming your situation meets some requirements, there is a way to get the runtime complexity to O(1)
on average, but space complexity suffers.
Below is a very naive implementation. It selects the first two octets of the IP as significant no matter what, then concatenates the significant octets as integers and incrementally adds a key for each value between the mininum and maximum. As you can probably tell, there is much room for improvement.
from socket import inet_ntoa
from struct import pack
data = ((0, 16777215, 'ZZ'),
(1000013824, 1000079359, 'CN'),
(1000079360, 1000210431, 'JP'),
(1000210432, 1000341503, 'JP'),
(1000341504, 1000603647, 'IN'))
def makedict(minip, maxip, country):
d = {}
for i in xrange(key(minip), key(maxip)+1):
d[i] = country
return d
def key(ip):
octets = inet_ntoa(pack('>L', ip)).split('.')
return int("".join(octets[0:2]));
lookup = {}
for lo, hi, cnt in data:
lookup.update(makedict(lo, hi, cnt))
print lookup[key(999)] # ZZ
print lookup[key(1000215555)] # JP
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With