Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: fastest way to resolve IP to country

I have a function find_country_from_connection_ip which takes an ip, and after some processing returns a country. Like below:

def find_country_from_connection_ip(ip):
    # Do some processing
    return county

I am using the function inside apply method. like below:

df['Country'] = df.apply(lambda x: find_country_from_ip(x['IP']), axis=1)

As it is pretty straightforward, what I want is to evaluate a new column from an existing column in the DataFrame which has >400000 rows.

It runs, but terribly slow and throws an exception like below:

...........: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

if name == 'main': In [38]:

I understand the problem, but can't quite figure out how to use loc with apply and lambda.

N.B. Please suggest if you have a more efficient alternative solution, which can bring the end result.

**** EDIT ********

The function is mainly a lookup on mmdb database like below:

def find_country_from_ip(ip):
    result = subprocess.Popen("mmdblookup --file GeoIP2-Country.mmdb --ip {} country names en".format(ip).split(" "), stdout=subprocess.PIPE).stdout.read()
    if result:
        return re.search(r'\"(.+?)\"', result).group(1) 
    else:
        final_output = subprocess.Popen("mmdblookup --file GeoIP2-Country.mmdb --ip {} registered_country names en".format(ip).split(" "), stdout=subprocess.PIPE).stdout.read()
        return re.search(r'\"(.+?)\"', final_output).group(1)

This is nevertheless a costly operation, and when you have a DataFrame with >400000 rows, it should take time. But how much? That is the question. It takes about 2 hours which is pretty much I think.

like image 405
Ahsanul Haque Avatar asked Oct 24 '16 05:10

Ahsanul Haque


1 Answers

I would use maxminddb-geolite2 (GeoLite) module for that.

First install maxminddb-geolite2 module

pip install maxminddb-geolite2

Python Code:

import pandas as pd
from geolite2 import geolite2

def get_country(ip):
    try:
        x = geo.get(ip)
    except ValueError:
        return pd.np.nan
    try:
        return x['country']['names']['en'] if x else pd.np.nan
    except KeyError:
        return pd.np.nan

geo = geolite2.reader()

# it took me quite some time to find a free and large enough list of IPs ;)
# IP's for testing: http://upd.emule-security.org/ipfilter.zip
x = pd.read_csv(r'D:\download\ipfilter.zip',
                usecols=[0], sep='\s*\-\s*',
                header=None, names=['ip'])

# get unique IPs
unique_ips = x['ip'].unique()
# make series out of it
unique_ips = pd.Series(unique_ips, index = unique_ips)
# map IP --> country
x['country'] = x['ip'].map(unique_ips.apply(get_country))

geolite2.close()

Output:

In [90]: x
Out[90]:
                     ip     country
0       000.000.000.000         NaN
1       001.002.004.000         NaN
2       001.002.008.000         NaN
3       001.009.096.105         NaN
4       001.009.102.251         NaN
5       001.009.106.186         NaN
6       001.016.000.000         NaN
7       001.055.241.140         NaN
8       001.093.021.147         NaN
9       001.179.136.040         NaN
10      001.179.138.224    Thailand
11      001.179.140.200    Thailand
12      001.179.146.052         NaN
13      001.179.147.002    Thailand
14      001.179.153.216    Thailand
15      001.179.164.124    Thailand
16      001.179.167.188    Thailand
17      001.186.188.000         NaN
18      001.202.096.052         NaN
19      001.204.179.141       China
20      002.051.000.165         NaN
21      002.056.000.000         NaN
22      002.095.041.202         NaN
23      002.135.237.106  Kazakhstan
24      002.135.237.250  Kazakhstan
...                 ...         ...

Timing: for 171.884 unique IPs:

In [85]: %timeit unique_ips.apply(get_country)
1 loop, best of 3: 14.8 s per loop

In [86]: unique_ips.shape
Out[86]: (171884,)

Conclusion: it would take approx. 35 seconds for you DF with 400K unique IPs on my hardware:

In [93]: 400000/171884*15
Out[93]: 34.90726303786274
like image 148
MaxU - stop WAR against UA Avatar answered Nov 14 '22 22:11

MaxU - stop WAR against UA