Python: what is a fast way to find an index in a pandas DataFrame?

I have a DataFrame like the following:

df = 
    a   ID1         ID2         Proximity
0   0   900000498   NaN         0.000000
1   1   900000498   900004585   3.900000
2   2   900000498   900005562   3.900000
3   3   900000498   900008613   0.000000
4   4   900000498   900012333   0.000000
5   5   900000498   900019524   3.900000
6   6   900000498   900019877   0.000000
7   7   900000498   900020141   3.900000
8   8   900000498   900022133   3.900000
9   9   900000498   900022919   0.000000

For a given ID1-ID2 pair, I want to find the corresponding Proximity value. For instance, given the input [900000498, 900022133], I want 3.900000 as the output.

emax asked Jan 30 '16 22:01


1 Answer

If this is a common operation, I'd set the index to those two columns. You can then do the lookup with loc by passing a tuple of the column values:

In [60]:
df1 = df.set_index(['ID1','ID2'])

In [61]:
%timeit df1.loc[(900000498,900022133), 'Proximity']
%timeit df.loc[(df['ID1']==900000498)&(df['ID2']==900022133), 'Proximity']
1000 loops, best of 3: 565 µs per loop
100 loops, best of 3: 1.69 ms per loop

Once the columns form the index, the lookup is about 3x faster than the boolean-mask filter (565 µs versus 1.69 ms on this frame).
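For anyone who wants to reproduce the lookup outside an IPython session, here is a minimal self-contained sketch. It rebuilds the sample frame from the question (the helper column `a` is left out, which is my simplification) and assumes each (ID1, ID2) pair occurs only once, so the lookup yields a single value. Building the index has a one-time cost, so this only pays off when the lookup is repeated.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; the column "a" is omitted here.
# ID2 contains a NaN, so pandas stores that column as float64.
df = pd.DataFrame({
    "ID1": [900000498] * 10,
    "ID2": [np.nan, 900004585, 900005562, 900008613, 900012333,
            900019524, 900019877, 900020141, 900022133, 900022919],
    "Proximity": [0.0, 3.9, 3.9, 0.0, 0.0, 3.9, 0.0, 3.9, 3.9, 0.0],
})

# Build the two-level index once and reuse df1 for every lookup.
df1 = df.set_index(["ID1", "ID2"])

# Pass the (ID1, ID2) pair as a tuple, plus the column to read.
value = df1.loc[(900000498, 900022133), "Proximity"]
print(value)
```

If a pair is missing from the index, `loc` raises a `KeyError`, so wrap the lookup in `try`/`except KeyError` when absent pairs are possible.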

The values are the same, but the index lookup returns a scalar while the boolean filter returns a one-element Series:

In [63]:
print(df1.loc[(900000498,900022133), 'Proximity'])
print(df.loc[(df['ID1']==900000498)&(df['ID2']==900022133), 'Proximity'])

3.9
8    3.9
Name: Proximity, dtype: float64
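If you stay with the boolean filter but still want a plain number rather than a Series, one option is to take the first match explicitly. This is a sketch on a two-row stand-in frame (my own reduction of the question's data), and the names `mask`, `matches` and `proximity` are only illustrative:

```python
import pandas as pd

# Two-row stand-in for the frame in the question.
df = pd.DataFrame({
    "ID1": [900000498, 900000498],
    "ID2": [900022133, 900022919],
    "Proximity": [3.9, 0.0],
})

# Boolean mask selecting the rows where both IDs match.
mask = (df["ID1"] == 900000498) & (df["ID2"] == 900022133)
matches = df.loc[mask, "Proximity"]

# Take the first match, or None when no row matched.
proximity = matches.iloc[0] if not matches.empty else None
print(proximity)
```

When exactly one match is expected, `matches.item()` is a stricter alternative: it raises a `ValueError` if the Series does not hold exactly one element.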
EdChum answered Nov 03 '22 19:11