
Check if a pair of values is in a pair of columns in pandas

Basically, I have latitude and longitude (on a grid) in two different columns. I am getting fed two-element lists (could be numpy arrays) of a new coordinate set and I want to check if it is a duplicate before I add it.

For example, my data:

import pandas as pd

df = pd.DataFrame([[4,8, 'wolf', 'Predator', 10],
              [5,6,'cow', 'Prey', 10],
              [8, 2, 'rabbit', 'Prey', 10],
              [5, 3, 'rabbit', 'Prey', 10],
              [3, 2, 'cow', 'Prey', 10],
              [7, 5, 'rabbit', 'Prey', 10]],
              columns = ['lat', 'long', 'name', 'kingdom', 'energy'])

newcoords1 = [4,4]
newcoords2 = [7,5]

Is it possible to write one if statement that tells me whether there is already a row with that latitude and longitude? In pseudocode:

if newcoords1 in df['lat', 'long']:
    print('yes! ' + str(newcoords1))

(In the example, newcoords1 should be false and newcoords2 should be true.)

Sidenote: (newcoords1[0] in df['lat']) & (newcoords1[1] in df['long']) doesn't work because that checks them independently, but I need to know if that combination appears in a single row.

Thank you in advance!

seth127 asked Aug 23 '16 19:08



3 Answers

You can do it this way, using DataFrame.query (the @ prefix refers to a local Python variable):

In [140]: df.query('@newcoords2[0] == lat and @newcoords2[1] == long')
Out[140]:
   lat  long    name kingdom  energy
5    7     5  rabbit    Prey      10

In [146]: df.query('@newcoords2[0] == lat and @newcoords2[1] == long').empty
Out[146]: False

The following line returns the number of matching rows:

In [147]: df.query('@newcoords2[0] == lat and @newcoords2[1] == long').shape[0]
Out[147]: 1

Or, using a NumPy approach:

In [103]: df[(df[['lat','long']].values == newcoords2).all(axis=1)]
Out[103]:
   lat  long    name kingdom  energy
5    7     5  rabbit    Prey      10

This shows whether at least one matching row was found:

In [113]: (df[['lat','long']].values == newcoords2).all(axis=1).any()
Out[113]: True

In [114]: (df[['lat','long']].values == newcoords1).all(axis=1).any()
Out[114]: False

Explanation:

In [104]: df[['lat','long']].values == newcoords2
Out[104]:
array([[False, False],
       [False, False],
       [False, False],
       [False, False],
       [False, False],
       [ True,  True]], dtype=bool)

In [105]: (df[['lat','long']].values == newcoords2).all(axis=1)
Out[105]: array([False, False, False, False, False,  True], dtype=bool)
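
Putting the pieces together, here is a minimal sketch of the single if statement asked for in the question. The helper name has_coords is hypothetical (it is not part of pandas); it only wraps the comparison explained above.

```python
import pandas as pd

df = pd.DataFrame([[4, 8, 'wolf', 'Predator', 10],
                   [5, 6, 'cow', 'Prey', 10],
                   [8, 2, 'rabbit', 'Prey', 10],
                   [5, 3, 'rabbit', 'Prey', 10],
                   [3, 2, 'cow', 'Prey', 10],
                   [7, 5, 'rabbit', 'Prey', 10]],
                  columns=['lat', 'long', 'name', 'kingdom', 'energy'])

def has_coords(frame, coords):
    """Hypothetical helper: True if some row has lat == coords[0] and long == coords[1]."""
    # The (n, 2) array is compared element-wise against the 2-element pair
    matches = frame[['lat', 'long']].values == coords
    # A row counts only if both columns match; any() asks for at least one such row
    return bool(matches.all(axis=1).any())

newcoords1 = [4, 4]
newcoords2 = [7, 5]

for coords in (newcoords1, newcoords2):
    if has_coords(df, coords):
        print('yes! ' + str(coords))
```

Wrapping the result in bool() gives a plain Python bool rather than a NumPy one, which reads more naturally in an if statement.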

MaxU - stop WAR against UA answered Oct 20 '22 00:10


For people like me who came here searching for how to check whether several pairs of values are in a pair of columns within a big DataFrame, here is an answer.

Suppose you have a list newscoord = [newscoord1, newscoord2, ...] and you want to extract the rows of df matching the elements of this list. Then, for the example above:

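# Join each pair with a separator so that e.g. (1, 23) and (12, 3) do not both become '123'
v = pd.Series( [ str(i) + ',' + str(j) for i,j in df[['lat', 'long']].values ] )
w = [ str(i) + ',' + str(j) for i,j in newscoord ]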

df[ v.isin(w) ]

This gives the same output as @MaxU's answer, but it lets you extract several rows at once.
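
As an alternative sketch (a swapped-in technique, not the string-key method of this answer), the same multi-pair lookup can be done without building strings by treating the two columns as a MultiIndex and calling its isin method with a list of tuples. Concatenated string keys can collide when no separator is used (for instance (1, 23) and (12, 3) both give '123'), which tuples avoid. The contents of newscoord below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[4, 8, 'wolf', 'Predator', 10],
                   [5, 6, 'cow', 'Prey', 10],
                   [8, 2, 'rabbit', 'Prey', 10],
                   [5, 3, 'rabbit', 'Prey', 10],
                   [3, 2, 'cow', 'Prey', 10],
                   [7, 5, 'rabbit', 'Prey', 10]],
                  columns=['lat', 'long', 'name', 'kingdom', 'energy'])

# Illustrative list of pairs to look up (assumed values, not from the question)
newscoord = [[4, 4], [7, 5], [3, 2]]

# Build a MultiIndex of (lat, long) tuples, one per row of df
pairs = pd.MultiIndex.from_frame(df[['lat', 'long']])

# isin compares whole tuples, so each row matches only as a combination
mask = pairs.isin([tuple(p) for p in newscoord])

matches = df[mask]
print(matches)
```

Because mask is a plain boolean array with one entry per row, this also works when df does not have the default 0..n-1 index.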

On my computer, for a df with 10,000 rows, it takes 0.04s to run.

Of course, if your elements are already strings, it is simpler to use join instead of concatenation.

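Furthermore, if the order of elements in the pair does not matter, you have to sort each pair first:

# Sort within each pair (axis=1) so that (a, b) and (b, a) produce the same key
v = pd.Series( [ str(i) + ',' + str(j) for i,j in np.sort( df[['lat','long']].values, axis=1 ) ] )
w = [ str(i) + ',' + str(j) for i,j in np.sort( newscoord, axis=1 ) ]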

Note that if v is not converted into a Series and one uses np.isin(v, w) instead, or if w is converted into a Series, the run time increases once newscoord reaches thousands of elements.

Hope it helps.

Michaël answered Oct 19 '22 22:10


x, y = newcoords1

>>> df[(df.lat == x) & (df.long == y)].empty
True  # Coordinates are not in the dataframe, so you can add it.

x, y = newcoords2

>>> df[(df.lat == x) & (df.long == y)].empty
False  # Coordinates already exist.
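
Since the goal in the question is to check for a duplicate before adding a row, here is a rough sketch of how this check might gate the insertion. The function add_if_new and the values passed to it are hypothetical, and using pd.concat to append is an assumption about how rows get added.

```python
import pandas as pd

df = pd.DataFrame([[4, 8, 'wolf', 'Predator', 10],
                   [7, 5, 'rabbit', 'Prey', 10]],
                  columns=['lat', 'long', 'name', 'kingdom', 'energy'])

def add_if_new(frame, coords, name, kingdom, energy):
    """Hypothetical helper: append a row unless its (lat, long) pair already exists."""
    x, y = coords
    # any() on the combined mask avoids building a filtered DataFrame
    if ((frame.lat == x) & (frame.long == y)).any():
        return frame  # duplicate coordinates: return the frame unchanged
    new_row = pd.DataFrame([[x, y, name, kingdom, energy]], columns=frame.columns)
    return pd.concat([frame, new_row], ignore_index=True)

df = add_if_new(df, [4, 4], 'cow', 'Prey', 10)       # new pair, row is appended
df = add_if_new(df, [7, 5], 'wolf', 'Predator', 10)  # existing pair, nothing added
print(df)
```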

Alexander answered Oct 20 '22 00:10