Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas DataFrame check if column value exists in a group of columns

I have a DataFrame like this (simplified example)

id  v0  v1  v2  v3  v4
1   10  5   10  22  50
2   22  23  55  60  50
3   8   2   40  80  110
4   15  15  25  100 101

And would like to create an additional column that is either 1 or 0. 1 if v0 value is in the values of v1 to v4, and 0 if it's not. So, in this example for id 1 then the value should be 1 (since v2 = 10) and for id 2 value should be 0 since 22 is not in v1 thru v4.

In reality the table is way bigger (around 100,000 rows and variables go from v1 to v99).

like image 351
EGM8686 Avatar asked Sep 18 '18 19:09

EGM8686


People also ask

How do you check if a column value is present in another column pandas?

You can check if a column contains/exists a particular value (string/int), list of multiple values in pandas DataFrame by using pd. series() , in operator, pandas. series. isin() , str.

How do I check if multiple columns exist in DataFrame Python?

In order to check if a list of multiple selected columns exist in pandas DataFrame, use set. issubset . For Example, if set(['Courses','Duration']). issubset(df.


1 Answers

You can use the underlying numpy arrays for performance:

Setup

a = df.v0.values
b = df.iloc[:, 2:].values

df.assign(out=(a[:, None]==b).any(1).astype(int))

   id  v0  v1  v2   v3   v4  out
0   1  10   5  10   22   50    1
1   2  22  23  55   60   50    0
2   3   8   2  40   80  110    0
3   4  15  15  25  100  101    1

This solution leverages broadcasting to allow for pairwise comparison:

First, we broadcast a:

>>> a[:, None]
array([[10],
       [22],
       [ 8],
       [15]], dtype=int64)

Which allows for pairwise comparison with b:

>>> a[:, None] == b
array([[False,  True, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False, False]])

We then simply check for any True results along the first axis, and convert to integer.


Performance


Functions

def user_chris(df):
    a = df.v0.values
    b = df.iloc[:, 2:].values
    return (a[:, None]==b).any(1).astype(int)

def rahlf23(df):
    df = df.set_index('id')
    return df.drop('v0', 1).isin(df['v0']).any(1).astype(int)

def chris_a(df):
    return df.loc[:, "v1":].eq(df['v0'], 0).any(1).astype(int)

def chris(df):
    return df.apply(lambda x: int(x['v0'] in x.values[2:]), axis=1)

def anton_vbr(df):
    df.set_index('id', inplace=True)
    return df.isin(df.pop('v0')).any(1).astype(int)

Setup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from timeit import timeit

res = pd.DataFrame(
       index=['user_chris', 'rahlf23', 'chris_a', 'chris', 'anton_vbr'],
       columns=[10, 50, 100, 500, 1000, 5000],
       dtype=float
)

for f in res.index:
    for c in res.columns:
        vals = np.random.randint(1, 100, (c, c))
        vals = np.column_stack((np.arange(vals.shape[0]), vals))
        df = pd.DataFrame(vals, columns=['id'] + [f'v{i}' for i in range(0, vals.shape[0])])
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");

plt.show()

Output

enter image description here

like image 103
user3483203 Avatar answered Oct 21 '22 20:10

user3483203