Pandas DataFrame check if column value exists in a group of columns

Tags:

I have a DataFrame like this (simplified example)

id  v0  v1  v2  v3  v4
1   10  5   10  22  50
2   22  23  55  60  50
3   8   2   40  80  110
4   15  15  25  100 101

And would like to create an additional column that is either 1 or 0. 1 if v0 value is in the values of v1 to v4, and 0 if it's not. So, in this example for id 1 then the value should be 1 (since v2 = 10) and for id 2 value should be 0 since 22 is not in v1 thru v4.

In reality the table is way bigger (around 100,000 rows and variables go from v1 to v99).

351

asked Sep 18 '18 19:09

EGM8686

1 Answers

You can use the underlying numpy arrays for performance:

Setup

a = df.v0.values
b = df.iloc[:, 2:].values

df.assign(out=(a[:, None]==b).any(1).astype(int))

   id  v0  v1  v2   v3   v4  out
0   1  10   5  10   22   50    1
1   2  22  23  55   60   50    0
2   3   8   2  40   80  110    0
3   4  15  15  25  100  101    1

This solution leverages broadcasting to allow for pairwise comparison:

First, we broadcast a:

>>> a[:, None]
array([[10],
       [22],
       [ 8],
       [15]], dtype=int64)

Which allows for pairwise comparison with b:

>>> a[:, None] == b
array([[False,  True, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False, False]])

We then simply check for any True results along the first axis, and convert to integer.

Performance

Functions

def user_chris(df):
    a = df.v0.values
    b = df.iloc[:, 2:].values
    return (a[:, None]==b).any(1).astype(int)

def rahlf23(df):
    df = df.set_index('id')
    return df.drop('v0', 1).isin(df['v0']).any(1).astype(int)

def chris_a(df):
    return df.loc[:, "v1":].eq(df['v0'], 0).any(1).astype(int)

def chris(df):
    return df.apply(lambda x: int(x['v0'] in x.values[2:]), axis=1)

def anton_vbr(df):
    df.set_index('id', inplace=True)
    return df.isin(df.pop('v0')).any(1).astype(int)

Setup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from timeit import timeit

res = pd.DataFrame(
       index=['user_chris', 'rahlf23', 'chris_a', 'chris', 'anton_vbr'],
       columns=[10, 50, 100, 500, 1000, 5000],
       dtype=float
)

for f in res.index:
    for c in res.columns:
        vals = np.random.randint(1, 100, (c, c))
        vals = np.column_stack((np.arange(vals.shape[0]), vals))
        df = pd.DataFrame(vals, columns=['id'] + [f'v{i}' for i in range(0, vals.shape[0])])
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");

plt.show()

Output

enter image description here

103

answered Oct 21 '22 20:10

user3483203

Related questions
                            
                                How can I detect whether I'm running in a Celery worker?
                            
                                Broadcast an operation along specific axis in python
                            
                                Why does incorrect assignment to a global variable raise exception early?
                            
                                Python Mock Patch multiple methods in a class
                            
                                Django - using multiple foreign key to the same model
                            
                                Heatmap on top of image
                            
                                How to join my Telegram Bot to PUBLIC channel
                            
                                Python in terminal: how to signify end of for loop?
                            
                                Confusion between Python and Anaconda
                            
                                Interval datatype in Pandas - find midpoint, left, center etc
                            
                                How can I set the same type as class in method's parameter following PEP484? [duplicate]
                            
                                Batch request with Google Cloud Storage python client
                            
                                Error OPTIONS net::ERR_CONNECTION_REFUSED
                            
                                Tensorflow Object detection API: Print detected class as output to terminal
                            
                                How to use Pagination in a Non-Generic View/Viewset?
                            
                                AttributeError: module 'mysql' has no attribute 'connector'
                            
                                Should I use a Python deque or list as a stack? [duplicate]
                            
                                Suppress Scientific Format in a Dataframe Column
                            
                                How can my Model primary key start with a specific number?
                            
                                Why is a.dot(b) faster than a@b although Numpy recommends a@b

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Pandas DataFrame check if column value exists in a group of columns

Tags:

python

pandas

dataframe

numpy

EGM8686

People also ask

1 Answers

user3483203

Recent Activity

Donate For Us