I have a dataframe with columns A, B, C. I have a list of tuples like [(x1, y1), (x2, y2), ...]. I would like to delete all rows that meet the following condition:
(B == x1 & C == y1) | (B == x2 & C == y2) | ...
How can I do that in pandas? I wanted to use the isin function, but I'm not sure it is possible since my list contains tuples. I could do something like this:
for x, y in tuples:
    df = df.drop(df[(df.B == x) & (df.C == y)].index)
Maybe there is an easier way.
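For reference, the loop approach does work once `&` with parentheses replaces `&&` (Python/pandas have no `&&` operator) - a minimal sketch on made-up data:

```python
import pandas as pd

# Illustrative data; column names follow the question
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 0, 3], 'C': [4, 3, 4]})
tuples = [(3, 4)]

for x, y in tuples:
    df = df.drop(df[(df.B == x) & (df.C == y)].index)

print(df)  # the row with B=3, C=4 is gone
```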
Use pandas indexing
df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()
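A quick check of this one-liner on made-up data (the values are illustrative; `errors='ignore'` lets tuples with no match pass silently):

```python
import pandas as pd

# Illustrative data; column names follow the question
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [4, 0, 3, 8],
                   'C': [4, 3, 4, 3]})
tuples = [(3, 4), (9, 9)]  # (9, 9) matches nothing; errors='ignore' skips it

# Make (B, C) the index, drop the tuple labels, restore the columns
out = df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()
print(out)  # the row with B=3, C=4 is removed
```

Note that `reset_index()` moves B and C to the front; reorder with `out[['A', 'B', 'C']]` if the original column order matters.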
def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B', 'C']].values
    shp = np.maximum(BC_arr.max(0) + 1, idx.max(0) + 1)
    BC_IDs = np.ravel_multi_index(BC_arr.T, shp)
    idx_IDs = np.ravel_multi_index(idx.T, shp)
    return df[~np.in1d(BC_IDs, idx_IDs)]

def divakar(df, tuples):
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]

def pirsquared(df, tuples):
    # errors='ignore' is needed here: a random tuple may have no matching row
    return df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()
10 rows, 1 tuple
np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (1, 2))]
10,000 rows, 500 tuples
np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10000, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (500, 2))]
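Assuming the usual imports (`numpy as np`, `pandas as pd`), the harness below runs two of the candidates on the larger setting and checks that they keep the same rows - a sanity-check sketch, not the original timing code:

```python
import numpy as np
import pandas as pd

def broadcasting_based(df, tuples):
    # Compare every (B, C) row against every tuple via broadcasting
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]

def linear_indexing_based(df, tuples):
    # Collapse each (B, C) pair into one scalar ID, then match with in1d
    idx = np.array(tuples)
    BC_arr = df[['B', 'C']].values
    shp = np.maximum(BC_arr.max(0) + 1, idx.max(0) + 1)
    BC_IDs = np.ravel_multi_index(BC_arr.T, shp)
    idx_IDs = np.ravel_multi_index(idx.T, shp)
    return df[~np.in1d(BC_IDs, idx_IDs)]

# Same setup as the 10,000-row / 500-tuple benchmark
np.random.seed([3, 1415])
df = pd.DataFrame(np.random.choice(range(10), (10000, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (500, 2))]

a = broadcasting_based(df, tuples)
b = linear_indexing_based(df, tuples)
print(a.equals(b))  # both approaches keep the same rows
```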
Approach #1
Here's a vectorized approach using NumPy's broadcasting -
def broadcasting_based(df, tuples):
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]
Sample run -
In [224]: df
Out[224]:
A B C
0 6 4 4
1 2 0 3
2 8 3 4
3 7 8 3
4 6 7 8
5 3 3 2
6 5 4 2
7 2 4 7
8 6 1 6
9 1 1 1
In [225]: tuples = [(3,4),(7,8),(1,6)]
In [226]: broadcasting_based(df,tuples)
Out[226]:
A B C
0 6 4 4
1 2 0 3
3 7 8 3
5 3 3 2
6 5 4 2
7 2 4 7
9 1 1 1
Approach #2 : To cover a generic number of columns
For a case like this, one could collapse the information from different columns into a single entry that represents the uniqueness across all columns. This can be achieved by treating each row as an indexing tuple, so that each row reduces to one scalar ID. Similarly, each tuple from the list to be matched reduces to one scalar. Finally, we use np.in1d
to look for correspondences, get the valid mask, and remove the matching rows from the dataframe. Thus, the implementation would be -
def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B', 'C']].values
    shp = np.maximum(BC_arr.max(0) + 1, idx.max(0) + 1)
    BC_IDs = np.ravel_multi_index(BC_arr.T, shp)
    idx_IDs = np.ravel_multi_index(idx.T, shp)
    return df[~np.in1d(BC_IDs, idx_IDs)]
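As a sanity check, running this on the same sample dataframe as Approach #1 (rebuilt here literally from the sample run above) removes the same three rows:

```python
import numpy as np
import pandas as pd

def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B', 'C']].values
    # Grid shape large enough to hold both the data and the query tuples
    shp = np.maximum(BC_arr.max(0) + 1, idx.max(0) + 1)
    BC_IDs = np.ravel_multi_index(BC_arr.T, shp)
    idx_IDs = np.ravel_multi_index(idx.T, shp)
    return df[~np.in1d(BC_IDs, idx_IDs)]

# The dataframe from the sample run in Approach #1
df = pd.DataFrame({'A': [6, 2, 8, 7, 6, 3, 5, 2, 6, 1],
                   'B': [4, 0, 3, 8, 7, 3, 4, 4, 1, 1],
                   'C': [4, 3, 4, 3, 8, 2, 2, 7, 6, 1]})
out = linear_indexing_based(df, [(3, 4), (7, 8), (1, 6)])
print(out.index.tolist())  # rows 2, 4 and 8 are removed
```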