I have a dataframe with columns A, B, C. I have a list of tuples like [(x1, y1), (x2, y2), ...]. I would like to delete all rows that meet the following condition:
(B == x1 & C == y1) | (B == x2 & C == y2) | ...
How can I do that in pandas? I wanted to use the isin function, but I'm not sure it is possible since my list contains tuples. I could do something like this:
for x, y in tuples:
    df = df.drop(df[(df.B == x) & (df.C == y)].index)
Maybe there is an easier way.
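For reference, the loop approach does work once `&` with parentheses replaces `&&` (Python/pandas have no `&&` operator) - a minimal sketch on made-up data:

```python
import pandas as pd

# Illustrative data; column names follow the question
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 0, 3], 'C': [4, 3, 4]})
tuples = [(3, 4)]

for x, y in tuples:
    df = df.drop(df[(df.B == x) & (df.C == y)].index)

print(df)  # the row with B=3, C=4 is gone
```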
Use pandas indexing
df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()
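A quick check of this one-liner on made-up data (the values are illustrative; `errors='ignore'` lets tuples with no match pass silently):

```python
import pandas as pd

# Illustrative data; column names follow the question
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [4, 0, 3, 8],
                   'C': [4, 3, 4, 3]})
tuples = [(3, 4), (9, 9)]  # (9, 9) matches nothing; errors='ignore' skips it

# Make (B, C) the index, drop the tuple labels, restore the columns
out = df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()
print(out)  # the row with B=3, C=4 is removed
```

Note that `reset_index()` moves B and C to the front; reorder with `out[['A', 'B', 'C']]` if the original column order matters.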
def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B', 'C']].values
    shp = np.maximum(BC_arr.max(0) + 1, idx.max(0) + 1)
    BC_IDs = np.ravel_multi_index(BC_arr.T, shp)
    idx_IDs = np.ravel_multi_index(idx.T, shp)
    return df[~np.in1d(BC_IDs, idx_IDs)]

def divakar(df, tuples):
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]

def pirsquared(df, tuples):
    # errors='ignore' is needed here: a random tuple may have no matching row
    return df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()
10 rows, 1 tuple
np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (1, 2))]
10,000 rows, 500 tuples
np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10000, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (500, 2))]
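Assuming the usual imports (`numpy as np`, `pandas as pd`), the harness below runs two of the candidates on the larger setting and checks that they keep the same rows - a sanity-check sketch, not the original timing code:

```python
import numpy as np
import pandas as pd

def broadcasting_based(df, tuples):
    # Compare every (B, C) row against every tuple via broadcasting
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]

def linear_indexing_based(df, tuples):
    # Collapse each (B, C) pair into one scalar ID, then match with in1d
    idx = np.array(tuples)
    BC_arr = df[['B', 'C']].values
    shp = np.maximum(BC_arr.max(0) + 1, idx.max(0) + 1)
    BC_IDs = np.ravel_multi_index(BC_arr.T, shp)
    idx_IDs = np.ravel_multi_index(idx.T, shp)
    return df[~np.in1d(BC_IDs, idx_IDs)]

# Same setup as the 10,000-row / 500-tuple benchmark
np.random.seed([3, 1415])
df = pd.DataFrame(np.random.choice(range(10), (10000, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (500, 2))]

a = broadcasting_based(df, tuples)
b = linear_indexing_based(df, tuples)
print(a.equals(b))  # both approaches keep the same rows
```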
Approach #1
Here's a vectorized approach using NumPy's broadcasting -
def broadcasting_based(df, tuples):
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]
Sample run -
In [224]: df
Out[224]:
A B C
0 6 4 4
1 2 0 3
2 8 3 4
3 7 8 3
4 6 7 8
5 3 3 2
6 5 4 2
7 2 4 7
8 6 1 6
9 1 1 1
In [225]: tuples = [(3,4),(7,8),(1,6)]
In [226]: broadcasting_based(df,tuples)
Out[226]:
A B C
0 6 4 4
1 2 0 3
3 7 8 3
5 3 3 2
6 5 4 2
7 2 4 7
9 1 1 1
Approach #2 : To cover a generic number of columns
For a case like this, one could collapse the information from different columns into a single entry that represents the uniqueness across all columns. This can be achieved by treating each row as an indexing tuple, so that each row reduces to one scalar ID. Similarly, each tuple from the list to be matched reduces to one scalar. Finally, we use np.in1d
to look for correspondences, get the valid mask, and remove the matching rows from the dataframe. Thus, the implementation would be -
def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B', 'C']].values
    shp = np.maximum(BC_arr.max(0) + 1, idx.max(0) + 1)
    BC_IDs = np.ravel_multi_index(BC_arr.T, shp)
    idx_IDs = np.ravel_multi_index(idx.T, shp)
    return df[~np.in1d(BC_IDs, idx_IDs)]
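As a sanity check, running this on the same sample dataframe as Approach #1 (rebuilt here literally from the sample run above) removes the same three rows:

```python
import numpy as np
import pandas as pd

def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B', 'C']].values
    # Grid shape large enough to hold both the data and the query tuples
    shp = np.maximum(BC_arr.max(0) + 1, idx.max(0) + 1)
    BC_IDs = np.ravel_multi_index(BC_arr.T, shp)
    idx_IDs = np.ravel_multi_index(idx.T, shp)
    return df[~np.in1d(BC_IDs, idx_IDs)]

# The dataframe from the sample run in Approach #1
df = pd.DataFrame({'A': [6, 2, 8, 7, 6, 3, 5, 2, 6, 1],
                   'B': [4, 0, 3, 8, 7, 3, 4, 4, 1, 1],
                   'C': [4, 3, 4, 3, 8, 2, 2, 7, 6, 1]})
out = linear_indexing_based(df, [(3, 4), (7, 8), (1, 6)])
print(out.index.tolist())  # rows 2, 4 and 8 are removed
```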