Pandas: Delete rows based on multiple columns values

Tags: python, pandas

I have a dataframe with columns A, B, C and a list of tuples like [(x1,y1), (x2,y2), ...]. I would like to delete all rows that meet the following condition: (B == x1 and C == y1) or (B == x2 and C == y2) or ... How can I do that in pandas? I wanted to use the isin function, but I am not sure it is possible since my list contains tuples. I could do something like this:

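for x, y in tuples:
    # Python has no && operator; combine boolean Series with & and parenthesize each comparison
    df = df.drop(df[(df.B == x) & (df.C == y)].index)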

Maybe there is an easier way.

user4979733 asked Jul 22 '16 22:07


2 Answers

Use pandas indexing

df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()
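
Since the question asks about isin with tuples, here is a minimal sketch of a related variant, not part of the original answer: build a MultiIndex from columns B and C and test membership with MultiIndex.isin. The sample data is made up for illustration. Unlike the set_index/reset_index one-liner, this keeps the original row labels and column order.

```python
import pandas as pd

# Illustrative data (an assumption, not from the original post)
df = pd.DataFrame({"A": [6, 2, 8, 7, 6],
                   "B": [4, 0, 3, 8, 7],
                   "C": [4, 3, 4, 3, 8]})
tuples = [(3, 4), (7, 8), (1, 6)]

# Pair up B and C as a MultiIndex, then check each pair against the tuples.
# Tuples that match no row, such as (1, 6), are simply ignored.
mask = pd.MultiIndex.from_frame(df[["B", "C"]]).isin(tuples)

# Keep only the rows whose (B, C) pair is not in the list.
result = df[~mask]
print(result)
```

The mask is a plain boolean array aligned with the rows of df, so it can be combined with other conditions using & and | before filtering.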

Timing

import numpy as np
import pandas as pd

def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B','C']].values
    shp = np.maximum(BC_arr.max(0)+1,idx.max(0)+1)
    BC_IDs = np.ravel_multi_index(BC_arr.T,shp)
    idx_IDs = np.ravel_multi_index(idx.T,shp)
    return df[~np.in1d(BC_IDs,idx_IDs)]

def divakar(df, tuples):
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]

def pirsquared(df, tuples):
    # errors='ignore' skips tuples that match no row instead of raising KeyError
    return df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()

10 rows, 1 tuple

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (1, 2))]

[Timing results image not available]

10,000 rows, 500 tuples

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10000, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (500, 2))]

[Timing results image not available]
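
For anyone who wants to rerun a comparison locally, here is a rough sketch using the standard timeit module. It is an assumption about how one might measure this, not the harness used for the original timings: the random data is generated with a different generator and seed, and multiindex_isin is a hypothetical helper added for comparison.

```python
import timeit

import numpy as np
import pandas as pd


def broadcasting_based(df, tuples):
    # Compare every row against every tuple via broadcasting
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]


def multiindex_isin(df, tuples):
    # Hypothetical helper: membership test on a MultiIndex built from B and C
    return df[~pd.MultiIndex.from_frame(df[["B", "C"]]).isin(tuples)]


# Data of the same shape as the larger test case (different seed and generator)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(10000, 3)), columns=list("ABC"))
tuples = [tuple(row) for row in rng.integers(0, 10, size=(500, 2)).tolist()]

for func in (broadcasting_based, multiindex_isin):
    seconds = timeit.timeit(lambda: func(df, tuples), number=20)
    print(f"{func.__name__}: {seconds / 20 * 1e3:.3f} ms per call")
```

Absolute numbers depend on the machine and library versions, so only the relative ordering on your own data is meaningful.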

piRSquared answered Oct 24 '22 06:10


Approach #1

Here's a vectorized approach using NumPy's broadcasting -

def broadcasting_based(df, tuples):
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]

Sample run -

In [224]: df
Out[224]: 
   A  B  C
0  6  4  4
1  2  0  3
2  8  3  4
3  7  8  3
4  6  7  8
5  3  3  2
6  5  4  2
7  2  4  7
8  6  1  6
9  1  1  1

In [225]: tuples = [(3,4),(7,8),(1,6)]

In [226]: broadcasting_based(df,tuples)
Out[226]: 
   A  B  C
0  6  4  4
1  2  0  3
3  7  8  3
5  3  3  2
6  5  4  2
7  2  4  7
9  1  1  1
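
To make the broadcasting step easier to follow, here is the same computation split into pieces on the sample data above, with the intermediate shapes noted in comments. The variable names b_match, c_match and drop_row are only for illustration -

```python
import numpy as np
import pandas as pd

# Same data as the sample run
df = pd.DataFrame({"A": [6, 2, 8, 7, 6, 3, 5, 2, 6, 1],
                   "B": [4, 0, 3, 8, 7, 3, 4, 4, 1, 1],
                   "C": [4, 3, 4, 3, 8, 2, 2, 7, 6, 1]})
tuples = [(3, 4), (7, 8), (1, 6)]

idx = np.array(tuples)                     # shape (3, 2): one row per tuple

# idx[:, None, 0] has shape (3, 1); comparing it with the (10,) column
# broadcasts to (3, 10): entry [i, j] says whether row j matches tuple i.
b_match = df.B.values == idx[:, None, 0]
c_match = df.C.values == idx[:, None, 1]
mask = b_match & c_match                   # both B and C must match

# A row is dropped if it matches any tuple, hence the reduction over axis 0.
drop_row = mask.any(axis=0)                # shape (10,)
print(df[~drop_row])
```

Note that the intermediate mask holds len(tuples) x len(df) booleans, so memory use grows with the product of the two sizes.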

Approach #2 : To cover a generic number of columns

To handle an arbitrary number of columns, one can collapse the values in each row into a single scalar that uniquely identifies that combination of values. Each row is treated as an indexing tuple into a grid large enough to hold every value, and np.ravel_multi_index converts that tuple into one linear index, so each row becomes one entry. Each tuple in the list to be matched is reduced to a scalar in the same way. Finally, np.in1d finds which row IDs appear among the tuple IDs, and the inverted mask selects the rows to keep. Note that this requires the values to be non-negative integers. The implementation would be -

def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B','C']].values
    shp = np.maximum(BC_arr.max(0)+1,idx.max(0)+1)
    BC_IDs = np.ravel_multi_index(BC_arr.T,shp)
    idx_IDs = np.ravel_multi_index(idx.T,shp)
    return df[~np.in1d(BC_IDs,idx_IDs)]
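
The function above still hardcodes columns B and C. A minimal sketch of a version that takes the column names as a parameter might look like this; drop_matching_rows and its cols argument are hypothetical names, and it uses np.isin, which newer NumPy versions recommend over np.in1d. It carries the same assumption that all values are non-negative integers -

```python
import numpy as np
import pandas as pd


def drop_matching_rows(df, tuples, cols):
    # Hypothetical generalization; assumes non-negative integer values
    idx = np.asarray(tuples)
    arr = df[list(cols)].to_numpy()
    # Grid shape large enough to hold every value from either source
    shp = tuple(np.maximum(arr.max(0) + 1, idx.max(0) + 1))
    # Collapse each row / tuple into a single linear index
    row_ids = np.ravel_multi_index(tuple(arr.T), shp)
    idx_ids = np.ravel_multi_index(tuple(idx.T), shp)
    return df[~np.isin(row_ids, idx_ids)]


df = pd.DataFrame({"A": [6, 2, 8, 7, 6, 3, 5, 2, 6, 1],
                   "B": [4, 0, 3, 8, 7, 3, 4, 4, 1, 1],
                   "C": [4, 3, 4, 3, 8, 2, 2, 7, 6, 1]})

two_col = drop_matching_rows(df, [(3, 4), (7, 8), (1, 6)], ["B", "C"])
three_col = drop_matching_rows(df, [(8, 3, 4)], ["A", "B", "C"])
print(two_col)
print(three_col)
```

Because the linear index grows with the product of the per-column ranges, very wide value ranges or many columns could overflow the integer type, so this is best suited to small integer codes.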

Divakar answered Oct 24 '22 04:10