
Keep rows in data frame that, for all combinations of the values of certain columns, contain the same elements in another column

import pandas as pd

df = pd.DataFrame({
    'a': ['x','x','x','x','x','y','y','y','y','y'],
    'b': ['z','z','z','w','w','z','z','w','w','w'],
    'c': ['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],
    'd': range(1, 11),
})

   a  b   c   d
0  x  z  c1   1
1  x  z  c2   2
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
8  y  w  c2   9
9  y  w  c3  10

How can I keep only the rows whose value in c occurs in every combination of a and b? In other words, how do I exclude rows whose c value is present in only some combinations of a and b?

For example, only c1 and c3 are present in all combinations of a and b ([x,z],[x,w],[y,z],[y,w]), so the output would be:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

HappyPy asked Jan 13 '21 18:01



2 Answers

Here is one way: collect the unique values of c for each (a, b) group, then find the elements common to all of the resulting arrays using reduce and np.intersect1d. Finally, filter the dataframe using Series.isin and boolean indexing.

from functools import reduce
import numpy as np

out = df[df['c'].isin(reduce(np.intersect1d, df.groupby(['a','b'])['c'].unique()))]

Breakdown:

s = df.groupby(['a','b'])['c'].unique()  # unique c values per (a, b) group
common_elements = reduce(np.intersect1d, s)
# common_elements -> array(['c1', 'c3'], dtype=object)

out = df[df['c'].isin(common_elements)]  # add .copy() if you plan to modify out

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
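
As a variation on the same idea (not part of the original answer), the intersection can also be taken with plain Python sets instead of reduce and np.intersect1d. This is only a sketch: the names sets and common are chosen here for illustration, and it assumes the values in c are hashable.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x','x','x','x','x','y','y','y','y','y'],
    'b': ['z','z','z','w','w','z','z','w','w','w'],
    'c': ['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],
    'd': range(1, 11),
})

# One set of c values per (a, b) combination (illustrative name: sets)
sets = df.groupby(['a', 'b'])['c'].agg(set)

# Values of c that appear in every combination (illustrative name: common)
common = set.intersection(*sets)

# Keep only the rows whose c value is in the common set
out = df[df['c'].isin(common)]
print(out)
```

Unlike np.intersect1d, the set version does not return the common values in sorted order, which does not matter here because they are only used for membership testing in isin.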

anky answered Oct 09 '22 22:10


Let's try groupby with nunique to count the unique (a, b) combinations in which each value of c appears, then keep the rows whose c value appears in every combination:

s = df['a'] + ',' + df['b']  # one string key per (a, b) combination
# True where the row's c value occurs in as many combinations as exist overall
m = s.groupby(df['c']).transform('nunique').eq(s.nunique())

df[m]

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
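
The string key built with df['a'] + ',' + df['b'] assumes both columns hold strings and that the values contain no commas of their own. A possible sketch that avoids both assumptions labels each (a, b) combination with an integer via groupby(...).ngroup(); the variable names key and mask are illustrative and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x','x','x','x','x','y','y','y','y','y'],
    'b': ['z','z','z','w','w','z','z','w','w','w'],
    'c': ['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],
    'd': range(1, 11),
})

# Integer label for each (a, b) combination; works for any column dtype
key = df.groupby(['a', 'b']).ngroup()

# For each row, count the distinct combinations its c value appears in,
# and compare that with the total number of combinations
mask = key.groupby(df['c']).transform('nunique').eq(key.nunique())

out = df[mask]
print(out)
```

The logic is otherwise the same as above: a value of c is kept only if it is seen in as many distinct combinations as exist in the whole dataframe.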

Shubham Sharma answered Oct 09 '22 20:10