Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get rows that have the same value across its columns in pandas

In pandas, given a DataFrame D:

+-----+--------+--------+--------+   
|     |    1   |    2   |    3   |
+-----+--------+--------+--------+
|  0  | apple  | banana | banana |
|  1  | orange | orange | orange |
|  2  | banana | apple  | orange |
|  3  | NaN    | NaN    | NaN    |
|  4  | apple  | apple  | apple  |
+-----+--------+--------+--------+

How do I return rows that have the same contents across all of its columns when there are three columns or more such that it returns this:

+-----+--------+--------+--------+   
|     |    1   |    2   |    3   |
+-----+--------+--------+--------+
|  1  | orange | orange | orange |
|  4  | apple  | apple  | apple  |
+-----+--------+--------+--------+

Note that it skips rows when all values are NaN.

If this were only two columns, I usually do D[D[1]==D[2]] but I don't know how to generalize this for more than 2 column DataFrames.

like image 239
kentwait Avatar asked Jan 20 '14 10:01

kentwait


People also ask

How do I get identical rows in pandas?

The pandas. DataFrame. duplicated() method is used to find duplicate rows in a DataFrame. It returns a boolean series which identifies whether a row is duplicate or unique.

How do I check if two rows have the same value in pandas?

Pandas Series: equals() function The equals() function is used to test whether two Pandas objects contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.

How do I match column values in pandas?

To find the positions of two matching columns, we first initialize a pandas dataframe with two columns of city names. Then we use where() of numpy to compare the values of two columns. This returns an array that represents the indices where the two columns have the same value.


2 Answers

Similar to Andy Hayden answer with check if min equal to max (then row elements are all duplicates):

df[df.apply(lambda x: min(x) == max(x), 1)]
like image 193
lowtech Avatar answered Oct 13 '22 00:10

lowtech


My entry:

>>> df
        0       1       2
0   apple  banana  banana
1  orange  orange  orange
2  banana   apple  orange
3     NaN     NaN     NaN
4   apple   apple   apple

[5 rows x 3 columns]
>>> df[df.apply(pd.Series.nunique, axis=1) == 1]
        0       1       2
1  orange  orange  orange
4   apple   apple   apple

[2 rows x 3 columns]

This works because calling pd.Series.nunique on the rows gives:

>>> df.apply(pd.Series.nunique, axis=1)
0    2
1    1
2    3
3    0
4    1
dtype: int64

Note: this would, however, keep rows which look like [nan, nan, apple] or [nan, apple, apple]. Usually I want that, but that might be the wrong answer for your use case.

like image 21
DSM Avatar answered Oct 12 '22 23:10

DSM