In pandas, given a DataFrame D:
+-----+--------+--------+--------+
| | 1 | 2 | 3 |
+-----+--------+--------+--------+
| 0 | apple | banana | banana |
| 1 | orange | orange | orange |
| 2 | banana | apple | orange |
| 3 | NaN | NaN | NaN |
| 4 | apple | apple | apple |
+-----+--------+--------+--------+
How do I return rows that have the same contents across all of its columns when there are three columns or more such that it returns this:
+-----+--------+--------+--------+
| | 1 | 2 | 3 |
+-----+--------+--------+--------+
| 1 | orange | orange | orange |
| 4 | apple | apple | apple |
+-----+--------+--------+--------+
Note that it skips rows when all values are NaN.
If this were only two columns, I usually do D[D[1]==D[2]]
but I don't know how to generalize this for more than 2 column DataFrames.
The pandas. DataFrame. duplicated() method is used to find duplicate rows in a DataFrame. It returns a boolean series which identifies whether a row is duplicate or unique.
Pandas Series: equals() function The equals() function is used to test whether two Pandas objects contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
To find the positions of two matching columns, we first initialize a pandas dataframe with two columns of city names. Then we use where() of numpy to compare the values of two columns. This returns an array that represents the indices where the two columns have the same value.
Similar to Andy Hayden answer with check if min equal to max (then row elements are all duplicates):
df[df.apply(lambda x: min(x) == max(x), 1)]
My entry:
>>> df
0 1 2
0 apple banana banana
1 orange orange orange
2 banana apple orange
3 NaN NaN NaN
4 apple apple apple
[5 rows x 3 columns]
>>> df[df.apply(pd.Series.nunique, axis=1) == 1]
0 1 2
1 orange orange orange
4 apple apple apple
[2 rows x 3 columns]
This works because calling pd.Series.nunique
on the rows gives:
>>> df.apply(pd.Series.nunique, axis=1)
0 2
1 1
2 3
3 0
4 1
dtype: int64
Note: this would, however, keep rows which look like [nan, nan, apple]
or [nan, apple, apple]
. Usually I want that, but that might be the wrong answer for your use case.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With