I'd like to be able to compute descriptive statistics on data in a Pandas DataFrame, but I only care about duplicated entries. For example, let's say I have the DataFrame created by:
import pandas as pd
data={'key1':[1,2,3,1,2,3,2,2],'key2':[2,2,1,2,2,4,2,2],'data':[5,6,2,6,1,6,2,8]}
frame=pd.DataFrame(data,columns=['key1','key2','data'])
print frame
key1 key2 data
0 1 2 5
1 2 2 6
2 3 1 2
3 1 2 6
4 2 2 1
5 3 4 6
6 2 2 2
7 2 2 8
As you can see, rows 0,1,3,4,6, and 7 are all duplicates (using 'key1' and 'key2'. However, if I index this DataFrame like so:
frame[frame.duplicated(['key1','key2'])]
I get
key1 key2 data
3 1 2 6
4 2 2 1
6 2 2 2
7 2 2 8
(i.e., the 1st and 2nd rows do not show up because they are not indexed to True by the duplicated method).
That is my first problem. My second problems deals with how to extract the descriptive statistics from this information. Forgetting the missing duplicate for the moment, let's say I want to compute the .min() and .max() for the duplicate entries (so that I can get a range). I can use groupby and these methods on the groupby object like so:
a.groupby(['key1','key2']).min()
which gives
key1 key2 data
key1 key2
1 2 1 2 6
2 2 2 2 1
The data I want is obviously here, but what's the best way for me to extract it? How do I index the resulting object to get what I want (which is the key1,key2,data info)?
The pandas. DataFrame. duplicated() method is used to find duplicate rows in a DataFrame. It returns a boolean series which identifies whether a row is duplicate or unique.
Pandas DataFrame. duplicated() function is used to get/find/select a list of all duplicate rows(all or selected columns) from pandas. Duplicate rows means, having multiple rows on all columns. Using this method you can get duplicate rows on selected multiple columns or all columns.
Pandas DataFrame duplicated() Method The duplicated() method returns a Series with True and False values that describe which rows in the DataFrame are duplicated and not. Use the subset parameter to specify if any columns should not be considered when looking for duplicates.
EDIT for Pandas 0.17 or later:
As the take_last
argument of the duplicated()
method was deprecated in favour of the new keep
argument since Pandas 0.17, please refer to this answer for the correct approach:
duplicated()
method with keep=False
, i.e. frame.duplicated(['key1', 'key2'], keep=False)
.Therefore, in order to extract the required data for this specific question, the following suffices:
In [81]: frame[frame.duplicated(['key1', 'key2'], keep=False)].groupby(('key1', 'key2')).min()
Out[81]:
data
key1 key2
1 2 5
2 2 1
[2 rows x 1 columns]
Interestingly enough, this change in Pandas 0.17 may be partially attributed to this question, as referred to in this issue.
For versions preceding Pandas 0.17:
We can play with the take_last
argument of the duplicated()
method:
take_last
:boolean
, defaultFalse
For a set of distinct duplicate rows, flag all but the last row as duplicated. Default is for all but the first row to be flagged.
If we set take_last
's value to True
, we flag all but the last duplicate row. Combining this along with its default value of False
, which flags all but the first duplicate row, allows us to flag all duplicated rows:
In [76]: frame.duplicated(['key1', 'key2'])
Out[76]:
0 False
1 False
2 False
3 True
4 True
5 False
6 True
7 True
dtype: bool
In [77]: frame.duplicated(['key1', 'key2'], take_last=True)
Out[77]:
0 True
1 True
2 False
3 False
4 True
5 False
6 True
7 False
dtype: bool
In [78]: frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])
Out[78]:
0 True
1 True
2 False
3 True
4 True
5 False
6 True
7 True
dtype: bool
In [79]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])]
Out[79]:
key1 key2 data
0 1 2 5
1 2 2 6
3 1 2 6
4 2 2 1
6 2 2 2
7 2 2 8
[6 rows x 3 columns]
Now we just need to use the groupby
and min
methods, and I believe the output is in the required format:
In [81]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])].groupby(('key1', 'key2')).min()
Out[81]:
data
key1 key2
1 2 5
2 2 1
[2 rows x 1 columns]
To get a list of all the duplicated entries with Pandas version 0.17, you can simply set 'keep = False' in the duplicated function.
frame[frame.duplicated(['key1','key2'],keep=False)]
key1 key2 data
0 1 2 5
1 2 2 6
3 1 2 6
4 2 2 1
6 2 2 2
7 2 2 8
Here's one possible solution to return all duplicated values in the two columns (i.e. rows 0, 1, 3, 4, 6, 7):
>>> key1_dups = frame.key1[frame.key1.duplicated()].values
>>> key2_dups = frame.key2[frame.key2.duplicated()].values
>>> frame[frame.key1.isin(key1_dups) & frame.key2.isin(key2_dups)]
key1 key2 data
0 1 2 5
1 2 2 6
3 1 2 6
4 2 2 1
6 2 2 2
7 2 2 8
(Edit: actually, the df.duplicated(take_last=True) | df.duplicated()
method in @Yoel's answer is neater.)
To query the results of your groupby
operation, you can use loc
. For example:
>>> dups = frame[frame.key1.isin(key1_dups) & frame.key2.isin(key2_dups)]
>>> grouped = dups.groupby(['key1','key2']).min()
>>> grouped
data
key1 key2
1 2 5
2 2 1
>>> grouped.loc[1, 2]
data 5
Name: (1, 2), dtype: int64
Alternatively, turn grouped
back into a "normal-looking" DataFrame by resetting both indexes:
>>> grouped.reset_index(level=0).reset_index(level=0)
key2 key1 data
0 2 1 5
1 2 2 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With