Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Splitting duplicates into separate table - Pandas

In Pandas, I can drop duplicate rows inside a database based on a single column using the

data.drop_duplicates('foo')

command. I'm wondering if there is a way to catch this data in another table for independent review.

like image 504
roffster Avatar asked Jul 12 '16 02:07

roffster


2 Answers

You can call the duplicated method on the foo column and then subset your original data frame based on it, something like this:

data.loc[data['foo'].duplicated(), :]

As an example:

data = pd.DataFrame({'foo': [1,1,1,2,2,2], 'bar': [1,1,2,2,3,3]})    
data

# bar foo
#0  1   1
#1  1   1
#2  2   1
#3  2   2
#4  3   2
#5  3   2


data.loc[data['foo'].duplicated(), :]
# bar foo
#1  1   1
#2  2   1
#4  3   2
#5  3   2
like image 173
Psidom Avatar answered Sep 20 '22 14:09

Psidom


drop_duplicates has a companion method duplicated. They both take similar arguments.

The key arguments are:

  • subset - column label or sequence of labels
    • Only consider certain columns for identifying duplicates, by default use all of the columns
  • keep - {‘first’, ‘last’, False}, default ‘first’

When keep is set to 'first':

  • drop_duplicates returns a dataframe in which the first occurrence of the combination of columns specified by subset is kept and drops the rest.
  • duplicated returns a boolean mask indexed with the same index as the original dataframe with a value for True for all duplicated combinations of the specified set of columns except for the 'first'. You can use this mask to either get at the rows to be dropped or it's complement (the same as drop_duplicates)

Example

df = pd.DataFrame(list('abcdbef'), columns=['letter'])
df

enter image description here

df.drop_duplicates(keep='first')  # same as default

enter image description here

df.duplicated(keep='first')  # same as default

0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool

Notice the row corresponding to the first instance of 'b' is False while the second instance is True indicating it is to be dropped.

Answer

df[df.duplicated(keep='first')]

enter image description here


keep='last' and keep=False

Here are examples of what it looks like with the keep argument set to 'last' or False

drop duplicates

df.drop_duplicates(keep='last')

enter image description here

df.duplicated(keep='last')

0    False
1     True
2    False
3    False
4    False
5    False
6    False
dtype: bool

This time the first instance is True indicating it is to be dropped while the second instance is False indicating it is not to be dropped.

just the duplicates

df[df.duplicated(keep='last')]

enter image description here

drop duplicates

df.drop_duplicates(keep=False)

enter image description here

df.duplicated(keep=False)

0    False
1     True
2    False
3    False
4     True
5    False
6    False
dtype: bool

This time both instances are True and both are dropped.

just the duplicates

df[df.duplicated(keep=False)]

enter image description here

like image 24
piRSquared Avatar answered Sep 20 '22 14:09

piRSquared