Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Keeping the last N duplicates in pandas

Given a dataframe:

>>> import pandas as pd
>>> lol = [['a', 1, 1], ['b', 1, 2], ['c', 1, 4], ['c', 2, 9], ['b', 2, 10], ['x', 2, 5], ['d', 2, 3], ['e', 3, 5], ['d', 2, 10], ['a', 3, 5]]
>>> df = pd.DataFrame(lol)

>>> df.rename(columns={0:'value', 1:'key', 2:'something'})
  value  key  something
0     a    1          1
1     b    1          2
2     c    1          4
3     c    2          9
4     b    2         10
5     x    2          5
6     d    2          3
7     e    3          5
8     d    2         10
9     a    3          5

The goal is to keep the last N rows for the unique values of the key column.

If N=1, I could simply use the .drop_duplicates() function as such:

>>> df.drop_duplicates(subset='key', keep='last')
  value  key  something
2     c    1          4
8     d    2         10
9     a    3          5

How do I keep the last 3 rows for each unique values of key?


I could try this for N=3:

>>> from itertools import chain
>>> unique_keys = {k:[] for k in df['key']}
>>> for idx, row in df.iterrows():
...     k = row['key']
...     unique_keys[k].append(list(row))
... 
>>>
>>> df = pd.DataFrame(list(chain(*[v[-3:] for k,v in unique_keys.items()])))
>>> df.rename(columns={0:'value', 1:'key', 2:'something'})
  value  key  something
0     a    1          1
1     b    1          2
2     c    1          4
3     x    2          5
4     d    2          3
5     d    2         10
6     e    3          5
7     a    3          5

But there must be a better way...

like image 990
alvas Avatar asked Oct 17 '17 01:10

alvas


People also ask

What is keep =' last in Python?

keep: allowed values are {'first', 'last', False}, default 'first'. If 'first', duplicate rows except the first one is deleted. If 'last', duplicate rows except the last one is deleted. If False, all the duplicate rows are deleted. inplace: if True, the source DataFrame is changed and None is returned.

What does Drop_duplicates do in pandas?

Pandas DataFrame drop_duplicates() Method The drop_duplicates() method removes duplicate rows. Use the subset parameter if only some specified columns should be considered when looking for duplicates.

Does pandas drop duplicates keep first?

Drop duplicates but keep first drop_duplicates() . The rows that contain the same values in all the columns then are identified as duplicates. If the row is duplicated then by default DataFrame. drop_duplicates() keeps the first occurrence of that row and drops all other duplicates of it.


2 Answers

Is this what you want ?

df.groupby('key').tail(3)
Out[127]: 
  value  key  something
0     a    1          1
1     b    1          2
2     c    1          4
5     x    2          5
6     d    2          3
7     e    3          5
8     d    2         10
9     a    3          5
like image 185
BENY Avatar answered Oct 20 '22 19:10

BENY


Does this help:

for k,v in df.groupby('key'):
    print v[-2:]

  value  key  something
1     b    1          2
2     c    1          4
  value  key  something
6     d    2          3
8     d    2         10
  value  key  something
7     e    3          5
9     a    3          5
like image 31
Merlin Avatar answered Oct 20 '22 18:10

Merlin