How to keep only the consecutive values in a Pandas dataframe using Python

I have a dataframe that looks like this:

[Image not available: the original dataframe]

I want to keep only the consecutive years in each group, as in the following figure, where 2005 in group A and 2009 and 2011 in group B are deleted.

[Image not available: the desired result]

I created a column of the year difference by using df['year_diff']=df.groupby(['group'])['Year'].diff(), and then only kept the rows where the year difference was equal to 1.

However, this method also deletes the first row of each run of consecutive years, because the year difference of that row is NaN (for the first row of a group) or greater than 1 (for a run that starts after a gap). For example, 2000 will be deleted from the run of years beginning in 2000. Is there a way to avoid this problem?
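
The problem described above can be reproduced with a minimal sketch. The sample data here is hypothetical (the original figure is not available); only the diff call is taken from the question.

```python
import pandas as pd

# Hypothetical sample data: two runs of consecutive years in one group
df = pd.DataFrame({
    'group': ['A'] * 5,
    'Year': [2000, 2001, 2002, 2005, 2006],
})

# Difference to the previous year within each group (NaN for the first row)
df['year_diff'] = df.groupby(['group'])['Year'].diff()

# Filtering on a difference of 1 drops the first year of every run
kept = df[df['year_diff'] == 1]
print(kept['Year'].tolist())  # [2001, 2002, 2006]
```

Both 2000 and 2005 are lost even though each one starts a run of consecutive years.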

yihao ren asked May 20 '19 13:05



2 Answers

shift

Get the year diffs as the OP first did. Then keep a row if its own diff equals 1 or the next row's diff equals 1.

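# True where Year is exactly 1 more than the previous Year in the same group
yd = df.Year.groupby(df.group).diff().eq(1)
# Keep rows that end a step of 1 or start one (fill_value keeps the dtype boolean)
df[yd | yd.shift(-1, fill_value=False)]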

   group  Year
0      A  2000
1      A  2001
2      A  2002
3      A  2003
5      A  2007
6      A  2008
7      A  2009
8      A  2010
9      A  2011
10     B  2005
11     B  2006
12     B  2007
15     B  2013
16     B  2014
17     B  2015
18     B  2016
19     B  2017

Setup

Thx jez

import pandas as pd
a = [('A',x) for x in range(2000, 2012) if x not in [2004,2006]]
b = [('B',x) for x in range(2005, 2018) if x not in [2008,2010,2012]]
df = pd.DataFrame(a + b, columns=['group','Year'])
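
As a variation on the same idea, here is a minimal sketch that looks at both neighbours within each group instead of shifting the flag across the whole frame. The helper name keep_consecutive and its parameters are hypothetical, and it assumes the rows are already sorted by year within each group.

```python
import pandas as pd

def keep_consecutive(df, group_col='group', year_col='Year'):
    """Keep rows that have a neighbour exactly one year away in the same group."""
    years = df.groupby(group_col)[year_col]
    follows_prev = years.diff().eq(1)       # previous row is Year - 1
    precedes_next = years.diff(-1).eq(-1)   # next row is Year + 1
    return df[follows_prev | precedes_next]

# Same setup as in the answer above
a = [('A', x) for x in range(2000, 2012) if x not in [2004, 2006]]
b = [('B', x) for x in range(2005, 2018) if x not in [2008, 2010, 2012]]
df = pd.DataFrame(a + b, columns=['group', 'Year'])

result = keep_consecutive(df)
print(result.index.tolist())
```

Rows 4, 13 and 14 (2005 in A, 2009 and 2011 in B) are dropped, matching the output shown above. Because both comparisons are done per group, no flag can leak across a group boundary.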

piRSquared answered Oct 16 '22 21:10


If I understand correctly, you can use diff and cumsum to create an additional group key, then group by that key together with your group column, and drop the rows whose group count equals 1.

df[df.g.groupby([df.g,df.Year.diff().ne(1).cumsum()]).transform('count').ne(1)]

Out[317]:
    g  Year
0   A  2000
1   A  2001
2   A  2002
3   A  2003
5   A  2007
6   A  2008
7   A  2009
8   A  2010
9   A  2011
10  B  2005
11  B  2006
12  B  2007
15  B  2013
16  B  2014
17  B  2015
18  B  2016
19  B  2017

Data

import pandas as pd
df=pd.DataFrame({'g':list('AAAAAAAAAABBBBBBBBBB'),
                 'Year':[2000,2001,2002,2003,2005,2007,2008,2009,2010,2011,2005,2006,2007,2009,2011,2013,2014,2015,2016,2017]})
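
To make the intermediate steps of the one-liner visible, here is a minimal sketch that spells them out on a smaller, hypothetical frame. The names run_id, run_size, tmp and out are introduced only for illustration.

```python
import pandas as pd

# Small hypothetical frame with the same column names as above
df = pd.DataFrame({
    'g': list('AAAABBB'),
    'Year': [2000, 2001, 2003, 2005, 2005, 2006, 2008],
})

# A new run starts wherever the step from the previous row is not exactly 1,
# so the cumulative sum gives every run its own id: 1, 1, 2, 3, 4, 4, 5
tmp = df.assign(run_id=df['Year'].diff().ne(1).cumsum())

# Count the rows in each (g, run_id) pair and broadcast it back to every row;
# grouping by g as well keeps a run from spanning two groups
run_size = tmp.groupby(['g', 'run_id'])['Year'].transform('count')

# Runs of length 1 are isolated years, so drop them
out = df[run_size.ne(1)]
print(out['Year'].tolist())  # [2000, 2001, 2005, 2006]
```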

BENY answered Oct 16 '22 22:10