I have a dataframe that looks like this:
I want to keep only the consecutive years in each group, as in the following figure, where the year 2005 in group A and the years 2009 and 2011 in group B are deleted.
I created a column of the year difference using df['year_diff'] = df.groupby(['group'])['Year'].diff(), and then kept only the rows where the year difference was equal to 1.
However, this method also deletes the first row of each run of consecutive years, since the year difference of that first row is NaN (or greater than 1). For example, the year 2000 will be deleted from group 2000-2005. Is there a way to avoid this problem?
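To make the problem concrete, here is a minimal sketch of the behaviour described above. The four-row frame is made-up illustration data (an assumption, not the original dataframe): one run 2000-2002 followed by an isolated 2005.

```python
import pandas as pd

# Hypothetical sample data: a run of consecutive years, then an isolated year.
df = pd.DataFrame({'group': ['A', 'A', 'A', 'A'],
                   'Year': [2000, 2001, 2002, 2005]})

# Year difference within each group; the first row of a group is NaN.
df['year_diff'] = df.groupby('group')['Year'].diff()

# Keeping only diff == 1 removes 2005 as intended, but also drops 2000,
# the first year of the consecutive run.
naive = df[df['year_diff'] == 1]
print(naive['Year'].tolist())
```

The year 2000 is lost even though it belongs to the run 2000-2002, which is the issue the question asks about.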
shift
Get the year diffs as the OP first did. Then, for each row, check whether its diff is equal to 1 or the next row's diff is equal to 1.
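# True where a year follows the previous year in the same group
yd = df.Year.groupby(df.group).diff().eq(1)
# Keep a row if it continues a run or the next row does; fill_value avoids a trailing NaN
df[yd | yd.shift(-1, fill_value=False)]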
group Year
0 A 2000
1 A 2001
2 A 2002
3 A 2003
5 A 2007
6 A 2008
7 A 2009
8 A 2010
9 A 2011
10 B 2005
11 B 2006
12 B 2007
15 B 2013
16 B 2014
17 B 2015
18 B 2016
19 B 2017
Thx jez
import pandas as pd

a = [('A', x) for x in range(2000, 2012) if x not in [2004, 2006]]
b = [('B', x) for x in range(2005, 2018) if x not in [2008, 2010, 2012]]
df = pd.DataFrame(a + b, columns=['group', 'Year'])
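For anyone who wants to run this end to end, the following is a self-contained sketch of the same idea. It is a variant rather than the exact code above: instead of shifting the mask, it takes a second, backward diff per group, so nothing is compared across group boundaries. The names follows_prev and precedes_next are illustrative only.

```python
import pandas as pd

a = [('A', x) for x in range(2000, 2012) if x not in [2004, 2006]]
b = [('B', x) for x in range(2005, 2018) if x not in [2008, 2010, 2012]]
df = pd.DataFrame(a + b, columns=['group', 'Year'])

years = df.groupby('group')['Year']

# True where the year is exactly one more than the previous year in the group.
follows_prev = years.diff().eq(1)

# True where the next year in the group is exactly one more than this year.
# diff(-1) computes Year minus the following Year, hence the comparison with -1.
precedes_next = years.diff(-1).eq(-1)

# A row belongs to a run of consecutive years if either condition holds.
result = df[follows_prev | precedes_next]
print(result)
```

With the setup data above, this drops rows 4, 13 and 14 (A 2005, B 2009, B 2011) and keeps the first year of every run.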
If I understand correctly, use diff and cumsum to create an additional group key, then groupby that key together with your group column, and drop the rows whose count is equal to 1.
df[df.g.groupby([df.g,df.Year.diff().ne(1).cumsum()]).transform('count').ne(1)]
Out[317]:
g Year
0 A 2000
1 A 2001
2 A 2002
3 A 2003
5 A 2007
6 A 2008
7 A 2009
8 A 2010
9 A 2011
10 B 2005
11 B 2006
12 B 2007
15 B 2013
16 B 2014
17 B 2015
18 B 2016
19 B 2017
Data
import pandas as pd

df = pd.DataFrame({'g': list('AAAAAAAAAABBBBBBBBBB'),
                   'Year': [2000,2001,2002,2003,2005,2007,2008,2009,2010,2011,2005,2006,2007,2009,2011,2013,2014,2015,2016,2017]})
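The one-liner above can be unpacked into steps, which may make the diff/cumsum trick easier to follow. This is a sketch of the same approach; the intermediate names run_id and run_size are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'g': list('AAAAAAAAAABBBBBBBBBB'),
    'Year': [2000, 2001, 2002, 2003, 2005, 2007, 2008, 2009, 2010, 2011,
             2005, 2006, 2007, 2009, 2011, 2013, 2014, 2015, 2016, 2017],
})

# A new run starts wherever the year is not exactly one more than the row above.
# The cumulative sum of those "breaks" gives every run its own integer id.
run_id = df['Year'].diff().ne(1).cumsum()

# Number of rows in each (group, run) block, broadcast back to every row.
# Grouping by 'g' as well guards against a run spanning two groups.
run_size = df.groupby(['g', run_id])['Year'].transform('count')

# Runs of length 1 are isolated years; keep everything else.
result = df[run_size.ne(1)]
print(result)
```

Printing run_id next to df is a quick way to check that each block of consecutive years received its own key.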