
pandas: How to keep the last `n` records of each group sorted by another variable?

I want to keep the last n rows of each group, sorted by a variable var_to_sort, using pandas.

This is how I do it now: I sort the dataframe below by date, group it by names, and then use tail(n) to get the last n rows within each group.

from datetime import date

import pandas as pd

data = [
    ['tom', date(2018,2,1), "I want this"],
    ['tom', date(2018,1,1), "Don't want"],
    ['nick', date(2019,4,1), "Don't want"],
    ['nick', date(2019,5,1), "I want this"]]

# Create the pandas DataFrame
df = pd.DataFrame(data)
df.columns = ["names", "date", "result"]

# sort it
df.sort_values("date", inplace=True)

df.groupby("names").tail(1)

Is there a more efficient way to do this? What if the dataset is already indexed by "date" or by ["date", "names"]?

xiaodai asked Aug 19 '19 05:08


1 Answer

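I think your solution is fine. It is also possible to call sort_values without inplace=True, so that the calls can be chained together.

For your other questions (data already indexed by "date", or by ["date", "names"]), sort by the index instead and group by the column or index level: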

from datetime import date

import pandas as pd

data = [
    ['tom', date(2018,2,1), "I want this"],
    ['tom', date(2018,1,1), "Don't want"],
    ['nick', date(2019,4,1), "Don't want"],
    ['nick', date(2019,5,1), "I want this"]]

# Create the pandas DataFrame
df = pd.DataFrame(data)
df.columns = ["names", "date", "result"]

# Sort by date (returns a new frame), then keep the last row of each group
df1 = df.sort_values("date").groupby("names").tail(1)
print(df1)
  names        date       result
0   tom  2018-02-01  I want this
3  nick  2019-05-01  I want this

# Case 1: the frame is indexed by date
df2 = df.set_index('date')
print(df2)
           names       result
date                         
2018-02-01   tom  I want this
2018-01-01   tom   Don't want
2019-04-01  nick   Don't want
2019-05-01  nick  I want this

# Sort by the index instead of a column, then keep the last row per group
df22 = df2.sort_index().groupby("names").tail(1)
print(df22)
           names       result
date                         
2018-02-01   tom  I want this
2019-05-01  nick  I want this

# Case 2: the frame is indexed by [date, names]
df3 = df.set_index(['date','names'])
print(df3)
                       result
date       names             
2018-02-01 tom    I want this
2018-01-01 tom     Don't want
2019-04-01 nick    Don't want
2019-05-01 nick   I want this

# Sort by the MultiIndex, then group by the names level (level=1)
df33 = df3.sort_index().groupby(level='names').tail(1)
print(df33)
                       result
date       names             
2018-02-01 tom    I want this
2019-05-01 nick   I want this
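
The examples above all use tail(1), while the question asks about the last n rows in general. As a minimal sketch of how the same sort-then-tail pattern might be wrapped for any n, the helper below is one option. The name last_n_per_group and its parameters are assumptions made for illustration; it is not a pandas function.

```python
from datetime import date

import pandas as pd


def last_n_per_group(frame, group_col, sort_col, n=1):
    """Return the last n rows of each group after sorting by sort_col.

    Hypothetical helper for illustration only, not part of pandas.
    """
    # sort_values returns a new frame, so the caller's frame is not modified
    return frame.sort_values(sort_col).groupby(group_col).tail(n)


data = [
    ["tom", date(2018, 2, 1), "I want this"],
    ["tom", date(2018, 1, 1), "Don't want"],
    ["nick", date(2019, 4, 1), "Don't want"],
    ["nick", date(2019, 5, 1), "I want this"],
]
df = pd.DataFrame(data, columns=["names", "date", "result"])

last_one = last_n_per_group(df, "names", "date", n=1)
last_two = last_n_per_group(df, "names", "date", n=2)
print(last_one)
print(last_two)
```

With n=1 this should select the same two rows as df1 above. If a group has fewer than n rows, tail(n) simply returns all of that group's rows.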
jezrael answered Oct 16 '22 11:10