Hi, I have a DataFrame that contains multiple rows for the same ID. One of the columns is a Date (in ascending order). I want to calculate the date difference between the first entry and the last.
I am doing this by calling the pandas DataFrame constructor as follows:
g = df.groupby('ID')
print(pd.DataFrame({'first':g.Date.nth(0), 'last':g.Date.nth(-1)}))
The first value is correct; however, the last value is nowhere near correct.
For example, for a specific ID, the first date is 2000-05-08 and the last date is 8/21/2010. The output is:
first last
ID
31965.0 2000-05-08 2002-12-29
2002-12-29 is somewhere in the middle.
Sample Data:
ID Date
31965 5/8/2000
31965 5/10/2000
31965 5/18/2000
31965 5/22/2000
31965 5/23/2000
31965 5/25/2000
31965 5/30/2000
31965 6/7/2000
31965 6/8/2000
31965 6/11/2000
31965 6/13/2000
.....
31965 4/11/2009
31965 5/9/2009
31965 5/16/2009
31965 5/23/2009
31965 2/5/2010
31965 2/26/2010
31965 3/13/2010
31965 4/10/2010
31965 8/21/2010
I want my result for ID 31965 to be: 5/8/2000 and 8/21/2010 so that I can eventually work out the date difference.
To get the last row of each group, call last() after grouping.
iloc is a pandas indexer that selects rows and columns by integer position. With df.iloc[...], you can retrieve a particular value from a row or column using its positional index.
How do you group by the index in pandas? Pass the index name of the DataFrame to groupby() to group rows on that index. DataFrame.groupby() takes a string or a list to specify the group columns or index levels.
Groupby preserves the order of rows within each group.
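Because row order is preserved within each group, 'first' and 'last' reflect the order of the rows, not the chronological order of the dates. A minimal sketch of that difference, using a made-up three-row frame (the ID value 1 and the shuffled order are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data: dates for one ID, deliberately out of order
df = pd.DataFrame({
    "ID": [1, 1, 1],
    "Date": pd.to_datetime(["2010-08-21", "2000-05-08", "2002-12-29"]),
})

# 'first'/'last' follow row order within the group
by_order = df.groupby("ID")["Date"].agg(["first", "last"])

# 'min'/'max' follow the date values themselves
by_value = df.groupby("ID")["Date"].agg(["min", "max"])

print(by_order)
print(by_value)
```

If the two results differ, the rows are not sorted by date within the group.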
You can do this in one step; just be sure your 'Date' column has a datetime dtype:
df['Date'] = pd.to_datetime(df['Date'])
df.groupby('ID')['Date'].agg(['first','last'])
Now, I suspect your data may not be ordered correctly. If you want the earliest and the latest date regardless of row order, you can do this:
df.groupby('ID')['Date'].agg(['min','max']).rename(columns={'min':'first','max':'last'})
Or you can sort with sort_values first:
df.sort_values('Date').groupby('ID')['Date'].agg(['first','last'])
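To get from the first and last dates to the date difference the question asks for, something like the following should work. It is only a sketch: the frame below reuses a few rows from the sample data in shuffled order, and the 'diff' column name is an assumption.

```python
import pandas as pd

# A few rows from the question's sample data, shuffled on purpose
df = pd.DataFrame({
    "ID": [31965, 31965, 31965, 31965],
    "Date": ["5/10/2000", "8/21/2010", "5/8/2000", "4/11/2009"],
})

# Parse the month/day/year strings into real datetimes
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Earliest and latest date per ID, independent of row order
span = (
    df.groupby("ID")["Date"]
      .agg(["min", "max"])
      .rename(columns={"min": "first", "max": "last"})
)

# Subtracting two datetime columns gives a Timedelta column
span["diff"] = span["last"] - span["first"]
print(span)
```

Use span["diff"].dt.days if you need the difference as a plain number of days.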
You will probably have to parse the last date in this way:
from datetime import datetime
def parser(x):
    # Parse a month/day/year string such as '8/21/2010'
    return datetime.strptime(str(x), '%m/%d/%Y')
Here, you feed your date string into the function, and the function returns a parsed datetime. You can parse the first date the same way to produce something consistent with the last date; the only thing you might need to change is the format string '%m/%d/%Y'. That should solve your problem. Read this page for more information: https://docs.python.org/2/library/datetime.html
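For a whole column, the same format string can be passed to pd.to_datetime instead of applying a parser row by row. A small sketch, assuming every value is in month/day/year form like the sample data (the two-element Series is made up for illustration):

```python
from datetime import datetime

import pandas as pd

# Hypothetical raw strings in the same m/d/Y form as the sample data
raw = pd.Series(["5/8/2000", "8/21/2010"])

# Vectorised parsing of the whole Series with an explicit format
parsed = pd.to_datetime(raw, format="%m/%d/%Y")

# Parsing a single string with the standard library
single = datetime.strptime("8/21/2010", "%m/%d/%Y")

print(parsed)
print(single)
```

An explicit format avoids any ambiguity between month-first and day-first dates.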