Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get date quantiles in pandas

I have some data that looks something like:

user  timestamp  value1 
   a 2007-01-01       7 
   a 2007-02-02       8 
   a 2007-02-03       9 
   b 2007-02-04       1 
   a 2007-02-05       2 
   b 2007-02-06       3 
   b 2007-02-07       4 
   a 2007-02-08       5 
...

where each user has a different number of entries.

My goal is to have a sense of how quickly these entries are being produced, and output something like:

     last_entry median_entry first_entry
user                                    
a    2007-02-08   2007-02-03  2007-01-01
b    2007-02-07   2007-02-06  2007-02-04

So far, I have code like:

gb = df.groupby('user')
time_median = gb['timestamp'].median()

But this gives me DataError: No numeric types to aggregate, presumably because dates are not numeric.

I guess it would be possible to turn the dates into timestamps, and find the median of those, and then turn them back into date_time objects. Is that the best way?

like image 648
Jeremy Avatar asked Apr 28 '16 21:04

Jeremy


2 Answers

I may have been unclear in my question, but I found a solution that works for me.

def get_quantile(df, q):
    # Function that gets quantile from integer timestamp, then changes
    # back to a date_time object
    return pd.to_datetime(df['timestamp'].quantile(q, interpolation='nearest'))

df = pd.DataFrame(data={'user': np.random.choice(['a', 'b','c'], size=100, replace=True), 'value': np.random.random(size=100), 'date_time': pd.date_range(start=date(2016, 1,1), freq='D', periods=100)})

# Make a column of integer timestamps
df['timestamp'] = df['date_time'].astype('int')

editors = d.groupby('editor')

result = pd.DataFrame()
# Add columns of quantiles
result['first_quantile'] = get_quantile(editors, .25)
etc.
like image 126
Jeremy Avatar answered Oct 13 '22 20:10

Jeremy


Assuming you want to treat each date after the initial date for each user as the number of days since this initial date, you could do something like

import pandas as pd
dts =  pd.date_range(start="2015-01-15", periods=20)
users = ["a","b"]*10
df = pd.DataFrame({"user":users, "timestamp":dts})

date_info = df.groupby("user").agg({"timestamp":[min, max]})
date_info.columns = date_info.columns.droplevel()

since_incept = lambda x: x - x.min()
df["days"] = df.groupby("user").transform(since_incept)
df["days"] = df["days"].dt.days

median_td = lambda x: pd.Timedelta(pd.Series.median(x), "D")
med = df.groupby("user").agg({"days":[median_td]})

date_info["median"] = date_info["min"] + med.loc[:, ("days", "<lambda>")]
like image 35
mgilbert Avatar answered Oct 13 '22 19:10

mgilbert