I have some data that looks something like:
user timestamp value1
a 2007-01-01 7
a 2007-02-02 8
a 2007-02-03 9
b 2007-02-04 1
a 2007-02-05 2
b 2007-02-06 3
b 2007-02-07 4
a 2007-02-08 5
...
where each user has a different number of entries.
My goal is to have a sense of how quickly these entries are being produced, and output something like:
last_entry median_entry first_entry
user
a 2007-02-08 2007-02-03 2007-01-01
b 2007-02-07 2007-02-06 2007-02-04
So far, I have code like:
gb = df.groupby('user')
time_median = gb['timestamp'].median()
But this gives me DataError: No numeric types to aggregate
, presumably because dates are not numeric.
I guess it would be possible to turn the dates into timestamps, and find the median of those, and then turn them back into date_time objects. Is that the best way?
I may have been unclear in my question, but I found a solution that works for me.
def get_quantile(df, q):
# Function that gets quantile from integer timestamp, then changes
# back to a date_time object
return pd.to_datetime(df['timestamp'].quantile(q, interpolation='nearest'))
df = pd.DataFrame(data={'user': np.random.choice(['a', 'b','c'], size=100, replace=True), 'value': np.random.random(size=100), 'date_time': pd.date_range(start=date(2016, 1,1), freq='D', periods=100)})
# Make a column of integer timestamps
df['timestamp'] = df['date_time'].astype('int')
editors = d.groupby('editor')
result = pd.DataFrame()
# Add columns of quantiles
result['first_quantile'] = get_quantile(editors, .25)
etc.
Assuming you want to treat each date after the initial date for each user as the number of days since this initial date, you could do something like
import pandas as pd
dts = pd.date_range(start="2015-01-15", periods=20)
users = ["a","b"]*10
df = pd.DataFrame({"user":users, "timestamp":dts})
date_info = df.groupby("user").agg({"timestamp":[min, max]})
date_info.columns = date_info.columns.droplevel()
since_incept = lambda x: x - x.min()
df["days"] = df.groupby("user").transform(since_incept)
df["days"] = df["days"].dt.days
median_td = lambda x: pd.Timedelta(pd.Series.median(x), "D")
med = df.groupby("user").agg({"days":[median_td]})
date_info["median"] = date_info["min"] + med.loc[:, ("days", "<lambda>")]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With