I am stuck on what is probably a very simple problem. My data looks something like this:
| id_not_unique | datetime | seconds |
|---|---|---|
| 111111111 | 2020-08-26 15:44:58 | 122 |
| 111111111 | 2020-08-28 15:33:45 | 34 |
| 222222222 | 2020-07-12 11:21:09 | 26 |
| 222222222 | 2019-04-21 14:22:42 | 57 |
I want to group by id_not_unique, find the minimum datetime within each group, and return the corresponding seconds value in a new column called time. The result would look something like:
| id_not_unique | time |
|---|---|
| 111111111 | 122 |
| 222222222 | 57 |
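For reference, this snippet reproduces the sample data above (the frame is named df, matching the code below):

import pandas as pd

df = pd.DataFrame({
    'id_not_unique': [111111111, 111111111, 222222222, 222222222],
    'datetime': ['2020-08-26 15:44:58', '2020-08-28 15:33:45',
                 '2020-07-12 11:21:09', '2019-04-21 14:22:42'],
    'seconds': [122, 34, 26, 57],
})
# parse the timestamp strings into real datetimes
df['datetime'] = pd.to_datetime(df['datetime'])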
I have tried this:
def wait_first_call(df):
    min_time = min(df['datetime'])
    idx = df.index[df.datetime == min_time]
    first_time = df['seconds'].iloc[idx]
    return first_time
then
df.groupby(['id_not_unique']).apply(wait_first_call)
But I keep getting an "IndexError: positional indexers are out-of-bounds", and I do not understand why -- I thought apply took each group as a DataFrame and applied the function to that group?
Any suggestions/help would be greatly appreciated.
There are a few issues with your code:

.iloc indexes by position (row number, column number), not by label, so you should not pass index labels to it. Inside apply, each group keeps its original index labels, so a label can easily be greater than the number of rows in the group's DataFrame, and .iloc then throws the out-of-bounds error. That is what is happening in your case. You can resolve it by using .loc, which takes labels, together with df['datetime'].idxmin() to get the index label of the minimum value. The other problem, with df.index[df.datetime == min_time], is that it returns a list of indices even if there is only one match (format: Index([ind])), so indexing with it as in df['seconds'].loc[idx] gives you a Series, which we don't need. Use this code snippet:
import pandas as pd

def wait_first_call(df):
    # index label of the row with the minimum datetime in this group
    idx = df['datetime'].idxmin()
    # .loc selects by label, so the group's original index is fine here
    first_time = df['seconds'].loc[idx]
    return pd.Series({'time': first_time})

df.groupby(['id_not_unique'], as_index=False).apply(wait_first_call)
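To see concretely why the original version failed, look at the second group: it keeps its original index labels after groupby, so .iloc receives a label that exceeds the group's row count (a sketch, using the sample df from the question):

group = df[df['id_not_unique'] == 222222222]
print(group.index)   # Index([2, 3], ...) -- original labels, but only 2 rows
idx = group.index[group.datetime == group['datetime'].min()]
print(idx)           # Index([3], ...)
# group['seconds'].iloc[idx]  # IndexError: position 3 in a 2-row Series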
Update: Without Using .apply
I have found apply to be slow in practice, so here is another approach.
First, store the index of each group's minimum datetime in a new min_idx column:
df['min_idx'] = (df.groupby('id_not_unique')['datetime']
.transform(lambda x: x.idxmin()))
Now filter the DataFrame to the rows where the index equals min_idx:
new_df = (df[df.index == df.min_idx]
          [['id_not_unique', 'seconds']]
          .rename(columns={'seconds': 'time'}))
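On the sample data this yields the same result as the apply version; the original index labels (0 and 3) are kept, and the helper min_idx column stays on df itself:

print(new_df)
#    id_not_unique  time
# 0      111111111   122
# 3      222222222    57
# drop the helper column when done
df = df.drop(columns='min_idx')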
Use DataFrame.sort_values with DataFrame.drop_duplicates, then remove the datetime column and rename:
df['datetime'] = pd.to_datetime(df['datetime'])
df = (df.sort_values(['id_not_unique', 'datetime'])
        .drop_duplicates('id_not_unique')
        .drop('datetime', axis=1)
        .rename(columns={'seconds': 'time'}))
print(df)
id_not_unique time
0 111111111 122
3 222222222 57
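This works because drop_duplicates keeps the first occurrence per id_not_unique by default, and after the ascending sort that is the row with the earliest datetime. Making the keep parameter explicit (equivalent to the snippet above):

df = (df.sort_values(['id_not_unique', 'datetime'])
        .drop_duplicates('id_not_unique', keep='first')
        .drop('datetime', axis=1)
        .rename(columns={'seconds': 'time'}))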
Or use DataFrameGroupBy.idxmin to get the index of the minimal datetime per group and select those rows with DataFrame.loc:
df['datetime'] = pd.to_datetime(df['datetime'])
df = (df.loc[df.groupby('id_not_unique')['datetime'].idxmin()]
        .drop('datetime', axis=1)
        .rename(columns={'seconds': 'time'}))
print(df)
id_not_unique time
0 111111111 122
3 222222222 57
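Both variants keep the original row labels (0 and 3 above); if a clean 0..n-1 index is preferred, reset_index can be chained on at the end (purely stylistic):

df = df.reset_index(drop=True)
print(df)
#    id_not_unique  time
# 0      111111111   122
# 1      222222222    57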