
How to add a column after using python groupby if the function is also a custom function?

I am stuck on what is probably a very simple problem. My data looks something like this:

id_not_unique  datetime             seconds
111111111      2020-08-26 15:44:58  122
111111111      2020-08-28 15:33:45  34
222222222      2020-07-12 11:21:09  26
222222222      2019-04-21 14:22:42  57

I want to group by id_not_unique, find the minimum datetime in each group, and return the corresponding seconds value in a new column called time. So the result would look something like:

id_not_unique  time
111111111      122
222222222      57

I have tried this:

def wait_first_call(df):
    min_time = min(df['datetime'])

    idx = df.index[df.datetime == min_time]
    
    first_time = df['seconds'].iloc[idx]
    
    return first_time

then

df.groupby(['id_not_unique']).apply(wait_first_call)

But I keep getting "IndexError: positional indexers are out-of-bounds", and I don't understand why. I thought apply took each group as a dataframe and applied the function to that group?

Any suggestions/help would be greatly appreciated.

asked Jan 30 '26 by confused_donkey


2 Answers

There are a few issues with your code:

  1. .iloc indexes by position (row number, column number), not by index label. If you pass index labels to .iloc and one of them is greater than the number of rows in the group, it throws exactly the out-of-bounds error you are seeing. You can resolve this by using .loc, which takes index labels.
  2. Use df['datetime'].idxmin() to get the index label of the minimum datetime value. The problem with df.index[df.datetime == min_time] is that it returns an Index of matches even when there is only one match (format: Index([ind])), so df['seconds'].loc[idx] gives you a Series rather than the scalar you want.
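To see both points concretely, here is a small sketch. The index labels 2 and 3 mimic the global index that apply passes along with each group, which is what makes .iloc blow up:

```python
import pandas as pd

# One group as apply() would see it: two rows, but index labels 2 and 3
# (the original positions in the full dataframe).
g = pd.DataFrame({'datetime': pd.to_datetime(['2020-07-12 11:21:09',
                                              '2019-04-21 14:22:42']),
                  'seconds': [26, 57]},
                 index=[2, 3])

idx = g.index[g.datetime == g['datetime'].min()]  # Index([3])
# g['seconds'].iloc[idx]  # IndexError: position 3 is out of bounds for 2 rows

print(g['seconds'].loc[idx])                     # label-based: works, but a Series
print(g['seconds'].loc[g['datetime'].idxmin()])  # a plain scalar: 57
```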

Use this code snippet:

def wait_first_call(df):
    idx = df['datetime'].idxmin()
    first_time = df['seconds'].loc[idx]
    return pd.Series({'time': first_time}, index=['time'])

df.groupby(['id_not_unique'], as_index=False).apply(wait_first_call)
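A quick self-contained check of the snippet above, with the frame rebuilt from the question's sample data:

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'id_not_unique': [111111111, 111111111, 222222222, 222222222],
    'datetime': pd.to_datetime(['2020-08-26 15:44:58', '2020-08-28 15:33:45',
                                '2020-07-12 11:21:09', '2019-04-21 14:22:42']),
    'seconds': [122, 34, 26, 57]})

def wait_first_call(df):
    idx = df['datetime'].idxmin()          # index label of the earliest datetime
    first_time = df['seconds'].loc[idx]    # the matching seconds value (scalar)
    return pd.Series({'time': first_time}, index=['time'])

out = df.groupby(['id_not_unique'], as_index=False).apply(wait_first_call)
print(out)
```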

Update: Without Using .apply

I have found apply to be slow in general, so here is another approach:

First, get the index of the minimum datetime for each group in a helper column min_idx.

df['min_idx'] = (df.groupby('id_not_unique')['datetime']
                 .transform(lambda x: x.idxmin()))

Now filter the dataframe to the rows where the index equals the minimum index.

new_df = (df[df.index == df.min_idx]
            [['id_not_unique', 'seconds']]
            .rename(columns={'seconds': 'time'}))
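Putting the two steps together on the question's sample data (the frame is rebuilt from the post):

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'id_not_unique': [111111111, 111111111, 222222222, 222222222],
    'datetime': pd.to_datetime(['2020-08-26 15:44:58', '2020-08-28 15:33:45',
                                '2020-07-12 11:21:09', '2019-04-21 14:22:42']),
    'seconds': [122, 34, 26, 57]})

# Step 1: broadcast each group's idxmin back onto every row of the group
df['min_idx'] = (df.groupby('id_not_unique')['datetime']
                 .transform(lambda x: x.idxmin()))

# Step 2: keep only the rows that are their own group's minimum
new_df = (df[df.index == df.min_idx]
            [['id_not_unique', 'seconds']]
            .rename(columns={'seconds': 'time'}))
print(new_df)
#    id_not_unique  time
# 0      111111111   122
# 3      222222222    57
```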
answered Feb 02 '26 by Amit Vikram Singh


Use DataFrame.sort_values with DataFrame.drop_duplicates, then remove the datetime column and rename:

df['datetime'] = pd.to_datetime(df['datetime'])

df = (df.sort_values(['id_not_unique', 'datetime'])
        .drop_duplicates('id_not_unique')
        .drop('datetime', axis=1)
        .rename(columns={'seconds': 'time'}))
print(df)
   id_not_unique     time
0      111111111      122
3      222222222       57

Or use DataFrameGroupBy.idxmin for index by minimal datetime and select by DataFrame.loc:

df['datetime'] = pd.to_datetime(df['datetime'])

df = (df.loc[df.groupby('id_not_unique')['datetime'].idxmin()]
        .drop('datetime', axis=1)
        .rename(columns={'seconds': 'time'}))
print(df)
   id_not_unique     time
0      111111111      122
3      222222222       57
answered Feb 02 '26 by jezrael