I would like to create a subindex for my dataframe based on the index. For example, I have a dataframe like this:
      Content        Date
ID                       
Bob  birthday  2010.03.01
Bob    school  2010.04.01
Tom  shopping  2010.02.01
Tom      work  2010.09.01
Tom   holiday  2010.10.01
I'd like create a subindex for for my ID and the resulting dataframe looks like below: 
               Content        Date
ID  subindex                      
Bob 1         birthday  2010.03.01
    2           school  2010.04.01
Tom 1         shopping  2010.02.01
    2             work  2010.09.01
    3          holiday  2010.10.01
To do this I need to first create my subindex list. I searched in the help document and it seems to most neat way is to use transform:
subindex = df['Date'].groupby(df.index).transform(lambda x: np.arange(1, len(x) + 1))
However, it is really slow. I looked around and found apply can do the work too: 
subindex = df['Date'].groupby(df.index).apply(lambda x: np.arange(1, len(x) + 1))
Of course the subindex needed to be flattened since it is a list of lists here. This works much faster than the transform method. Then I tested with a for loop of my own: 
subindex_size = df.groupby(df.index, sort = False).size()
subindex = []
for i in np.arange(len(subindex_size)):
    subindex.extend(np.arange(1,subindex_size[i]+1))
It's even faster. With my larger dataset (about 90k rows), the transform method takes about 44 secs on my computer, apply takes ~2 secs and the for loop takes only ~1 secs. I need to work on much larger dataset so even the time difference between the apply and for loop makes a difference to me. However, the for loop looks ugly and may not be easily applied if I need to create other group-based variables. 
So my question is, why the built-in functions that are supposed to do the right thing are slower? Am I missing something here or is there a reason for this? Is there any other way to improve this process?
Using apply to loop over pandas dataframe A much better way to perform an operation on every row of a pandas dataframe is to use the apply method. In the piece of code below, we are replacing looping explicitly over rows with apply that goes every every row and applies the lambda expression we defined to add 1.
pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame . pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefix read_* .
You can use cumcount to do this:
In [11]: df.groupby(level=0).cumcount()
Out[11]: 
ID
Bob    0
Bob    1
Tom    0
Tom    1
Tom    2
dtype: int64
In [12]: df['subindex'] = df.groupby(level=0).cumcount()  # possibly + 1 here.
In [13]: df.set_index('subindex', append=True)
Out[13]: 
               Content        Date
ID  subindex                      
Bob 0         birthday  2010.03.01
    1           school  2010.04.01
Tom 0         shopping  2010.02.01
    1             work  2010.09.01
    2          holiday  2010.10.01
To start at 1 (rather than 0) just add 1 to the result of cumcount.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With