I would like to create a subindex for my dataframe based on the index. For example, I have a dataframe like this:
      Content        Date
ID
Bob  birthday  2010.03.01
Bob    school  2010.04.01
Tom  shopping  2010.02.01
Tom      work  2010.09.01
Tom   holiday  2010.10.01
I'd like to create a subindex for my ID, and the resulting dataframe would look like this:
              Content        Date
ID  subindex
Bob 1        birthday  2010.03.01
    2          school  2010.04.01
Tom 1        shopping  2010.02.01
    2            work  2010.09.01
    3         holiday  2010.10.01
To do this I need to first create my subindex list. I searched the documentation and it seems the neatest way is to use transform:
subindex = df['Date'].groupby(df.index).transform(lambda x: np.arange(1, len(x) + 1))
However, it is really slow. I looked around and found that apply can do the job too:
subindex = df['Date'].groupby(df.index).apply(lambda x: np.arange(1, len(x) + 1))
Of course the subindex needs to be flattened, since the result here is a list of per-group arrays (sketched just below). This works much faster than the transform method.
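The flattening itself can be done in one line; np.concatenate here is just one way to do it, shown as a sketch:
import numpy as np

# apply returns one array per group here, so concatenating them gives a single flat
# array; this assumes the groups are contiguous and in group order, as in this example
subindex = np.concatenate(subindex.values)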
Then I tested with a for loop of my own:
subindex_size = df.groupby(df.index, sort=False).size()
subindex = []
for i in np.arange(len(subindex_size)):
    subindex.extend(np.arange(1, subindex_size[i] + 1))
It's even faster. With my larger dataset (about 90k rows), the transform method takes about 44 seconds on my computer, apply takes ~2 seconds, and the for loop takes only ~1 second. I need to work on much larger datasets, so even the time difference between apply and the for loop matters to me. However, the for loop looks ugly and may not be easy to adapt if I need to create other group-based variables.
So my question is: why are the built-in functions that are supposed to do the right thing slower? Am I missing something here, or is there a reason for this? Is there any other way to improve this process?
You can use cumcount to do this:
In [11]: df.groupby(level=0).cumcount()
Out[11]:
ID
Bob 0
Bob 1
Tom 0
Tom 1
Tom 2
dtype: int64
In [12]: df['subindex'] = df.groupby(level=0).cumcount() # possibly + 1 here.
In [13]: df.set_index('subindex', append=True)
Out[13]:
              Content        Date
ID  subindex
Bob 0        birthday  2010.03.01
    1          school  2010.04.01
Tom 0        shopping  2010.02.01
    1            work  2010.09.01
    2         holiday  2010.10.01
To start at 1 (rather than 0) just add 1 to the result of cumcount.
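Putting it together, a minimal self-contained sketch (the sample frame below is just a reconstruction of the data from the question):
import pandas as pd

# reconstruct the example frame from the question
df = pd.DataFrame(
    {'Content': ['birthday', 'school', 'shopping', 'work', 'holiday'],
     'Date': ['2010.03.01', '2010.04.01', '2010.02.01', '2010.09.01', '2010.10.01']},
    index=pd.Index(['Bob', 'Bob', 'Tom', 'Tom', 'Tom'], name='ID'))

# number the rows within each ID (cumcount is 0-based, so add 1) and
# append the result as a second index level
df['subindex'] = df.groupby(level=0).cumcount() + 1
df = df.set_index('subindex', append=True)
print(df)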