
Pandas: How to create subindex efficiently?

I would like to create a subindex for my dataframe based on the index. For example, I have a dataframe like this:

      Content        Date
ID                       
Bob  birthday  2010.03.01
Bob    school  2010.04.01
Tom  shopping  2010.02.01
Tom      work  2010.09.01
Tom   holiday  2010.10.01

I'd like to create a subindex for my ID so that the resulting dataframe looks like this:

               Content        Date
ID  subindex                      
Bob 1         birthday  2010.03.01
    2           school  2010.04.01
Tom 1         shopping  2010.02.01
    2             work  2010.09.01
    3          holiday  2010.10.01

To do this, I first need to create my subindex list. I searched the documentation, and the neatest way seems to be to use transform:

subindex = df['Date'].groupby(df.index).transform(lambda x: np.arange(1, len(x) + 1))

However, it is really slow. I looked around and found apply can do the work too:

subindex = df['Date'].groupby(df.index).apply(lambda x: np.arange(1, len(x) + 1))

Of course, the subindex needs to be flattened, since it is a list of lists here. This works much faster than the transform method. Then I tested a for loop of my own:

subindex_size = df.groupby(df.index, sort=False).size()
subindex = []
for size in subindex_size:  # iterate over the group sizes directly
    subindex.extend(np.arange(1, size + 1))

It's even faster. With my larger dataset (about 90k rows), the transform method takes about 44 seconds on my computer, apply takes about 2 seconds, and the for loop takes only about 1 second. I need to work on much larger datasets, so even the time difference between apply and the for loop matters to me. However, the for loop looks ugly and may not be easy to adapt if I need to create other group-based variables.

So my question is: why are the built-in functions that are supposed to do the right thing slower? Am I missing something here, or is there a reason for this? Is there any other way to improve this process?

Zhen Sun asked Mar 26 '14 20:03



1 Answer

You can use cumcount to do this:

In [11]: df.groupby(level=0).cumcount()
Out[11]: 
ID
Bob    0
Bob    1
Tom    0
Tom    1
Tom    2
dtype: int64

In [12]: df['subindex'] = df.groupby(level=0).cumcount()  # possibly + 1 here.

In [13]: df.set_index('subindex', append=True)
Out[13]: 
               Content        Date
ID  subindex                      
Bob 0         birthday  2010.03.01
    1           school  2010.04.01
Tom 0         shopping  2010.02.01
    1             work  2010.09.01
    2          holiday  2010.10.01

To start at 1 (rather than 0) just add 1 to the result of cumcount.
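Putting the steps together, here is a minimal self-contained sketch that reproduces the 1-based layout asked for in the question. The sample frame is rebuilt by hand from the data shown in the question, and the variable name result is only illustrative:

```python
import pandas as pd

# Sample data copied from the question; "ID" is the index.
df = pd.DataFrame(
    {
        "Content": ["birthday", "school", "shopping", "work", "holiday"],
        "Date": ["2010.03.01", "2010.04.01", "2010.02.01", "2010.09.01", "2010.10.01"],
    },
    index=pd.Index(["Bob", "Bob", "Tom", "Tom", "Tom"], name="ID"),
)

# cumcount numbers the rows within each group starting at 0;
# adding 1 gives a subindex that starts at 1.
df["subindex"] = df.groupby(level=0).cumcount() + 1

# Append the new column to the existing index to get a two-level index.
result = df.set_index("subindex", append=True)
print(result)
```

Because cumcount numbers rows in their existing order within each group, the frame does not need to be sorted by ID first.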

Andy Hayden answered Sep 21 '22 12:09