I would like to create a subindex for my dataframe based on the index. For example, I have a dataframe like this:
      Content        Date
ID
Bob  birthday  2010.03.01
Bob    school  2010.04.01
Tom  shopping  2010.02.01
Tom      work  2010.09.01
Tom   holiday  2010.10.01
I'd like to create a subindex for my ID, and the resulting dataframe would look like this:
              Content        Date
ID  subindex
Bob 1        birthday  2010.03.01
    2          school  2010.04.01
Tom 1        shopping  2010.02.01
    2            work  2010.09.01
    3         holiday  2010.10.01
To do this I need to first create my subindex list. I searched the documentation and it seems the neatest way is to use transform:
subindex = df['Date'].groupby(df.index).transform(lambda x: np.arange(1, len(x) + 1))
However, it is really slow. I looked around and found that apply can do the job too:
subindex = df['Date'].groupby(df.index).apply(lambda x: np.arange(1, len(x) + 1))
Of course the subindex needs to be flattened, since the result here is a list of per-group arrays (sketched just below). This works much faster than the transform method.
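The flattening itself can be done in one line; np.concatenate here is just one way to do it, shown as a sketch:
import numpy as np

# apply returns one array per group here, so concatenating them gives a single flat
# array; this assumes the groups are contiguous and in group order, as in this example
subindex = np.concatenate(subindex.values)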
Then I tested with a for loop of my own:
subindex_size = df.groupby(df.index, sort=False).size()
subindex = []
for i in np.arange(len(subindex_size)):
    subindex.extend(np.arange(1, subindex_size[i] + 1))
It's even faster. With my larger dataset (about 90k rows), the transform method takes about 44 seconds on my computer, apply takes ~2 seconds, and the for loop takes only ~1 second. I need to work on much larger datasets, so even the time difference between apply and the for loop matters to me. However, the for loop looks ugly and may not be easy to adapt if I need to create other group-based variables.
So my question is: why are the built-in functions that are supposed to do the right thing slower? Am I missing something here, or is there a reason for this? Is there any other way to improve this process?
You can use cumcount to do this:
In [11]: df.groupby(level=0).cumcount()
Out[11]:
ID
Bob 0
Bob 1
Tom 0
Tom 1
Tom 2
dtype: int64
In [12]: df['subindex'] = df.groupby(level=0).cumcount() # possibly + 1 here.
In [13]: df.set_index('subindex', append=True)
Out[13]:
              Content        Date
ID  subindex
Bob 0        birthday  2010.03.01
    1          school  2010.04.01
Tom 0        shopping  2010.02.01
    1            work  2010.09.01
    2         holiday  2010.10.01
To start at 1 (rather than 0) just add 1 to the result of cumcount.
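Putting it together, a minimal self-contained sketch (the sample frame below is just a reconstruction of the data from the question):
import pandas as pd

# reconstruct the example frame from the question
df = pd.DataFrame(
    {'Content': ['birthday', 'school', 'shopping', 'work', 'holiday'],
     'Date': ['2010.03.01', '2010.04.01', '2010.02.01', '2010.09.01', '2010.10.01']},
    index=pd.Index(['Bob', 'Bob', 'Tom', 'Tom', 'Tom'], name='ID'))

# number the rows within each ID (cumcount is 0-based, so add 1) and
# append the result as a second index level
df['subindex'] = df.groupby(level=0).cumcount() + 1
df = df.set_index('subindex', append=True)
print(df)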