The following code works well. Just checking: am I using and timing Pandas correctly and is there any faster way? Thanks.
$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> import timeit
>>> pd.__version__
'0.14.1'

def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

def randFloat(numGrp, N):
    things = [round(100 * np.random.random(), 4) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N = int(1e8)
K = 100
DF = pd.DataFrame({
    'id1': randChar("id%03d", K, N),      # large groups (char)
    'id2': randChar("id%03d", K, N),      # large groups (char)
    'id3': randChar("id%010d", N//K, N),  # small groups (char)
    'id4': np.random.choice(K, N),        # large groups (int)
    'id5': np.random.choice(K, N),        # large groups (int)
    'id6': np.random.choice(N//K, N),     # small groups (int)
    'v1':  np.random.choice(5, N),        # int in range [0,5)
    'v2':  np.random.choice(5, N),        # int in range [0,5)
    'v3':  randFloat(100, N)              # numeric, e.g. 23.5749
})
Now time 5 different groupings, repeating each one twice to confirm the timings. [I realise timeit(2) runs it twice, but it then reports only the total; I'm interested in the times of the first and second runs separately.] Python uses about 10 GB of RAM according to htop during these tests.
>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})", "from __main__ import DF").timeit(1)
5.604133386000285
>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})", "from __main__ import DF").timeit(1)
5.505057081000359
>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})", "from __main__ import DF").timeit(1)
14.232032927000091
>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})", "from __main__ import DF").timeit(1)
14.242601240999647
>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})", "from __main__ import DF").timeit(1)
22.87025260900009
>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})", "from __main__ import DF").timeit(1)
22.393589012999655
>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})", "from __main__ import DF").timeit(1)
2.9725865330001398
>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})", "from __main__ import DF").timeit(1)
2.9683854739996605
>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})", "from __main__ import DF").timeit(1)
12.776488024999708
>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})", "from __main__ import DF").timeit(1)
13.558292575999076
Here is the system info:
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2500.048
BogoMIPS:              5066.38
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

$ free -h
             total       used       free     shared    buffers     cached
Mem:          240G        74G       166G       372K        33M       550M
-/+ buffers/cache:        73G       166G
Swap:           0B         0B         0B
I don't believe it's relevant, but just in case: the randChar function above is a workaround for a memory error in mtrand.RandomState.choice; see "How to solve memory error in mtrand.RandomState.choice?"
Dask runs this query faster than pandas, even when the most inefficient column type is used, because it parallelises the computation. pandas uses only one CPU core to run the query; my machine has 4 cores, and Dask uses all of them.
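As a rough sketch of what that looks like (the partition count is an assumption; tune it to your core count and data size), the DF from the question can be wrapped in a Dask dataframe and the same aggregation run in parallel:

import dask.dataframe as dd

# Wrap the existing pandas frame; npartitions=8 is an assumption.
ddf = dd.from_pandas(DF, npartitions=8)

# The same groupby; compute() triggers the parallel execution across cores.
result = ddf.groupby('id1').agg({'v1': 'sum'}).compute()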
On the large frame, the query method seems more efficient than loc. On DF2 (2K records x 6 columns), loc seems much more efficient than query.
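A minimal sketch of the two filtering styles, assuming DF2 shares the v1 column (the filter condition itself is illustrative):

# String-expression filter; pandas evaluates this via numexpr,
# which tends to pay off on large frames like DF.
big_hits = DF.query("v1 == 3")

# Boolean-mask filter via loc; on a small frame like DF2 this avoids
# query's parsing overhead and tends to be faster.
small_hits = DF2.loc[DF2['v1'] == 3]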
You can speed up the execution even further with another trick: make your pandas dataframes lighter by using more efficient data types. Since we know df only contains integers from 1 to 10, we can reduce the data type from 64 bits to 16 bits. See how we reduced the size of our dataframe from 38 MB to 9.5 MB.
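A minimal, self-contained sketch of the idea (the frame shape and exact sizes here are illustrative, not the original df):

import numpy as np
import pandas as pd

# A frame of small integers (1..10), stored as int64 by default.
df = pd.DataFrame(np.random.randint(1, 11, size=(1000000, 5)))
print(df.memory_usage(deep=True).sum())    # roughly 40 MB as int64

# The values fit comfortably in 16 bits, so downcast every column.
df16 = df.astype(np.int16)
print(df16.memory_usage(deep=True).sum())  # roughly 10 MB as int16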
If you'd like to install the IPython shell, you can easily time your code using %timeit. After installing it, instead of typing python to launch the Python interpreter, you would type ipython.
You can then type your code exactly as you would in the normal interpreter (as you did above). Then, for example:
%timeit DF.groupby(['id1']).agg({'v1':'sum'})
This accomplishes exactly the same thing as what you've done, but if you're using Python a lot, I find it saves significant typing time :).
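Since you want the first and second runs reported separately, the %time magic (which times a single execution per call) may fit even better than %timeit; a sketch:

%time DF.groupby(['id1']).agg({'v1':'sum'})   # first run, wall time reported
%time DF.groupby(['id1']).agg({'v1':'sum'})   # second run, timed separately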
IPython has a lot of other nice features, such as %paste (which I used to paste in your code and test this), %run (to run a script you've saved in a file), tab completion, and more: http://ipython.org/