 

Is this the fastest way to group in Pandas?

The following code works well. Just checking: am I using and timing Pandas correctly, and is there any faster way? Thanks.

$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> import timeit
>>> pd.__version__
'0.14.1'

def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

def randFloat(numGrp, N):
    things = [round(100 * np.random.random(), 4) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N = int(1e8)
K = 100
DF = pd.DataFrame({
    'id1': randChar("id%03d", K, N),      # large groups (char)
    'id2': randChar("id%03d", K, N),      # large groups (char)
    'id3': randChar("id%010d", N//K, N),  # small groups (char)
    'id4': np.random.choice(K, N),        # large groups (int)
    'id5': np.random.choice(K, N),        # large groups (int)
    'id6': np.random.choice(N//K, N),     # small groups (int)
    'v1': np.random.choice(5, N),         # int in range [0,5)
    'v2': np.random.choice(5, N),         # int in range [0,5)
    'v3': randFloat(100, N)               # numeric e.g. 23.5749
})

Now time 5 different groupings, repeating each one twice to confirm the timing. [I realise timeit(2) runs it twice, but then it reports the total. I'm interested in the time of the first and second run separately.] Python uses about 10G of RAM according to htop during these tests.
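The first-run/second-run pattern above can also be captured with a small helper built on time.perf_counter. This is a minimal sketch, not part of the original benchmark; the helper name and the tiny stand-in frame are illustrative (the real benchmark uses N=1e8 rows):

```python
import time

import numpy as np
import pandas as pd


def time_runs(fn, repeats=2):
    """Call fn `repeats` times, timing each call separately (cold vs warm run)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Small stand-in frame for demonstration.
df = pd.DataFrame({'id1': np.random.choice(['a', 'b', 'c'], 1000),
                   'v1': np.random.choice(5, 1000)})
first, second = time_runs(lambda: df.groupby(['id1']).agg({'v1': 'sum'}))
```

Unlike timeit(2), which reports one total, this returns each run's duration, so the cold and warm timings stay visible.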

>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})", "from __main__ import DF").timeit(1)
5.604133386000285
>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})", "from __main__ import DF").timeit(1)
5.505057081000359

>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})", "from __main__ import DF").timeit(1)
14.232032927000091
>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})", "from __main__ import DF").timeit(1)
14.242601240999647

>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})", "from __main__ import DF").timeit(1)
22.87025260900009
>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})", "from __main__ import DF").timeit(1)
22.393589012999655

>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})", "from __main__ import DF").timeit(1)
2.9725865330001398
>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})", "from __main__ import DF").timeit(1)
2.9683854739996605

>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})", "from __main__ import DF").timeit(1)
12.776488024999708
>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})", "from __main__ import DF").timeit(1)
13.558292575999076
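On the "is there any faster way" part of the question, one commonly suggested tweak (not from the question itself) is to store string group keys as a pandas categorical dtype, so the groupby works on integer codes rather than Python strings. A hedged sketch on a small illustrative frame (the column and variable names are mine):

```python
import numpy as np
import pandas as pd

n = 100_000
labels = ["id%03d" % i for i in range(100)]
df = pd.DataFrame({
    "id1": np.random.choice(labels, n),   # plain string/object key column
    "v1": np.random.choice(5, n),
})

# Same data, but with the group key stored as a categorical dtype.
df_cat = df.assign(id1=df["id1"].astype("category"))

# Both spellings produce the same aggregates; the categorical version
# typically groups faster and uses less memory on large string keys.
a = df.groupby("id1").agg({"v1": "sum"})
b = df_cat.groupby("id1").agg({"v1": "sum"})
```

Whether this helps here depends on the pandas version (categoricals were young in 0.14.1) and on how often the same frame is grouped, since the conversion itself costs a pass over the data.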

Here is the system info:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2500.048
BogoMIPS:              5066.38
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

$ free -h
             total       used       free     shared    buffers     cached
Mem:          240G        74G       166G       372K        33M       550M
-/+ buffers/cache:        73G       166G
Swap:           0B         0B         0B

I don't believe it's relevant but just in case, the randChar function above is a workaround for a memory error in mtrand.RandomState.choice :

How to solve memory error in mtrand.RandomState.choice?

Asked Sep 02 '14 by Matt Dowle

People also ask

What is faster than pandas DataFrame?

Dask runs faster than pandas for this kind of query, even when the most inefficient column type is used, because it parallelizes the computation. pandas uses only 1 CPU core to run the query; my computer has 4 cores, and Dask uses all of them.
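The split-then-combine shape that engines like Dask run in parallel can be sketched in plain pandas. This is an illustration of the idea, not a Dask example; the frame and variable names are mine:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id1": np.random.choice(["a", "b", "c"], 10_000),
    "v1": np.random.choice(5, 10_000),
})

# Split into 4 chunks, aggregate each independently, then combine the partials.
# A parallel engine runs the per-chunk stage across cores; a plain pandas
# groupby does the whole pass on one core.
chunk = len(df) // 4
parts = [df.iloc[i * chunk:(i + 1) * chunk] for i in range(4)]
partials = [p.groupby("id1")["v1"].sum() for p in parts]
combined = pd.concat(partials).groupby(level=0).sum()

# The chunked result matches a single-pass groupby.
single = df.groupby("id1")["v1"].sum()
```

This works because sum (like count or min/max) can be computed per chunk and then re-aggregated; means need a sum-and-count pair per chunk instead.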

Is pandas query faster than LOC?

DF1: 2M records x 6 columns. The query function seems more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seems much more efficient than the query function.
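The two spellings being compared select the same rows; roughly, query() parses an expression string (and evaluates it with numexpr when that package is installed), which adds fixed overhead that only pays off on larger frames. A small illustrative comparison (the frame is mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})

# Same selection, two spellings: an expression string vs a boolean mask.
via_query = df.query("a > 5")
via_loc = df.loc[df["a"] > 5]
```

Both return the rows where a is 6 through 9; only the evaluation path, and hence the speed profile, differs.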

How do you make apply faster in pandas?

You can speed up the execution even further with another trick: making your pandas dataframes lighter by using more efficient data types. As we know that df only contains integers from 1 to 10, we can reduce the data type from 64 bits to 16 bits. See how we reduced the size of our dataframe from 38MB to 9.5MB.
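The downcasting trick described above can be sketched as follows; the column name and frame are illustrative, but the 4x memory ratio between int64 and int16 is general:

```python
import numpy as np
import pandas as pd

# A column known to hold only small integers (1..10), stored as 64-bit by default.
df = pd.DataFrame({"x": np.random.randint(1, 11, size=100_000)})

# Downcast to 16 bits: same values, a quarter of the per-element memory.
small = df.astype({"x": "int16"})

before = df["x"].memory_usage(deep=True)
after = small["x"].memory_usage(deep=True)
```

The downcast is only safe when the values are known to fit the smaller type; int16 overflows silently outside [-32768, 32767].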


1 Answer

If you install the IPython shell, you can easily time your code using %timeit. After installing it, instead of typing python to launch the Python interpreter, you would type ipython.

Then you can type your code exactly as you would type it in the normal interpreter (as you did above).

Then you can type, for example:

%timeit DF.groupby(['id1']).agg({'v1':'sum'}) 

This will accomplish exactly the same thing as what you've done, but if you use Python a lot, I find it saves significant typing time :).

IPython has a lot of other nice features (like %paste, which I used to paste in your code and test this, or %run to run a script you've saved in a file), tab completion, and more: http://ipython.org/

Answered Oct 18 '22 by tegan