The following code works well. Just checking: am I using and timing Pandas correctly and is there any faster way? Thanks.
$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> import timeit
>>> pd.__version__
'0.14.1'

def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

def randFloat(numGrp, N):
    things = [round(100 * np.random.random(), 4) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N = int(1e8)
K = 100
DF = pd.DataFrame({
    'id1': randChar("id%03d", K, N),      # large groups (char)
    'id2': randChar("id%03d", K, N),      # large groups (char)
    'id3': randChar("id%010d", N//K, N),  # small groups (char)
    'id4': np.random.choice(K, N),        # large groups (int)
    'id5': np.random.choice(K, N),        # large groups (int)
    'id6': np.random.choice(N//K, N),     # small groups (int)
    'v1':  np.random.choice(5, N),        # int in range [0,5)
    'v2':  np.random.choice(5, N),        # int in range [0,5)
    'v3':  randFloat(100, N)              # numeric, e.g. 23.5749
})
Now time 5 different groupings, repeating each one twice to confirm the timings. [I realise timeit(2) runs it twice, but it then reports only the total; I'm interested in the times of the first and second runs separately.] Python uses about 10 GB of RAM according to htop during these tests.
>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})", "from __main__ import DF").timeit(1)
5.604133386000285
>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})", "from __main__ import DF").timeit(1)
5.505057081000359
>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})", "from __main__ import DF").timeit(1)
14.232032927000091
>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})", "from __main__ import DF").timeit(1)
14.242601240999647
>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})", "from __main__ import DF").timeit(1)
22.87025260900009
>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})", "from __main__ import DF").timeit(1)
22.393589012999655
>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})", "from __main__ import DF").timeit(1)
2.9725865330001398
>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})", "from __main__ import DF").timeit(1)
2.9683854739996605
>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})", "from __main__ import DF").timeit(1)
12.776488024999708
>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})", "from __main__ import DF").timeit(1)
13.558292575999076
Here is the system info:
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2500.048
BogoMIPS:              5066.38
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

$ free -h
             total       used       free     shared    buffers     cached
Mem:          240G        74G       166G       372K        33M       550M
-/+ buffers/cache:        73G       166G
Swap:           0B         0B         0B
I don't believe it's relevant, but just in case: the randChar function above is a workaround for a memory error in mtrand.RandomState.choice; see "How to solve memory error in mtrand.RandomState.choice?"
Dask runs this query faster than pandas, even when the most inefficient column type is used, because it parallelises the computation. pandas uses only one CPU core to run the query; my machine has 4 cores, and Dask uses all of them.
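As a rough sketch of what that looks like (the partition count is an assumption; tune it to your core count and data size), the DF from the question can be wrapped in a Dask dataframe and the same aggregation run in parallel:

import dask.dataframe as dd

# Wrap the existing pandas frame; npartitions=8 is an assumption.
ddf = dd.from_pandas(DF, npartitions=8)

# The same groupby; compute() triggers the parallel execution across cores.
result = ddf.groupby('id1').agg({'v1': 'sum'}).compute()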
On the large frame, the query method seems more efficient than loc. On DF2 (2K records x 6 columns), loc seems much more efficient than query.
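A minimal sketch of the two filtering styles, assuming DF2 shares the v1 column (the filter condition itself is illustrative):

# String-expression filter; pandas evaluates this via numexpr,
# which tends to pay off on large frames like DF.
big_hits = DF.query("v1 == 3")

# Boolean-mask filter via loc; on a small frame like DF2 this avoids
# query's parsing overhead and tends to be faster.
small_hits = DF2.loc[DF2['v1'] == 3]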
You can speed up the execution even further with another trick: make your pandas dataframes lighter by using more efficient data types. Since we know df only contains integers from 1 to 10, we can reduce the data type from 64 bits to 16 bits. See how we reduced the size of our dataframe from 38 MB to 9.5 MB.
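A minimal, self-contained sketch of the idea (the frame shape and exact sizes here are illustrative, not the original df):

import numpy as np
import pandas as pd

# A frame of small integers (1..10), stored as int64 by default.
df = pd.DataFrame(np.random.randint(1, 11, size=(1000000, 5)))
print(df.memory_usage(deep=True).sum())    # roughly 40 MB as int64

# The values fit comfortably in 16 bits, so downcast every column.
df16 = df.astype(np.int16)
print(df16.memory_usage(deep=True).sum())  # roughly 10 MB as int16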
If you'd like to install the IPython shell, you can easily time your code using %timeit. After installing it, instead of typing python to launch the Python interpreter, you would type ipython.
You can then type your code exactly as you would in the normal interpreter (as you did above). Then, for example:
%timeit DF.groupby(['id1']).agg({'v1':'sum'})
This accomplishes exactly the same thing as what you've done, but if you're using Python a lot, I find it saves significant typing time :).
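Since you want the first and second runs reported separately, the %time magic (which times a single execution per call) may fit even better than %timeit; a sketch:

%time DF.groupby(['id1']).agg({'v1':'sum'})   # first run, wall time reported
%time DF.groupby(['id1']).agg({'v1':'sum'})   # second run, timed separately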
IPython has a lot of other nice features, such as %paste (which I used to paste in your code and test this), %run (to run a script you've saved in a file), tab completion, and more: http://ipython.org/