I have a dataset with 12 million rows, with 3 columns as unique identifiers and another 2 columns with values. I'm trying to do a rather simple task:
- group by the three identifiers. This yields about 2.6 million unique combinations
- Task 1: calculate the median for column Val1
- Task 2: calculate the median for column Val1, given some condition on Val2
Here are my results, using pandas and data.table (both latest versions at the moment, on the same machine):
+-----------------+-----------------+------------+
| | pandas | data.table |
+-----------------+-----------------+------------+
| TASK 1 | 150 seconds | 4 seconds |
| TASK 1 + TASK 2 | doesn't finish | 5 seconds |
+-----------------+-----------------+------------+
I think I may be doing something wrong with pandas - transforming Grp1 and Grp2 into categories didn't help a lot, nor did switching between .agg and .apply. Any ideas?
Below is the reproducible code.
Dataframe generation:
import numpy as np
import pandas as pd
from collections import OrderedDict
import time
np.random.seed(123)
list1 = list(pd.util.testing.rands_array(10, 750))
list2 = list(pd.util.testing.rands_array(10, 700))
list3 = list(np.random.randint(100000,200000,5))
N = 12 * 10**6 # please make sure you have enough RAM
df = pd.DataFrame({'Grp1': np.random.choice(list1, N, replace = True),
                   'Grp2': np.random.choice(list2, N, replace = True),
                   'Grp3': np.random.choice(list3, N, replace = True),
                   'Val1': np.random.randint(0, 100, N),
                   'Val2': np.random.randint(0, 10, N)})
# this works and shows there are 2,625,000 unique combinations
df_test = df.groupby(['Grp1','Grp2','Grp3']).size()
print(df_test.shape[0]) # 2,625,000 rows
# export to feather so that same df goes into R
df.to_feather('file.feather')
Task 1 in Python:
# TASK 1: 150 seconds (sorted / not sorted doesn't seem to matter)
df.sort_values(['Grp1','Grp2','Grp3'], inplace = True)
t0 = time.time()
df_agg1 = df.groupby(['Grp1','Grp2','Grp3']).agg({'Val1':[np.median]})
t1 = time.time()
print("Duration for complex: %s seconds ---" % (t1 - t0))
Task 1 + Task 2 in Python:
# TASK 1 + TASK 2: this kept running for 10 minutes to no avail
# (sorted / not sorted doesn't seem to matter)
def f(x):
    d = OrderedDict()
    d['Median_all'] = np.median(x['Val1'])
    d['Median_lt_5'] = np.median(x['Val1'][x['Val2'] < 5])
    return pd.Series(d)
t0 = time.time()
df_agg2 = df.groupby(['Grp1','Grp2','Grp3']).apply(f)
t1 = time.time()
print("Duration for complex: %s seconds ---" % (t1 - t0)) # didn't complete
Equivalent R code:
library(data.table)
library(feather)
DT = setDT(feater("file.feather"))
system.time({
  DT_agg <- DT[, .(Median_all = median(Val1),
                   Median_lt_5 = median(Val1[Val2 < 5])), by = c('Grp1','Grp2','Grp3')]
}) # 5 seconds
I can't reproduce your R results. I fixed the typo where you misspelled feather, but I get the following:
Error in `[.data.table`(DT, , .(Median_all = median(Val1), Median_lt_5 = median(Val1[Val2 < :
column or expression 1 of 'by' or 'keyby' is type NULL. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]
As to the Python example, if you want to get the median for each group where Val2 is less than 5, then you should filter first, as in:
df[df.Val2 < 5].groupby(['Grp1','Grp2','Grp3'])['Val1'].median()
This completes in under 8 seconds on my macbook pro.
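A rough sketch of how both of your tasks could then be combined along these lines (one way to assemble it, not timed as carefully): compute each median with a plain groupby aggregation and align them on the group keys:
# Sketch: compute the unconditional and the filtered median separately with
# plain groupby aggregations, then align them on the group keys. Groups with
# no rows where Val2 < 5 get NaN for Median_lt_5.
keys = ['Grp1', 'Grp2', 'Grp3']
median_all  = df.groupby(keys)['Val1'].median().rename('Median_all')
median_lt_5 = df[df.Val2 < 5].groupby(keys)['Val1'].median().rename('Median_lt_5')
df_agg2 = pd.concat([median_all, median_lt_5], axis=1).reset_index()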