Why is R's data.table so much faster than pandas?

I have a 12-million-row dataset with 3 columns as unique identifiers and another 2 columns with values. I'm trying to do a rather simple task:
- group by the three identifiers. This yields about 2.6 million unique combinations
- Task 1: calculate the median for column Val1
- Task 2: calculate the median for column Val1 given some condition on Val2 (a toy illustration follows this list)
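
To make the two tasks concrete, here is a small hand-checkable toy sketch (the toy values are illustrative only; the column names match the reproducible code below):

import pandas as pd

toy = pd.DataFrame({'Grp1': ['a','a','a','b','b','b'],
                    'Grp2': ['x','x','x','y','y','y'],
                    'Grp3': [1, 1, 1, 2, 2, 2],
                    'Val1': [10, 20, 30, 40, 50, 60],
                    'Val2': [1, 7, 3, 9, 2, 4]})

# Task 1: median of Val1 per group -> group a: 20, group b: 50
print(toy.groupby(['Grp1','Grp2','Grp3'])['Val1'].median())

# Task 2: median of Val1 per group, keeping only rows with Val2 < 5
# -> group a: median(10, 30) = 20, group b: median(50, 60) = 55
print(toy[toy.Val2 < 5].groupby(['Grp1','Grp2','Grp3'])['Val1'].median())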

Here are my results, using pandas and data.table (both latest versions at the moment, on the same machine):

+-----------------+-----------------+------------+
|                 |      pandas     | data.table |
+-----------------+-----------------+------------+
| TASK 1          | 150 seconds     | 4 seconds  |
| TASK 1 + TASK 2 |  doesn't finish | 5 seconds  |
+-----------------+-----------------+------------+

I think I may be doing something wrong with pandas: converting Grp1 and Grp2 to categorical dtype didn't help much, nor did switching between .agg and .apply. Any ideas?
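
For reference, the category conversion I tried amounts to the astype calls below (a sketch against the df built in the reproducible code; skip it to reproduce the plain-string timings):

# Sketch of the categorical attempt mentioned above; df is defined below.
# Categories store each repeated string once and group on integer codes,
# but here the conversion did not noticeably improve the timings.
df['Grp1'] = df['Grp1'].astype('category')
df['Grp2'] = df['Grp2'].astype('category')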

Below is the reproducible code.
Dataframe generation:

import numpy as np
import pandas as pd
from collections import OrderedDict
import time

np.random.seed(123)
list1 = list(pd.util.testing.rands_array(10, 750))
list2 = list(pd.util.testing.rands_array(10, 700))
list3 = list(np.random.randint(100000,200000,5))

N = 12 * 10**6 # please make sure you have enough RAM
df = pd.DataFrame({'Grp1': np.random.choice(list1, N, replace = True),
                   'Grp2': np.random.choice(list2, N, replace = True),
                   'Grp3': np.random.choice(list3, N, replace = True),
                   'Val1': np.random.randint(0,100,N),
                   'Val2': np.random.randint(0,10,N)}) 


# this works and shows there are 2,625,000 unique combinations (= 750 * 700 * 5 possible keys)
df_test = df.groupby(['Grp1','Grp2','Grp3']).size()
print(df_test.shape[0]) # 2,625,000 rows

# export to feather so that same df goes into R
df.to_feather('file.feather')

Task 1 in Python:

# TASK 1: 150 seconds (sorted / not sorted doesn't seem to matter)
df.sort_values(['Grp1','Grp2','Grp3'], inplace = True)
t0 = time.time()
df_agg1 = df.groupby(['Grp1','Grp2','Grp3']).agg({'Val1':[np.median]})
t1 = time.time()
print("Duration for complex: %s seconds ---" % (t1 - t0))

Task 1 + Task 2 in Python:

# TASK 1 + TASK 2: this kept running for 10 minutes to no avail
# (sorted / not sorted doesn't seem to matter)
def f(x):
    d = OrderedDict()
    d['Median_all'] = np.median(x['Val1'])
    d['Median_lt_5'] = np.median(x['Val1'][x['Val2'] < 5])
    return pd.Series(d)

t0 = time.time()
df_agg2 = df.groupby(['Grp1','Grp2','Grp3']).apply(f)
t1 = time.time()
print("Duration for complex: %s seconds ---" % (t1 - t0)) # didn't complete

Equivalent R code:

library(data.table)
library(feather)

DT = setDT(feater("file.feather"))
system.time({
  DT_agg <- DT[, .(Median_all = median(Val1),
                   Median_lt_5 = median(Val1[Val2 < 5])), by = c('Grp1','Grp2','Grp3')]
}) # 5 seconds
asked Mar 16 '18 by BogdanC


1 Answer

I can't reproduce your R results. I fixed the typo where you misspelled feather, but I still get the following error:

Error in `[.data.table`(DT, , .(Median_all = median(Val1), Median_lt_5 = median(Val1[Val2 <  : 
column or expression 1 of 'by' or 'keyby' is type NULL. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))] 

As for the Python example: if you want the median of Val1 for each group where Val2 is less than 5, you should filter first, as in:

df[df.Val2 < 5].groupby(['Grp1','Grp2','Grp3'])['Val1'].median()

This completes in under 8 seconds on my MacBook Pro.
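
Building on the filter-first idea, both aggregates from Task 1 + Task 2 can be computed without .apply at all. A minimal sketch (aligning the two results via pd.concat on the shared group index is my own choice; the Median_all / Median_lt_5 names mirror the question's R code):

# Compute each aggregate with a vectorized groupby, then align the two
# Series on their shared (Grp1, Grp2, Grp3) index. Groups with no rows
# satisfying Val2 < 5 simply get NaN instead of raising a warning.
med_all = df.groupby(['Grp1','Grp2','Grp3'])['Val1'].median().rename('Median_all')
med_lt5 = (df[df.Val2 < 5]
           .groupby(['Grp1','Grp2','Grp3'])['Val1']
           .median()
           .rename('Median_lt_5'))
df_agg2 = pd.concat([med_all, med_lt5], axis=1).reset_index()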

answered Sep 23 '22 by Brad Miller