I have a dataset with 12 million rows, with 3 columns as unique identifiers and another 2 columns with values. I'm trying to do a rather simple task:
- group by the three identifiers. This yields about 2.6 million unique combinations
- Task 1: calculate the median for column Val1
- Task 2: calculate the median for column Val1, given some condition on Val2
Here are my results, using pandas and data.table (both latest versions at the moment, on the same machine):
+-----------------+-----------------+------------+
| | pandas | data.table |
+-----------------+-----------------+------------+
| TASK 1 | 150 seconds | 4 seconds |
| TASK 1 + TASK 2 | doesn't finish | 5 seconds |
+-----------------+-----------------+------------+
I think I may be doing something wrong with pandas - transforming Grp1 and Grp2 into categories didn't help a lot, nor did switching between .agg and .apply. Any ideas?
Below is the reproducible code.
Dataframe generation:
import numpy as np
import pandas as pd
from collections import OrderedDict
import time
np.random.seed(123)
list1 = list(pd.util.testing.rands_array(10, 750))
list2 = list(pd.util.testing.rands_array(10, 700))
list3 = list(np.random.randint(100000,200000,5))
N = 12 * 10**6 # please make sure you have enough RAM
df = pd.DataFrame({'Grp1': np.random.choice(list1, N, replace = True),
                   'Grp2': np.random.choice(list2, N, replace = True),
                   'Grp3': np.random.choice(list3, N, replace = True),
                   'Val1': np.random.randint(0, 100, N),
                   'Val2': np.random.randint(0, 10, N)})
# this works and shows there are 2,625,000 unique combinations
df_test = df.groupby(['Grp1','Grp2','Grp3']).size()
print(df_test.shape[0]) # 2,625,000 rows
# export to feather so that same df goes into R
df.to_feather('file.feather')
Task 1 in Python:
# TASK 1: 150 seconds (sorted / not sorted doesn't seem to matter)
df.sort_values(['Grp1','Grp2','Grp3'], inplace = True)
t0 = time.time()
df_agg1 = df.groupby(['Grp1','Grp2','Grp3']).agg({'Val1':[np.median]})
t1 = time.time()
print("Duration for complex: %s seconds ---" % (t1 - t0))
Task 1 + Task 2 in Python:
# TASK 1 + TASK 2: this kept running for 10 minutes to no avail
# (sorted / not sorted doesn't seem to matter)
def f(x):
    d = OrderedDict()
    d['Median_all'] = np.median(x['Val1'])
    d['Median_lt_5'] = np.median(x['Val1'][x['Val2'] < 5])
    return pd.Series(d)
t0 = time.time()
df_agg2 = df.groupby(['Grp1','Grp2','Grp3']).apply(f)
t1 = time.time()
print("Duration for complex: %s seconds ---" % (t1 - t0)) # didn't complete
Equivalent R code:
library(data.table)
library(feather)
DT = setDT(feater("file.feather"))
system.time({
  DT_agg <- DT[, .(Median_all = median(Val1),
                   Median_lt_5 = median(Val1[Val2 < 5])), by = c('Grp1','Grp2','Grp3')]
}) # 5 seconds
I can't reproduce your R results. I fixed the typo where you misspelled feather, but I get the following:
Error in `[.data.table`(DT, , .(Median_all = median(Val1), Median_lt_5 = median(Val1[Val2 < :
column or expression 1 of 'by' or 'keyby' is type NULL. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]
As to the Python example, if you want to get the median for each group where Val2 is less than 5, then you should filter first, as in:
df[df.Val2 < 5].groupby(['Grp1','Grp2','Grp3'])['Val1'].median()
This completes in under 8 seconds on my macbook pro.
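A rough sketch of how both of your tasks could then be combined along these lines (one way to assemble it, not timed as carefully): compute each median with a plain groupby aggregation and align them on the group keys:
# Sketch: compute the unconditional and the filtered median separately with
# plain groupby aggregations, then align them on the group keys. Groups with
# no rows where Val2 < 5 get NaN for Median_lt_5.
keys = ['Grp1', 'Grp2', 'Grp3']
median_all  = df.groupby(keys)['Val1'].median().rename('Median_all')
median_lt_5 = df[df.Val2 < 5].groupby(keys)['Val1'].median().rename('Median_lt_5')
df_agg2 = pd.concat([median_all, median_lt_5], axis=1).reset_index()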