Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dask DataFrame Groupby Partitions

I have some fairly large csv files (~10gb) and would like to take advantage of dask for analysis. However, depending on the number of partitions I set the dask object to read in with, my groupby results change. My understanding was that dask took advantage of the partitions for out-of-core processing benefits, but that it would still return appropriate groupby output. This doesn't seem to be the case and I'm struggling to work out what alternate settings are necessary. Below is a small example:

df = pd.DataFrame({'A': np.arange(100), 'B': np.random.randn(100), 'C': np.random.randn(100), 'Grp1': np.repeat([1, 2], 50), 'Grp2': [3, 4, 5, 6], 25)})

test_dd1 = dd.from_pandas(df, npartitions=1)
test_dd2 = dd.from_pandas(df, npartitions=2)
test_dd5 = dd.from_pandas(df, npartitions=5)
test_dd10 = dd.from_pandas(df, npartitions=10)
test_dd100 = dd.from_pandas(df, npartitions=100)

def test_func(x):
    x['New_Col'] = len(x[x['B'] > 0.]) / len(x['B'])
    return x

test_dd1.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A               B               C Grp1 Grp2 New_Col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48

test_dd2.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A               B              C  Grp1 Grp2 New_Col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48

test_dd5.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A               B              C  Grp1 Grp2 New_Col
0  0 -0.561376 -1.422286     1     3     0.45
1  1 -1.107799  1.075471     1     3     0.45
2  2 -0.719420 -0.574381     1     3     0.45
3  3 -1.287547 -0.749218     1     3     0.45
4  4  0.677617 -0.908667     1     3     0.45

test_dd10.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A               B              C  Grp1 Grp2 New_Col
0  0 -0.561376 -1.422286     1     3      0.5
1  1 -1.107799  1.075471     1     3      0.5
2  2 -0.719420 -0.574381     1     3      0.5
3  3 -1.287547 -0.749218     1     3      0.5
4  4  0.677617 -0.908667     1     3      0.5

test_dd100.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A               B              C  Grp1 Grp2  New_Col
0  0 -0.561376 -1.422286     1     3        0
1  1 -1.107799  1.075471     1     3        0
2  2 -0.719420 -0.574381     1     3        0
3  3 -1.287547 -0.749218     1     3        0
4  4  0.677617 -0.908667     1     3        1

df.groupby(['Grp1', 'Grp2']).apply(test_func).head()
   A               B               C Grp1 Grp2 New_Col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48

Does the groupby step only operate within each partition rather than looking over the full dataframe? In this case it's trivial to set npartitions=1 and it doesn't seem to impact performance all that much but since read_csv automatically sets a certain number of partitions how do you setup the call to ensure that groupby results are accurate?

Thanks!

like image 695
Bhage Avatar asked Feb 06 '16 00:02

Bhage


People also ask

How many partitions should I have Dask?

You should aim for partitions that have around 100MB of data each. Additionally, reducing partitions is very helpful just before shuffling, which creates n log(n) tasks relative to the number of partitions. DataFrames with less than 100 partitions are much easier to shuffle than DataFrames with tens of thousands.

How do I partition a Dask DataFrame?

Repartition the DataFrame into two partitions. repartition(2) causes Dask to combine partition 1 and partition 2 into a single partition. Dask's repartition algorithm is smart to coalesce existing partitions and avoid full data shuffles. You can also increase the number of partitions with repartition.

Is Dask faster than pandas?

Let's start with the simplest operation — read a single CSV file. To my surprise, we can already see a huge difference in the most basic operation. Datatable is 70% faster than pandas while dask is 500% faster! The outcomes are all sorts of DataFrame objects which have very identical interfaces.

Is Dask apply parallel?

Dask helps developers scale their entire Python ecosystem, and it can work with your laptop or a container cluster. Dask is a one-stop solution for larger-than-memory data sets, as it provides multicore and distributed parallel execution.


1 Answers

I am surprised by this result. Groupby.apply should return the same results regardless of the number of partitions. If you can supply a reproducible example I encourage you to raise an issue and one of the developers will take a look.

like image 128
MRocklin Avatar answered Sep 20 '22 20:09

MRocklin