I would like to scale some operations I do on a pandas DataFrame using dask 2.14. For example, I would like to apply a shift on a column of a dataframe:
import dask.dataframe as dd
data = dd.read_csv('some_file.csv')
data.set_index('column_A')
data['column_B'] = data.groupby(['column_A'])['column_B'].shift(-1)
but I get AttributeError: 'SeriesGroupBy' object has no attribute 'shift'
I read the dask documentation and saw that there is no such method (while there is one in pandas).
Can you suggest a valid alternative?
Thank you
There is an open ticket about this on GitHub. Essentially, you will have to use apply to get around it. I'm not sure whether this carries performance implications in dask. There is a further ticket referencing the issue and stating that it lies in pandas, but it's been open for some time.
This should be equivalent to the pandas operation:
import dask.dataframe as dd
import pandas as pd
import random

# Small example frame: a numeric column 'a' and a grouping column 'b'
df = pd.DataFrame({'a': list(range(10)),
                   'b': random.choices(['x', 'y'], k=10)})

print("####### PANDAS ######")
print("Initial df")
print(df.head(10))
print("................")

# Shift 'a' by -1 within each group, using apply instead of the missing shift
pandas_df = df.copy()
pandas_df['a'] = pandas_df.groupby(['b'])['a'].apply(lambda x: x.shift(-1))
print("Final df")
print(pandas_df.head(10))
print()

print("####### DASK ######")
print("Initial df")
dask_df = dd.from_pandas(df, npartitions=1).reset_index()
print(dask_df.head(10))
print("................")

# Same workaround on the dask dataframe
dask_df['a'] = dask_df.groupby(['b'])['a'].apply(lambda x: x.shift(-1))
print("Final df")
print(dask_df.head(10))
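As a side note, dask generally can't infer the output type of a custom apply, so the dask line above may print a warning asking for meta. If it does, one way to silence it (a minimal sketch; the float64 dtype is an assumption based on shift(-1) introducing NaN into the integer column) is to pass meta explicitly:

# Passing meta tells dask the name and dtype of the result of the apply,
# so it doesn't have to guess them from a sample computation.
# float64 is assumed here because shift(-1) inserts NaN into column 'a'.
dask_df['a'] = dask_df.groupby(['b'])['a'].apply(
    lambda x: x.shift(-1),
    meta=('a', 'float64')
)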
I obviously can't benchmark the apply approach in dask, since there seems to be no alternative to compare it against. However, I can in pandas:
import string
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': list(range(100000)),
                   'b': np.random.choice(list(string.ascii_lowercase), 100000)})

def normal_way(df):
    df = df.groupby(['b'])['a'].shift(-1)

def apply_way(df):
    df = df.groupby(['b'])['a'].apply(lambda x: x.shift(-1))
The timeit results are:
%timeit normal_way(df)
4.25 ms ± 98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit apply_way(df)
15 ms ± 446 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
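For completeness, if you want to time the dask apply itself (rather than compare it against a shift that doesn't exist there), keep in mind that dask is lazy, so nothing runs until you call compute(). A minimal sketch, reusing the dask_df built above, could look like this:

import timeit

def dask_apply_way(ddf):
    # .compute() forces dask to actually execute the lazy groupby/apply graph
    return ddf.groupby(['b'])['a'].apply(lambda x: x.shift(-1)).compute()

# Total time for 10 runs, in seconds
print(timeit.timeit(lambda: dask_apply_way(dask_df), number=10))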