Dask dataframe - split column into multiple rows based on delimiter

Tags: performance, python, pandas, dask

What is an efficient way of splitting a column into multiple rows using dask dataframe? For example, let's say I have a csv file which I read using dask to produce the following dask dataframe:

id var1 var2
1  A    Z,Y
2  B    X
3  C    W,U,V

I would like to convert it to:

id var1 var2
1  A    Z
1  A    Y
2  B    X
3  C    W
3  C    U
3  C    V

I have looked into the answers for "Split (explode) pandas dataframe string entry to separate rows" and "pandas: How do I split text in a column into multiple rows?".

I tried applying the answer given in https://stackoverflow.com/a/17116976/7275290 but dask does not appear to accept the expand keyword in str.split.

I also tried applying the vectorized approach suggested in https://stackoverflow.com/a/40449726/7275290 but then found out that np.repeat isn't implemented in dask with integer arrays (https://github.com/dask/dask/issues/2946).

I tried out a few other methods in pandas but they were really slow - might be faster with dask but I wanted to check first if anyone had success with any particular method. I'm working with a dataset with over 10 million rows and 10 columns (string data). After splitting into rows it'll probably become ~50 million rows.

Thank you for looking into this! I appreciate it.

asked Jan 19 '19 22:01

ltt

1 Answer

Dask allows you to use pandas directly for operations that are row-wise (like this) or can be applied one partition at a time. Remember that a Dask dataframe consists of a set of Pandas dataframes.

For the Pandas case you would do this, based on the linked questions:

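import pandas as pd

df = pd.DataFrame([["A", "Z,Y"], ["B", "X"], ["C", "W,U,V"]],
    columns=['var1', 'var2'])
# Split var2 into one column per piece, stack the pieces into rows,
# then join them back onto the remaining columns by the original index.
df.drop('var2', axis=1).join(
    df.var2.str.split(',', expand=True).stack().reset_index(drop=True, level=1).rename('var2'))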

For Dask you can apply exactly the same method via map_partitions, because each row is independent of all the others. This might look cleaner if the function passed were written out separately rather than as a lambda:

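import dask.dataframe as dd

d = dd.from_pandas(df, npartitions=2)
# Each partition is a pandas dataframe, so the pandas expression applies to it unchanged.
d.map_partitions(
    lambda df: df.drop('var2', axis=1).join(
        df.var2.str.split(',', expand=True).stack().reset_index(drop=True, level=1).rename('var2')))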
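As a rough sketch of the "written out separately" form, the lambda could be replaced by a named function like the one below. The name split_var2 and its column and sep parameters are made up for illustration, the sample frame adds the id column from the question, and the extra dropna() is only a precaution in case a pandas version keeps the padding values produced by expand=True when stacking.

```python
import pandas as pd


def split_var2(pdf, column="var2", sep=","):
    """Return a frame with one row per delimited value in `column` (hypothetical helper)."""
    pieces = (
        pdf[column]
        .str.split(sep, expand=True)       # one column per piece, padded with missing values
        .stack()                           # long format, indexed by (row, piece position)
        .dropna()                          # precaution: discard any padding that survived
        .reset_index(level=1, drop=True)   # keep only the original row index
        .rename(column)
    )
    # Joining on the index repeats the other columns once per piece.
    return pdf.drop(columns=column).join(pieces)


df = pd.DataFrame(
    [[1, "A", "Z,Y"], [2, "B", "X"], [3, "C", "W,U,V"]],
    columns=["id", "var1", "var2"],
)
result = split_var2(df)
print(result)
```

The same function should then be usable per partition as d.map_partitions(split_var2), which is equivalent to the lambda version above.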

If you called .compute() on the result of map_partitions, you would get exactly the same output as for the pandas case above. You will probably not want to compute your massive dataframe in one go like that, though, but rather perform further processing on it first.
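A different technique from the one above, assuming pandas 0.25 or later where DataFrame.explode is available, is to split the strings into lists and explode the list column. This is only a sketch: explode_var2 and its parameters are hypothetical names, and the sample frame mirrors the one in the question.

```python
import pandas as pd


def explode_var2(pdf, column="var2", sep=","):
    """Split `column` on `sep` and emit one row per piece (hypothetical helper)."""
    # str.split without expand=True yields a list per row.
    as_lists = pdf.assign(**{column: pdf[column].str.split(sep)})
    # explode turns each list element into its own row, repeating the other columns.
    return as_lists.explode(column)


df = pd.DataFrame(
    [[1, "A", "Z,Y"], [2, "B", "X"], [3, "C", "W,U,V"]],
    columns=["id", "var1", "var2"],
)
exploded = explode_var2(df)
print(exploded)
```

Because this is again a row-wise pandas operation, it should also be applicable one partition at a time with d.map_partitions(explode_var2).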

answered Oct 18 '22 04:10

mdurant
