
Dask dataframe - split column into multiple rows based on delimiter

What is an efficient way of splitting a column into multiple rows using dask dataframe? For example, let's say I have a csv file which I read using dask to produce the following dask dataframe:

id var1 var2
1  A    Z,Y
2  B    X
3  C    W,U,V

I would like to convert it to:

id var1 var2
1  A    Z
1  A    Y
2  B    X
3  C    W
3  C    U
3  C    V

I have looked into the answers for "Split (explode) pandas dataframe string entry to separate rows" and "pandas: How do I split text in a column into multiple rows?".

I tried applying the answer given in https://stackoverflow.com/a/17116976/7275290 but dask does not appear to accept the expand keyword in str.split.

I also tried applying the vectorized approach suggested in https://stackoverflow.com/a/40449726/7275290 but then found out that np.repeat isn't implemented in dask with integer arrays (https://github.com/dask/dask/issues/2946).

I tried out a few other methods in pandas but they were really slow - might be faster with dask but I wanted to check first if anyone had success with any particular method. I'm working with a dataset with over 10 million rows and 10 columns (string data). After splitting into rows it'll probably become ~50 million rows.

Thank you for looking into this! I appreciate it.

asked Jan 19 '19 22:01 by ltt



1 Answer

Dask allows you to use pandas directly for operations that are row-wise (like this) or can be applied one partition at a time. Remember that a Dask dataframe consists of a set of Pandas dataframes.

For the Pandas case you would do this, based on the linked questions:

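import pandas as pd

df = pd.DataFrame([["A", "Z,Y"], ["B", "X"], ["C", "W,U,V"]],
    columns=['var1', 'var2'])
# Split var2 into columns, stack them into one row per value, then join back on the original index
df.drop('var2', axis=1).join(
    df.var2.str.split(',', expand=True).stack().reset_index(drop=True, level=1).rename('var2'))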

So for Dask you can apply exactly the same method via map_partitions, because each row is independent of all others. This would probably look cleaner if the function passed were written out separately, rather than as a lambda:

import dask.dataframe as dd

d = dd.from_pandas(df, npartitions=2)
d.map_partitions(
    lambda df: df.drop('var2', axis=1).join(
        df.var2.str.split(',', expand=True).stack().reset_index(drop=True, level=1).rename('var2')))

If you did .compute() on this, you would get exactly the same result as for the Pandas case above. You will likely not want to compute your massive dataframe in one go like that, but rather perform further processing on it first.

answered Oct 18 '22 04:10 by mdurant