How do I change rows and columns in a dask dataframe?

Tags:

dask

I am having a few issues with Dask DataFrames.

Let's say I have a dataframe with two columns, ['a', 'b'], and I want a new column c = a + b.

In pandas I would do:

df['c'] = df['a'] + df['b']

In dask I am doing the same operation as follows:

df = df.assign(c=(df.a + df.b).compute())

Is it possible to write this operation in a better way, similar to what we do in pandas?

My second question is troubling me more.

In pandas, if I want to change the value of 'a' for rows 2 and 6 to np.pi, I do the following:

df.loc[[2, 6], 'a'] = np.pi

I have not been able to figure out how to do a similar operation in Dask. My logic selects some rows and I only want to change values in those rows.


asked Sep 02 '15 21:09

1 Answer

Edit: Add new columns

Setitem syntax now works in dask.dataframe:

df['z'] = df.x + df.y

Old answer: Add new columns

You're correct that the setitem syntax doesn't work in dask.dataframe.

df['c'] = ... # mutation not supported

As you suggest, you should instead use .assign(...).

df = df.assign(c=df.a + df.b)

In your example you have an unnecessary call to .compute(). Generally you want to call compute only at the very end, once you have your final result.

Change rows

As before, dask.dataframe does not support changing rows in place. In-place operations are difficult to reason about in parallel code. At the moment dask.dataframe has no nice alternative operation in this case. I've raised issue #653 for conversation on this topic.


answered Oct 16 '22 07:10

MRocklin
