There are a few issues I am having with Dask DataFrames.
Let's say I have a DataFrame with two columns, ['a', 'b'].
If I want a new column c = a + b, in pandas I would do:
df['c'] = df['a'] + df['b']
In Dask I am doing the same operation as follows:
df = df.assign(c=(df.a + df.b).compute())
Is it possible to write this operation in a better way, similar to what we do in pandas?
The second question is troubling me more. In pandas, if I want to change the value of 'a' for rows 2 and 6 to np.pi, I do the following:
df.loc[[2, 6], 'a'] = np.pi
I have not been able to figure out how to do a similar operation in Dask. My logic selects some rows, and I only want to change values in those rows.
Just like pandas, Dask DataFrame supports label-based indexing with the .loc accessor for selecting rows or columns, and __getitem__ (square brackets) for selecting columns only. To select rows, the DataFrame's divisions must be known (see Internal Design and Dask DataFrames Best Practices for more information).
Setitem syntax now works in dask.dataframe:
df['z'] = df.x + df.y
You're correct that the setitem syntax didn't work in dask.dataframe at the time:
df['c'] = ...  # mutation not supported
As you suggest, you should instead use .assign(...):
df = df.assign(c=df.a + df.b)
In your example you have an unnecessary call to .compute(). Generally you want to call compute only at the very end, once you have your final result.
As before, dask.dataframe does not support changing rows in place. In-place operations are difficult to reason about in parallel codes. At the moment dask.dataframe has no nice alternative operation in this case. I've raised issue #653 for conversation on this topic.