I would like to add a new column to an existing Dask dataframe based on the values of two existing columns, with a conditional statement for checking nulls:
DataFrame definition
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, "", 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)
Method-1 tried
def funcUpdate(row):
    if row['y'].isnull():
        return row['y']
    else:
        return round((1 + row['x'])/(1 + 1/row['y']), 4)
ddf = ddf.assign(z=ddf.apply(funcUpdate, axis=1, meta=ddf))
It gives an error:
TypeError: Column assignment doesn't support type DataFrame
Method-2
ddf = ddf.assign(z=ddf.apply(lambda col: col.y if col.y.isnull() else round((1 + col.x)/(1 + 1/col.y), 4), axis=1, meta=ddf))
Any idea how it should be done?
You can either use fillna (fast) or you can use apply (slow but flexible).
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)
ddf['z'] = ddf.y.fillna((100 + ddf.x))
>>> df
x y
0 1 0.200
1 2 NaN
2 3 0.345
3 4 0.400
4 5 0.150
>>> ddf.compute()
x y z
0 1 0.200 0.200
1 2 NaN 102.000
2 3 0.345 0.345
3 4 0.400 0.400
4 5 0.150 0.150
Of course, in this case, because your function returns y when y is null, the result would be null as well. I'm assuming that you didn't intend this, so I changed the output slightly.
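If you do want the formula from your question with nulls simply passed through, a fully vectorized expression (no apply at all) may be the simplest option. Here is a minimal sketch, assuming NaN is the null marker, so that NaN propagates through the arithmetic on its own:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)

# Vectorized column arithmetic; rows where y is NaN stay NaN in z
ddf['z'] = ((1 + ddf.x) / (1 + 1 / ddf.y)).round(4)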
As any Pandas expert will tell you, using apply comes with a 10x to 100x slowdown penalty. Please beware.
That being said, the flexibility is useful. Your example almost works, except that you are providing improper metadata. You are telling apply that the function produces a dataframe, when in fact I think that your function was intended to produce a series. You can have Dask guess the meta information for you (although it will complain) or you can specify the dtype explicitly. Both options are shown in the example below:
In [1]: import pandas as pd
...:
...: import dask.dataframe as dd
...: df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
...: ddf = dd.from_pandas(df, npartitions=2)
...:
In [2]: def func(row):
...: if pd.isnull(row['y']):
...: return row['x'] + 100
...: else:
...: return row['y']
...:
In [3]: ddf['z'] = ddf.apply(func, axis=1)
/home/mrocklin/Software/anaconda/lib/python3.4/site-packages/dask/dataframe/core.py:2553: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
warnings.warn(msg)
In [4]: ddf.compute()
Out[4]:
x y z
0 1 0.200 0.200
1 2 NaN 102.000
2 3 0.345 0.345
3 4 0.400 0.400
4 5 0.150 0.150
In [5]: ddf['z'] = ddf.apply(func, axis=1, meta=float)
In [6]: ddf.compute()
Out[6]:
x y z
0 1 0.200 0.200
1 2 NaN 102.000
2 3 0.345 0.345
3 4 0.400 0.400
4 5 0.150 0.150
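For completeness, the same idea applied to the formula from your question. This is a sketch assuming pd.isnull() replaces the row['y'].isnull() call (which fails on a scalar) and that meta is given as a (name, dtype) tuple, matching the form suggested by the warning above for a series result:

def funcUpdate(row):
    # Scalar null check; row['y'] is a plain value here, not a Series
    if pd.isnull(row['y']):
        return row['y']
    return round((1 + row['x']) / (1 + 1 / row['y']), 4)

ddf['z'] = ddf.apply(funcUpdate, axis=1, meta=('z', 'f8'))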