I would like to add a new column to an existing Dask dataframe based on the values of two existing columns, with a conditional statement for checking nulls:
DataFrame definition
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, "", 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)
Method-1 tried
def funcUpdate(row):
    if row['y'].isnull():
        return row['y']
    else:
        return round((1 + row['x'])/(1 + 1/row['y']), 4)
ddf = ddf.assign(z=ddf.apply(funcUpdate, axis=1, meta=ddf))
It gives an error:
TypeError: Column assignment doesn't support type DataFrame
Method-2
ddf = ddf.assign(z=ddf.apply(lambda col: col.y if col.y.isnull() else round((1 + col.x)/(1 + 1/col.y), 4), axis=1, meta=ddf))
Any idea how it should be done?
You can either use fillna (fast) or you can use apply (slow but flexible).
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)
ddf['z'] = ddf.y.fillna((100 + ddf.x))
>>> df
x y
0 1 0.200
1 2 NaN
2 3 0.345
3 4 0.400
4 5 0.150
>>> ddf.compute()
x y z
0 1 0.200 0.200
1 2 NaN 102.000
2 3 0.345 0.345
3 4 0.400 0.400
4 5 0.150 0.150
Of course, in this case, because your function returns y when y is null, the result would be null as well. I'm assuming that you didn't intend this, so I changed the output slightly.
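If you do want the formula from your question with nulls simply passed through, a fully vectorized expression (no apply at all) may be the simplest option. Here is a minimal sketch, assuming NaN is the null marker, so that NaN propagates through the arithmetic on its own:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)

# Vectorized column arithmetic; rows where y is NaN stay NaN in z
ddf['z'] = ((1 + ddf.x) / (1 + 1 / ddf.y)).round(4)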
As any Pandas expert will tell you, using apply comes with a 10x to 100x slowdown penalty. Please beware.
That being said, the flexibility is useful. Your example almost works, except that you are providing improper metadata. You are telling apply that the function produces a dataframe, when in fact I think that your function was intended to produce a series. You can have Dask guess the meta information for you (although it will complain) or you can specify the dtype explicitly. Both options are shown in the example below:
In [1]: import pandas as pd
...:
...: import dask.dataframe as dd
...: df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
...: ddf = dd.from_pandas(df, npartitions=2)
...:
In [2]: def func(row):
...: if pd.isnull(row['y']):
...: return row['x'] + 100
...: else:
...: return row['y']
...:
In [3]: ddf['z'] = ddf.apply(func, axis=1)
/home/mrocklin/Software/anaconda/lib/python3.4/site-packages/dask/dataframe/core.py:2553: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
warnings.warn(msg)
In [4]: ddf.compute()
Out[4]:
x y z
0 1 0.200 0.200
1 2 NaN 102.000
2 3 0.345 0.345
3 4 0.400 0.400
4 5 0.150 0.150
In [5]: ddf['z'] = ddf.apply(func, axis=1, meta=float)
In [6]: ddf.compute()
Out[6]:
x y z
0 1 0.200 0.200
1 2 NaN 102.000
2 3 0.345 0.345
3 4 0.400 0.400
4 5 0.150 0.150
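For completeness, the same idea applied to the formula from your question. This is a sketch assuming pd.isnull() replaces the row['y'].isnull() call (which fails on a scalar) and that meta is given as a (name, dtype) tuple, matching the form suggested by the warning above for a series result:

def funcUpdate(row):
    # Scalar null check; row['y'] is a plain value here, not a Series
    if pd.isnull(row['y']):
        return row['y']
    return round((1 + row['x']) / (1 + 1 / row['y']), 4)

ddf['z'] = ddf.apply(funcUpdate, axis=1, meta=('z', 'f8'))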