Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Assign (add) a new column to a dask dataframe based on values of 2 existing columns - involves a conditional statement

I would like to add a new column to an existing dask dataframe based on the values of the 2 existing columns and involves a conditional statement for checking nulls:

DataFrame definition

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, "", 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df1, npartitions=2)

Method-1 tried

def funcUpdate(row):
    if row['y'].isnull():
        return row['y']
    else:
        return  round((1 + row['x'])/(1+ 1/row['y']),4)

ddf = ddf.assign(z= ddf.apply(funcUpdate, axis=1 , meta = ddf))

It gives an error:

TypeError: Column assignment doesn't support type DataFrame

Method-2

ddf = ddf.assign(z = ddf.apply(lambda col: col.y if col.y.isnull() else  round((1 + col.x)/(1+ 1/col.y),4),axis = 1, meta = ddf))

Any idea how it should be done ?

like image 281
ML_Passion Avatar asked Feb 13 '17 19:02

ML_Passion


People also ask

How do you create a new column in pandas DataFrame based on other columns?

Using apply() method If you need to apply a method over an existing column in order to compute some values that will eventually be added as a new column in the existing DataFrame, then pandas. DataFrame. apply() method should do the trick.

How do you add a column to a DataFrame to another data frame?

After extraction, the column needs to be simply added to the second dataframe using join() function. This function needs to be called with reference to the dataframe in which the column has to be added and the variable name which stores the extracted column name has to be passed to it as the argument.

How do I add a conditional column in pandas?

You can create a conditional column in pandas DataFrame by using np. where() , np. select() , DataFrame. map() , DataFrame.


1 Answers

You can either use fillna (fast) or you can use apply (slow but flexible)

Fillna

import pandas as pd

import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)

ddf['z'] = ddf.y.fillna((100 + ddf.x))

>>> df

   x      y
0  1  0.200
1  2    NaN
2  3  0.345
3  4  0.400
4  5  0.150

>>> ddf.compute()

   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150

Of course in this case though because your function uses y if y is a null, the result will be null as well. I'm assuming that you didn't intend this, so I changed the output slightly.

Use apply

As any Pandas expert will tell you, using apply comes with a 10x to 100x slowdown penalty. Please beware.

That being said, the flexibility is useful. Your example almost works, except that you are providing improper metadata. You are telling apply that the function produces a dataframe, when in fact I think that your function was intended to produce a series. You can have Dask guess the meta information for you (although it will complain) or you can specify the dtype explicitly. Both options are shown in the example below:

In [1]: import pandas as pd
   ...: 
   ...: import dask.dataframe as dd
   ...: df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
   ...: ddf = dd.from_pandas(df, npartitions=2)
   ...: 

In [2]: def func(row):
   ...:     if pd.isnull(row['y']):
   ...:         return row['x'] + 100
   ...:     else:
   ...:         return row['y']
   ...:     

In [3]: ddf['z'] = ddf.apply(func, axis=1)
/home/mrocklin/Software/anaconda/lib/python3.4/site-packages/dask/dataframe/core.py:2553: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)

In [4]: ddf.compute()
Out[4]: 
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150

In [5]: ddf['z'] = ddf.apply(func, axis=1, meta=float)

In [6]: ddf.compute()
Out[6]: 
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150
like image 109
MRocklin Avatar answered Sep 28 '22 00:09

MRocklin