Updating the values of a column in a dask dataframe based on some condition on some other columns


We have a very large CSV file that has been imported as a dask dataframe. Here is a small example to explain the question.

import dask.dataframe as dd
df = dd.read_csv("name and path of the file.csv")
df.head()

output

col1 | col2 | col3 | col4
22   | NaN  | 23   | 56
12   | 54   | 22   | 36
48   | NaN  | 2    | 45
76   | 32   | 13   | 6
23   | NaN  | 43   | 8
67   | 54   | 56   | 64
16   | 32   | 32   | 6
3    | 54   | 64   | 8
67   | NaN  | 23   | 64

I want to replace the value of col4 with the value of col1 wherever col4 < col1 and col2 is not NaN.

So the result should be:

col1 | col2 | col3 | col4
22   | NaN  | 23   | 56
12   | 54   | 22   | 36
48   | NaN  | 2    | 45
76   | 32   | 13   | 76
23   | NaN  | 43   | 8
67   | 54   | 56   | 67
16   | 32   | 32   | 16
3    | 54   | 64   | 8
67   | NaN  | 23   | 64

I know how to do it in pandas:

import pandas as pd

# Index labels of the rows where col4 < col1 and col2 is not NaN
condition = df[(df['col4'] < df['col1']) & (pd.notnull(df['col2']))].index

df.loc[condition, 'col4'] = df.loc[condition, 'col1'].values
Monirrad asked Jan 22 '19 06:01




1 Answer

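I think you need a boolean mask instead of an index. In pandas:

condition = (df['col4'] < df['col1']) & df['col2'].notnull()
df.loc[condition, 'col4'] = df.loc[condition, 'col1']

As far as I know, dask dataframes do not support item assignment through .loc, so on a dask dataframe build the same condition and use dask.dataframe.Series.mask:

df['col4'] = df['col4'].mask(condition, df['col1'])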

# compute() materializes the dask dataframe as a pandas dataframe for printing
print(df.compute())
   col1  col2  col3  col4
0    22   NaN    23    56
1    12  54.0    22    36
2    48   NaN     2    45
3    76  32.0    13    76
4    23   NaN    43     8
5    67  54.0    56    67
6    16  32.0    32    16
7     3  54.0    64     8
8    67   NaN    23    64
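
For completeness, here is a minimal sketch of the same update written as a per-partition function. It is only an illustration: the function name raise_col4_to_col1 and the in-memory sample frame are made up for this example and are not part of the original question. The function itself is plain pandas, so it can be checked on a small frame first; on a dask dataframe it could then be applied with map_partitions, since each partition is an ordinary pandas dataframe.

```python
import numpy as np
import pandas as pd


def raise_col4_to_col1(part: pd.DataFrame) -> pd.DataFrame:
    """Set col4 to col1 where col4 < col1 and col2 is not NaN."""
    part = part.copy()  # leave the caller's frame untouched
    condition = (part["col4"] < part["col1"]) & part["col2"].notna()
    # mask() replaces values where the condition is True
    part["col4"] = part["col4"].mask(condition, part["col1"])
    return part


# Small in-memory copy of the sample data from the question
sample = pd.DataFrame(
    {
        "col1": [22, 12, 48, 76, 23, 67, 16, 3, 67],
        "col2": [np.nan, 54, np.nan, 32, np.nan, 54, 32, 54, np.nan],
        "col3": [23, 22, 2, 13, 43, 56, 32, 64, 23],
        "col4": [56, 36, 45, 6, 8, 64, 6, 8, 64],
    }
)

result = raise_col4_to_col1(sample)
print(result)

# On a dask dataframe (assumed to be named ddf, not run here):
#     ddf = ddf.map_partitions(raise_col4_to_col1)
```

Because the condition only compares values within the same row, applying it partition by partition should give the same result as the mask call above; which form to use is mostly a matter of taste.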
jezrael answered Sep 29 '22 08:09
