Updating the values of a column in a dask dataframe based on some condition on some other columns

Q: How do you replace all values in a DataFrame based on a condition?

You can replace values in all or selected columns of a pandas DataFrame based on a condition by using the DataFrame.loc[] property. loc[] accesses a group of rows and columns by label(s) or a boolean array, and it can be used both to read and to modify the values of the DataFrame.
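As a minimal sketch of that idea in plain pandas, the snippet below rebuilds the sample table from the question further down (the variable name `pdf` is an assumption made here for illustration) and uses a boolean mask with loc[] to overwrite col4 only on the matching rows.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table; `pdf` is an illustrative name.
pdf = pd.DataFrame({
    "col1": [22, 12, 48, 76, 23, 67, 16, 3, 67],
    "col2": [np.nan, 54, np.nan, 32, np.nan, 54, 32, 54, np.nan],
    "col3": [23, 22, 2, 13, 43, 56, 32, 64, 23],
    "col4": [56, 36, 45, 6, 8, 64, 6, 8, 64],
})

# Boolean mask: True where col4 < col1 and col2 is not NaN.
condition = (pdf["col4"] < pdf["col1"]) & pdf["col2"].notna()

# Assign col1 into col4 only on the rows selected by the mask.
pdf.loc[condition, "col4"] = pdf.loc[condition, "col1"]

print(pdf)
```

Because the mask is itself a boolean Series aligned on the index, there is no need to extract `.index` first.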

Q: What is Npartitions in Dask?

The npartitions property is the number of pandas DataFrames that compose a single Dask DataFrame. This affects performance in two main ways. If you don't have enough partitions, you may not be able to use all of your cores effectively. For example if your dask.

We have a very large CSV file that has been imported as a Dask DataFrame. Here is a small example to illustrate the question.

import dask.dataframe as dd
df = dd.read_csv("name and path of the file.csv")
df.head()

Output:

col1 | col2 | col3 | col4
22   | NaN  | 23   | 56
12   | 54   | 22   | 36
48   | NaN  | 2    | 45
76   | 32   | 13   | 6
23   | NaN  | 43   | 8
67   | 54   | 56   | 64
16   | 32   | 32   | 6
3    | 54   | 64   | 8
67   | NaN  | 23   | 64

I want to replace the value of col4 with the value of col1 if col4 < col1 and col2 is not NaN.

So the result should be:

col1 | col2 | col3 | col4
22   | NaN  | 23   | 56
12   | 54   | 22   | 36
48   | NaN  | 2    | 45
76   | 32   | 13   | 76
23   | NaN  | 43   | 8
67   | 54   | 56   | 67
16   | 32   | 32   | 16
3    | 54   | 64   | 8
67   | NaN  | 23   | 64

I know how to do it in pandas:

import pandas as pd
condition = df[(df['col4'] < df['col1']) & (pd.notnull(df['col2']))].index
df.loc[condition, 'col4'] = df.loc[condition, 'col1'].values

asked Jan 22 '19 06:01

Monirrad

1 Answer

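I think you need a boolean mask rather than an index. In pandas:

condition = (df['col4'] < df['col1']) & df['col2'].notnull()
df.loc[condition, 'col4'] = df.loc[condition, 'col1']

As far as I know, Dask's loc indexer does not support item assignment, so on a Dask DataFrame use dask.dataframe.Series.mask instead: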

df['col4'] = df['col4'].mask(condition, df['col1'])

print(df.compute())  # compute() returns the result as a pandas DataFrame
   col1  col2  col3  col4
0    22   NaN    23    56
1    12  54.0    22    36
2    48   NaN     2    45
3    76  32.0    13    76
4    23   NaN    43     8
5    67  54.0    56    67
6    16  32.0    32    16
7     3  54.0    64     8
8    67   NaN    23    64

answered Sep 29 '22 08:09

jezrael
