I have the following two data frames:
df1
id V1 V2 V3
210 4 NA 7
220 NA NA NA
230 2 0 1
240 4 NA NA
250 1 9 2
260 6 5 NA
270 0 NA 3
df2
id V1 V2 V3
210 4 3 7
240 4 3 NA
270 0 3 3
df2 contains all the rows where df1 has NA in V2 and at least one numeric value in V1 or V3. Where this condition holds, I have changed the NA in V2 to 3.
I would now like to put these dfs back together. Specifically, I would like to replace all the rows in df1 that appear in df2. My expected output is this:
id V1 V2 V3
210 4 3 7
220 NA NA NA
230 2 0 1
240 4 3 NA
250 1 9 2
260 6 5 NA
270 0 3 3
I have looked at this question, but it does the replacement based on specific values in the df. And this question is similarly answered by specifying the actual values to replace. My real df is huge, and all I want to do is put the two dfs back together, replacing the rows that appear in both with the rows from df2.
A simple match() call that identifies where df2$id appears within df1$id (in the correct appearance order) will solve this problem:
df1[match(df2$id, df1$id), ] <- df2
df1
# id V1 V2 V3
# 1 210 4 3 7
# 2 220 NA NA NA
# 3 230 2 0 1
# 4 240 4 3 NA
# 5 250 1 9 2
# 6 260 6 5 NA
# 7 270 0 3 3
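To see why this works: match(df2$id, df1$id) returns the row positions in df1 at which each of df2's ids is found, so the assignment targets exactly those rows. With the ids shown above:

```r
# Positions of df2's ids (210, 240, 270) within df1's ids:
match(c(210, 240, 270), c(210, 220, 230, 240, 250, 260, 270))
# [1] 1 4 7
```

Indexing df1 with that vector on the left-hand side replaces rows 1, 4, and 7 with the corresponding rows of df2.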
Edit:
As @plafort points out, you could avoid creating df2 in the first place, but I would go with a vectorized approach instead of using apply. For example:
indx <- rowSums(is.na(df1)) != (ncol(df1) - 1) & is.na(df1$V2)
df1[indx, "V2"] <- 3
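As an aside, if you happen to use dplyr (version 1.0.0 or later), rows_update() is designed for exactly this "replace matching rows by key" operation; a minimal sketch, assuming both data frames share the key column id:

```r
library(dplyr)

# rows_update() replaces the rows of df1 whose "id" also appears in df2
# with the corresponding rows of df2; all ids in df2 must exist in df1.
df1 <- rows_update(df1, df2, by = "id")
```

This keeps df1's row order and may read more clearly than the match() indexing on a large data frame.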