Replace rows in one data frame if they appear in another data frame

Tags: r

I have the following two data frames:

df1

id   V1 V2 V3
210  4  NA 7
220  NA NA NA
230  2  0  1
240  4  NA NA
250  1  9  2
260  6  5  NA
270  0  NA 3

df2

id   V1 V2 V3
210  4  3  7
240  4  3  NA
270  0  3  3

df2 contains every row of df1 where V2 is NA and at least one of V1 or V3 is not NA. In those rows, I have changed the NA in V2 to 3.

I would now like to put these dfs back together. Specifically, I would like to replace all the rows in df1 that appear in df2. My expected output is this:

id   V1 V2 V3
210  4  3  7
220  NA NA NA
230  2  0  1
240  4  3  NA
250  1  9  2
260  6  5  NA
270  0  3  3

I have looked at this question, but it does this based on specific values in the df. And this question is similarly answered by specifying the actual values to replace. My real df is huge and all I want to do is put the two dfs together, replacing the rows that appear in both with df2.

szi asked Jun 18 '15 09:06


1 Answer

A simple match call solves this problem. It returns, for each value of df2$id, the position of its match in df1$id (in the order the ids appear in df2), so the matching rows of df1 can be overwritten directly:

df1[match(df2$id, df1$id), ] <- df2
df1
#    id V1 V2 V3
# 1 210  4  3  7
# 2 220 NA NA NA
# 3 230  2  0  1
# 4 240  4  3 NA
# 5 250  1  9  2
# 6 260  6  5 NA
# 7 270  0  3  3
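
For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the question's data and applies the same match idea. The names pos, found and updated are only for illustration. It also assumes that an id in df2 with no counterpart in df1 should be skipped, which the question does not specify; match returns NA for such ids, and as far as I know an NA row index makes the data frame assignment fail.

```r
# Rebuild the question's data so the example is self-contained.
df1 <- data.frame(
  id = c(210, 220, 230, 240, 250, 260, 270),
  V1 = c(4, NA, 2, 4, 1, 6, 0),
  V2 = c(NA, NA, 0, NA, 9, 5, NA),
  V3 = c(7, NA, 1, NA, 2, NA, 3)
)
df2 <- data.frame(
  id = c(210, 240, 270),
  V1 = c(4, 4, 0),
  V2 = c(3, 3, 3),
  V3 = c(7, NA, 3)
)

# For each df2$id, find its row position in df1$id (NA if it is absent).
pos <- match(df2$id, df1$id)

# Assumption (not in the question): ids missing from df1 are skipped.
found <- !is.na(pos)

# Work on a copy so df1 stays available for comparison.
updated <- df1
# Select df2's columns by name so the column order follows df1.
updated[pos[found], ] <- df2[found, names(df1)]

print(updated)
```

If every id in df2 is guaranteed to exist in df1, the guard is unnecessary and the one-liner above is enough.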

Edit: As @plafort points out, you could avoid creating df2 in the first place, but I would use a vectorized approach instead of apply. For example:

indx <- rowSums(is.na(df1)) != (ncol(df1) - 1) & is.na(df1$V2)
df1[indx, "V2"] <- 3
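
A slightly more explicit variant of the same idea, as a sketch: it counts non-NA values over the value columns only, so it does not rely on id being the single non-value column. The names value_cols, has_value and filled are hypothetical, and it assumes every column other than id is a value column.

```r
# Rebuild the question's df1 so the example is self-contained.
df1 <- data.frame(
  id = c(210, 220, 230, 240, 250, 260, 270),
  V1 = c(4, NA, 2, 4, 1, 6, 0),
  V2 = c(NA, NA, 0, NA, 9, 5, NA),
  V3 = c(7, NA, 1, NA, 2, NA, 3)
)

# Assumption: every column except id holds values to inspect.
value_cols <- setdiff(names(df1), "id")

# TRUE for rows that have at least one non-NA value column.
has_value <- rowSums(!is.na(df1[value_cols])) > 0

# Rows to fill: V2 is NA, but the row is not entirely NA.
indx <- has_value & is.na(df1$V2)

# Work on a copy so df1 stays available for comparison.
filled <- df1
filled$V2[indx] <- 3

print(filled)
```

Row 220 is left untouched because all of its value columns are NA.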

David Arenburg answered Sep 19 '22 00:09