Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replace specific values based on another dataframe

First, let's start with DataFrame 1 (DF1) :

DF1 <- data.frame(c("06/19/2016", "06/20/2016", "06/21/2016", "06/22/2016", 
                    "06/23/2016", "06/19/2016", "06/20/2016", "06/21/2016",
                    "06/22/2016", "06/23/2016"),
                  c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                  c(149, 150, 151, 152, 155, 84, 83, 80, 81, 97),
                  c(101, 102, 104, 107, 99, 55, 55, 56, 57, 58),
                  c("MTL", "MTL", "MTL", "MTL", "MTL", "NY", "NY", 
                    "NY", "NY", "NY"))
colnames(DF1) <- c("date", "id", "sales", "cost", "city")

I also have DataFrame 2 (DF2) :

DF2 <- data.frame(c("06/19/2016", "06/27/2016", "06/22/2016", "06/23/2016"),
                  c(1, 1, 2, 2),
                  c(9999, 8888, 777, 555),
                  c("LON", "LON", "QC", "QC"))
colnames(DF2) <- c("date", "id", "sales", "city")

For every rows in DF1, I have to look if there is a row in DF2 that has the same date and id. If yes, I have to replace the values in DF1 by the values in DF2.

DF2 will always have less columns than DF1. If a column is not in DF2, I must keep the original value that was in DF1 for that specific column.

The final output would like this:

results <- data.frame(c("06/19/2016", "06/20/2016", "06/21/2016", "06/22/2016",
                        "06/23/2016", "06/19/2016", "06/20/2016", "06/21/2016",
                        "06/22/2016", "06/23/2016"),
                      c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                      c(9999, 150, 151, 152, 155, 84, 83, 80, 777, 555),
                      c(101, 102, 104, 107, 99, 55, 55, 56, 57, 58),
                      c("LON", "MTL", "MTL", "MTL", "MTL", "NY", "NY", 
                        "NY", "QC", "QC"))
colnames(results) <- c("date", "id", "sales", "cost", "city")

Do you have any suggestions?

like image 348
Alexis Avatar asked Jun 23 '16 14:06

Alexis


People also ask

How do you change a value in a DataFrame based on a condition?

You can replace values of all or selected columns based on the condition of pandas DataFrame by using DataFrame. loc[ ] property. The loc[] is used to access a group of rows and columns by label(s) or a boolean array. It can access and can also manipulate the values of pandas DataFrame.

How do you replace a column in a DataFrame with another column?

In order to replace a value in Pandas DataFrame, use the replace() method with the column the from and to values.

How do you update a PySpark DataFrame with new values from another DataFrame?

You can do update a PySpark DataFrame Column using withColum(), select() and sql(), since DataFrame's are distributed immutable collection you can't really change the column values however when you change the value using withColumn() or any approach, PySpark returns a new Dataframe with updated values.


2 Answers

You could use the join functionality of the data.table-package for this:

library(data.table)
setDT(DF1)
setDT(DF2)

DF1[DF2, on = .(date, id), `:=` (city = i.city, sales = i.sales)]

which gives:

> DF1
          date id sales cost city
 1: 06/19/2016  1  9999  101  LON
 2: 06/20/2016  1   150  102  MTL
 3: 06/21/2016  1   151  104  MTL
 4: 06/22/2016  1   152  107  MTL
 5: 06/23/2016  1   155   99  MTL
 6: 06/19/2016  2    84   55   NY
 7: 06/20/2016  2    83   55   NY
 8: 06/21/2016  2    80   56   NY
 9: 06/22/2016  2   777   57   QC
10: 06/23/2016  2   555   58   QC

When you have many columns in both datasets, it is easier to use mget instead off typing all the column names. For the used data in the question it would look like:

DF1[DF2, on = .(date, id), names(DF2)[3:4] := mget(paste0("i.", names(DF2)[3:4]))]

When you want to construct a vector of columnnames that need to be added beforehand, you could do this as follows:

cols <- names(DF2)[3:4]
DF1[DF2, on = .(date, id), (cols) := mget(paste0("i.", cols))]
like image 186
Jaap Avatar answered Sep 20 '22 01:09

Jaap


df <- merge(DF1, DF2, by = c("date", "id"), all.x=TRUE)

tmp1 <- df[is.na(df$sales.y) & is.na(df$city.y),]
tmp1$sales.y <- NULL
tmp1$city.y <- NULL
names(tmp1)[names(tmp1) == "sales.x"] <- "sales"
names(tmp1)[names(tmp1) == "city.x"] <- "city"

tmp2 <- df[!is.na(df$sales.y) & !is.na(df$city.y),]
tmp2$sales.x <- NULL
tmp2$city.x <- NULL
names(tmp2)[names(tmp2) == "sales.y"] <- "sales"
names(tmp2)[names(tmp2) == "city.y"] <- "city"

results <- rbindlist(list(tmp1,tmp2), use.names= TRUE, fill = TRUE)

See the result

like image 39
Moun Avatar answered Sep 24 '22 01:09

Moun