First, let's start with DataFrame 1 (DF1):
DF1 <- data.frame(c("06/19/2016", "06/20/2016", "06/21/2016", "06/22/2016",
                    "06/23/2016", "06/19/2016", "06/20/2016", "06/21/2016",
                    "06/22/2016", "06/23/2016"),
                  c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                  c(149, 150, 151, 152, 155, 84, 83, 80, 81, 97),
                  c(101, 102, 104, 107, 99, 55, 55, 56, 57, 58),
                  c("MTL", "MTL", "MTL", "MTL", "MTL", "NY", "NY",
                    "NY", "NY", "NY"),
                  stringsAsFactors = FALSE)  # keep strings as characters, not factors
colnames(DF1) <- c("date", "id", "sales", "cost", "city")
I also have DataFrame 2 (DF2):
DF2 <- data.frame(c("06/19/2016", "06/27/2016", "06/22/2016", "06/23/2016"),
                  c(1, 1, 2, 2),
                  c(9999, 8888, 777, 555),
                  c("LON", "LON", "QC", "QC"),
                  stringsAsFactors = FALSE)
colnames(DF2) <- c("date", "id", "sales", "city")
For every row in DF1, I have to check whether there is a row in DF2 with the same date and id. If there is, I have to replace the values in DF1 with the values from DF2.
DF2 will always have fewer columns than DF1. If a column is not in DF2, I must keep the original DF1 value for that column.
The final output would look like this:
results <- data.frame(c("06/19/2016", "06/20/2016", "06/21/2016", "06/22/2016",
                        "06/23/2016", "06/19/2016", "06/20/2016", "06/21/2016",
                        "06/22/2016", "06/23/2016"),
                      c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                      c(9999, 150, 151, 152, 155, 84, 83, 80, 777, 555),
                      c(101, 102, 104, 107, 99, 55, 55, 56, 57, 58),
                      c("LON", "MTL", "MTL", "MTL", "MTL", "NY", "NY",
                        "NY", "QC", "QC"),
                      stringsAsFactors = FALSE)
colnames(results) <- c("date", "id", "sales", "cost", "city")
Do you have any suggestions?
You could use the join functionality of the data.table package for this:
library(data.table)
setDT(DF1)
setDT(DF2)

# update join: for the DF1 rows that match a DF2 row on date and id,
# overwrite city and sales with DF2's values (the i. prefix refers to DF2)
DF1[DF2, on = .(date, id), `:=` (city = i.city, sales = i.sales)]
which gives:
> DF1
          date id sales cost city
 1: 06/19/2016  1  9999  101  LON
 2: 06/20/2016  1   150  102  MTL
 3: 06/21/2016  1   151  104  MTL
 4: 06/22/2016  1   152  107  MTL
 5: 06/23/2016  1   155   99  MTL
 6: 06/19/2016  2    84   55   NY
 7: 06/20/2016  2    83   55   NY
 8: 06/21/2016  2    80   56   NY
 9: 06/22/2016  2   777   57   QC
10: 06/23/2016  2   555   58   QC
Note that the DF2 row with date 06/27/2016 has no match in DF1, so the update join simply ignores it.
When you have many columns in both datasets, it is easier to use mget instead of typing all the column names. For the data in the question, that would look like:
DF1[DF2, on = .(date, id), names(DF2)[3:4] := mget(paste0("i.", names(DF2)[3:4]))]
When you want to construct the vector of column names to be updated beforehand, you can do it as follows:
cols <- names(DF2)[3:4]
DF1[DF2, on = .(date, id), (cols) := mget(paste0("i.", cols))]
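If you prefer the tidyverse, dplyr has rows_update(), which does exactly this kind of "patch" operation. A minimal sketch, assuming plain data frames with character (not factor) columns and dplyr >= 1.1.0 (needed for the unmatched argument, since DF2 contains a date/id pair that does not occur in DF1):

library(dplyr)

# update matching (date, id) rows of DF1 with the sales/city values from DF2;
# columns DF2 lacks (cost) are left untouched, and unmatched = "ignore"
# skips DF2 rows (e.g. 06/27/2016) that have no partner in DF1
results <- rows_update(DF1, DF2, by = c("date", "id"), unmatched = "ignore")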
Alternatively, you can do this with a plain merge() and some column bookkeeping:

# left join: DF1 rows without a match in DF2 get NA in sales.y and city.y
df <- merge(DF1, DF2, by = c("date", "id"), all.x = TRUE)

# rows without a match: keep the original sales/city from DF1
tmp1 <- df[is.na(df$sales.y) & is.na(df$city.y), ]
tmp1$sales.y <- NULL
tmp1$city.y <- NULL
names(tmp1)[names(tmp1) == "sales.x"] <- "sales"
names(tmp1)[names(tmp1) == "city.x"] <- "city"

# rows with a match: take sales/city from DF2
tmp2 <- df[!is.na(df$sales.y) & !is.na(df$city.y), ]
tmp2$sales.x <- NULL
tmp2$city.x <- NULL
names(tmp2)[names(tmp2) == "sales.y"] <- "sales"
names(tmp2)[names(tmp2) == "city.y"] <- "city"

# stack both parts again (rbindlist comes from data.table, loaded above)
results <- rbindlist(list(tmp1, tmp2), use.names = TRUE, fill = TRUE)

Note that this assumes DF2 never contains genuine NA values in sales or city, and that the row and column order of results will differ from DF1.
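For completeness, here is a minimal base-R sketch using match(), which updates DF1 in place and keeps its original row and column order (assuming the date/id combinations in DF2 are unique):

# index of the matching DF2 row for each DF1 row (NA where there is no match)
idx <- match(paste(DF1$date, DF1$id), paste(DF2$date, DF2$id))
hit <- !is.na(idx)

# overwrite only the matched rows, and only the columns that DF2 provides
DF1$sales[hit] <- DF2$sales[idx[hit]]
DF1$city[hit]  <- DF2$city[idx[hit]]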