In Excel, it is easy to grab a cell within a column and drag the cursor downward to replace many cells below so that each cell becomes the same value as the original.
This function can be performed in R using a for loop. I spent some time trying to figure it out today, and thought I'd share for the benefit of the next person in my shoes:
for (row in 2:length(data$column)){ # 2 so you don't affect column names
if(data$column[row] == "") { # if its empty...
data$column[row] = data$column[row-1] # ...replace with previous row's value
}
}
This worked for me, although it took a long time (5-10 mins) to run with a huge data file. Perhaps there is a more efficient way of achieving this function, and I encourage anyone to say how that could be done.
Thanks and good luck.
Borrowing agstudy's MWE:
library(dplyr)
library(zoo)
N = 1e6
df <- data.frame(x = sample(c(NA,"A","B"), size=N, replace=TRUE))
system.time(test <- df %>% dplyr::do(zoo::na.locf(.)))
user system elapsed
0.082 0.000 0.130
df <- data.frame(a = c(1:5, "", 3, "", "", "", 4), stringsAsFactors = FALSE)
> df
a
1 1
2 2
3 3
4 4
5 5
6
7 3
8
9
10
11 4
while(length(ind <- which(df$a == "")) > 0){
df$a[ind] <- df$a[ind -1]
}
> df
a
1 1
2 2
3 3
4 4
5 5
6 5
7 3
8 3
9 3
10 3
11 4
EDIT: added time profile
set.seed(1)
N = 1e6
df <- data.frame(a = sample(c("",1,2),size=N,replace=TRUE),
stringsAsFactors = FALSE)
if(df$a[1] == "") {df$a[1] <- NA}
system.time(
while(length(ind <- which(df$a == "")) > 0){
df$a[ind] <- df$a[ind - 1]
}, gcFirst = TRUE)
user system elapsed
0.89 0.00 0.88
Here fast solution using na.locf from the zoo package applied within data.table. I created a new column y in the result to better visualize the effect of replacing missing values( easy to repalce x column here). Since na.locf replaced missing values , an extra step was needed to replace all zero length values by NA. The solution is very fast and takes less than half second in my machine for 1e6 rows.
library(data.table)
library(zoo)
N=1e6 ## number of rows
DT <- data.table(x=sample(c("",1,2),size=N,replace=TRUE))
system.time(DT[!nzchar(x),x:=NA][,y:=na.locf(x)])
## user system elapsed
## 0.59 0.30 1.78
# x y
# 1: 2 2
# 2: NA 2
# 3: NA 2
# 4: 1 1
# 5: 1 1
# ---
# 999996: 1 1
# 999997: 2 2
# 999998: 2 2
# 999999: NA 2
# 1000000: NA 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With