In Excel, it is easy to grab a cell within a column and drag the cursor downward to replace many cells below so that each cell becomes the same value as the original.
This function can be performed in R using a for loop. I spent some time trying to figure it out today, and thought I'd share for the benefit of the next person in my shoes:
for (row in 2:length(data$column)){ # 2 so you don't affect column names
if(data$column[row] == "") { # if its empty...
data$column[row] = data$column[row-1] # ...replace with previous row's value
}
}
This worked for me, although it took a long time (5-10 mins) to run with a huge data file. Perhaps there is a more efficient way of achieving this function, and I encourage anyone to say how that could be done.
Thanks and good luck.
Borrowing agstudy's MWE:
library(dplyr)
library(zoo)
N = 1e6
df <- data.frame(x = sample(c(NA,"A","B"), size=N, replace=TRUE))
system.time(test <- df %>% dplyr::do(zoo::na.locf(.)))
user system elapsed
0.082 0.000 0.130
df <- data.frame(a = c(1:5, "", 3, "", "", "", 4), stringsAsFactors = FALSE)
> df
a
1 1
2 2
3 3
4 4
5 5
6
7 3
8
9
10
11 4
while(length(ind <- which(df$a == "")) > 0){
df$a[ind] <- df$a[ind -1]
}
> df
a
1 1
2 2
3 3
4 4
5 5
6 5
7 3
8 3
9 3
10 3
11 4
EDIT: added time profile
set.seed(1)
N = 1e6
df <- data.frame(a = sample(c("",1,2),size=N,replace=TRUE),
stringsAsFactors = FALSE)
if(df$a[1] == "") {df$a[1] <- NA}
system.time(
while(length(ind <- which(df$a == "")) > 0){
df$a[ind] <- df$a[ind - 1]
}, gcFirst = TRUE)
user system elapsed
0.89 0.00 0.88
Here fast solution using na.locf
from the zoo
package applied within data.table
. I created a new column y in the result to better visualize the effect of replacing missing values( easy to repalce x column here). Since na.locf
replaced missing values , an extra step was needed to replace all zero length values by NA
. The solution is very fast and takes less than half second in my machine for 1e6 rows.
library(data.table)
library(zoo)
N=1e6 ## number of rows
DT <- data.table(x=sample(c("",1,2),size=N,replace=TRUE))
system.time(DT[!nzchar(x),x:=NA][,y:=na.locf(x)])
## user system elapsed
## 0.59 0.30 1.78
# x y
# 1: 2 2
# 2: NA 2
# 3: NA 2
# 4: 1 1
# 5: 1 1
# ---
# 999996: 1 1
# 999997: 2 2
# 999998: 2 2
# 999999: NA 2
# 1000000: NA 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With