If one wants to fill in missing values of a variable based on previous/posterior non NA observation within a group, the data.table command is
setkey(DT,id,date)
DT[, value_filled_in := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]]
which is quite complex. It's a shame since roll
is a very fast and powerful option (esp compared with applying a function such as zoo::na.locf
within each group)
I can write a convenience function to fill in missing values
fill_na <- function(x , by = NULL, roll =TRUE , rollends= if (roll=="nearest") c(TRUE,TRUE)
else if (roll>=0) c(FALSE,TRUE)
else c(TRUE,FALSE)){
id <- seq_along(x)
if (is.null(by)){
DT <- data.table("x" = x, "id" = id, key = "id")
return(DT[!is.na(x)][DT[, list(id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])
} else{
DT <- data.table("x" = x, "by" = by, "id" = id, key = c("by", "id"))
return(DT[!is.na(x)][DT[, list(by, id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])
}
}
And then write
setkey(DT,id, date)
DT[, value_filled_in := fill_na(value, by = id)]
This is not really satisfying since one would like to write
setkey(DT,id, date)
DT[, value_filled_in := fill_na(value), by = id]
However, this takes a huge amount of time to run. And, for the end-user, it is cumbersome to learn that fill_na
should be called with the by
option, and should not be used with data.table
by
. Is there an elegant solution around this?
Some speed test
N <- 2e6
set.seed(1)
DT <- data.table(
date = sample(10, N, TRUE),
id = sample(1e5, N, TRUE),
value = sample(c(NA,1:5), N, TRUE),
value2 = sample(c(NA,1:5), N, TRUE)
)
setkey(DT,id,date)
DT<- unique(DT)
system.time(DT[, filled0 := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]])
#> user system elapsed
#> 0.086 0.006 0.105
system.time(DT[, filled1 := zoo::na.locf.default(value, na.rm = FALSE), by = id])
#> user system elapsed
#> 5.235 0.016 5.274
# (lower speed and no built in option like roll=integer or roll=nearest, rollend, etc)
system.time(DT[, filled2 := fill_na(value, by = id)])
#> user system elapsed
#> 0.194 0.019 0.221
system.time(DT[, filled3 := fill_na(value), by = id])
#> user system elapsed
#> 237.256 0.913 238.405
Why don't I just use na.locf.default
? Even though the speed difference is not really important, the same issue arises for other kinds of data.table commands (those that rely on a merge by the variable in "by") - it's a shame to systematically ignore them in order to get an easier syntax. I also really like all the roll options.
Use the fillna() Method: The fillna() function iterates through your dataset and fills all null rows with a specified value. It accepts some optional arguments—take note of the following ones: Value: This is the value you want to insert into the missing rows. Method: Lets you fill missing values forward or in reverse.
Right-click Variable with missing values in the table preview on the canvas pane and select Categories and Totals from the pop-up menu. Click (check) Missing Values in the Categories and Totals dialog box, and then click Apply. Now the table preview includes a Missing Values category.
Think of .N as a variable for the number of instances. For example: dt <- data.table(a = LETTERS[c(1,1:3)], b = 4:7) dt[.N] # returns the last row # a b # 1: C 7.
First, if we want to exclude missing values from mathematical operations use the na. rm = TRUE argument. If you do not exclude these values most functions will return an NA . We may also desire to subset our data to obtain complete observations, those observations (rows) in our data that contain no missing data.
How I can fill the columns with missing pieces of information (article number, article name) based on the Source Data, previous ranking period Same columns in both tables Same columns in both tables Same columns in both tables Missing info: Article-nr and Article - same as on photo 1 same values in other columnes between those two tables.
Filling missing data with a mean or median value is applicable when the columns involved have integer or float data types. You can also fill missing data with the mode value, which is the most occurring value. This is also applicable to integers or floats.
You can fix missing data by either dropping or filling them with other values. In this article, we'll explain and explore the different ways to fill missing data using pandas. 1. Use the fillna () Method:
Instead of filling missing values, we sometimes need to replace the data with missing values. This might be required in situations when missing values are coded with a number or the actual values are not useful or sensible for the data study.
Here's a slightly faster and more compact way of doing it (version 1.9.3+):
DT[, filled4 := DT[!is.na(value)][DT, value, roll = T]]
There is now a native data.table
way of filling missing values (as of 1.12.4
).
This question spawned a github issue which was recently closed with the creation of functions nafill
and setnafill
. You can now use
DT[, value_filled_in := nafill(value, type = "locf")]
It is also possible to fill NA
with a constant value or next observation carried back.
One difference to the approach in the question is that these functions currently only work on NA
not NaN
whereas is.na
is TRUE
for NaN
- this is planned to be fixed in the next release through an extra argument.
I have no involvement with the project but I saw that although the github issue links here, there was no link the other way so I'm answering on behalf of future visitors.
Update: By default NaN
is now treated same as NA
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With