I want to fill some NA values in a data.table (without using groups). Please consider this extract of a data.table representing times and distances:
library(data.table)
df <- data.frame(time = seq(7173, 7195, 1), dist = c(31091.33, NA, 31100.00, 31103.27, NA, NA, NA, NA, 31124.98, NA,31132.81, NA, NA, NA, NA, 31154.19, NA, 31161.47, NA, NA, NA, NA, 31182.97))
DT<- data.table(df)
In the DT data.table, I want to fill the NA values with a function of the non-NA values before and after them. In other words, I am looking for a function in j that replaces each instruction such as
DT[2, dist := (31091.33 + (31100-31091.33) / 2)]
then
DT[5:8, dist := (31103.27 + "something" * (31124.98 - 31103.27) / 5)]
etc...
This is a job for the standard linear interpolation formula: y = y1 + ((x - x1) / (x2 - x1)) * (y2 - y1), where x is the known value, y is the unknown value, (x1, y1) is the observed data point below x, and (x2, y2) is the observed data point above x. Drawing a straight line through two known points lets you approximate the values at the points between them (or, for extrapolation, beyond them): the two points define the equation of a line, which is then used to compute any new data points along that line.
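As a quick sanity check of the formula, here is what it gives for the first NA row of the data above (time = 7174), using its two non-NA neighbours; the names x1, y1, x2, y2 just mirror the formula:
x1 <- 7173; y1 <- 31091.33   # known point below the missing one
x2 <- 7175; y2 <- 31100.00   # known point above it
x  <- 7174                   # time at which dist is missing
y1 + ((x - x1) / (x2 - x1)) * (y2 - y1)   # about 31095.67, as in the filled-in output below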
The code is explained inline in the comments. You can delete the temporary helper columns afterwards, e.g. with df[, dist_before := NULL]; a one-call cleanup is shown after the final output below.
library(data.table)
df=data.table(time=seq(7173,7195,1),dist=c(31091.33,NA,31100.00,31103.27,NA,NA,NA,
NA,31124.98,NA,31132.81,NA,NA,NA,NA,31154.19,NA,31161.47,NA,NA,NA,NA,31182.97))
df
#> time dist
#> 1: 7173 31091.33
#> 2: 7174 NA
#> 3: 7175 31100.00
#> 4: 7176 31103.27
#> 5: 7177 NA
#> 6: 7178 NA
#> 7: 7179 NA
#> 8: 7180 NA
#> 9: 7181 31124.98
#> 10: 7182 NA
#> 11: 7183 31132.81
#> 12: 7184 NA
#> 13: 7185 NA
#> 14: 7186 NA
#> 15: 7187 NA
#> 16: 7188 31154.19
#> 17: 7189 NA
#> 18: 7190 31161.47
#> 19: 7191 NA
#> 20: 7192 NA
#> 21: 7193 NA
#> 22: 7194 NA
#> 23: 7195 31182.97
#> time dist
# Carry forward the last non-missing observation
df[,dist_before := nafill(dist, "locf")]
# Bring back the next non-missing dist
df[,dist_after := nafill(dist, "nocb")]
# rleid will create groups based on run-lengths of values within the data.
# This means 4 NA's in a row will be grouped together, for example
# (a tiny standalone illustration of rleid follows the output below).
# We then count the rows in each run and add 1, because we want the
# last NA before the next non-missing value to interpolate to something below it.
df[, rle := rleid(dist)][, missings := max(.N + 1, 2), by = rle][]
#> time dist dist_before dist_after rle missings
#> 1: 7173 31091.33 31091.33 31091.33 1 2
#> 2: 7174 NA 31091.33 31100.00 2 2
#> 3: 7175 31100.00 31100.00 31100.00 3 2
#> 4: 7176 31103.27 31103.27 31103.27 4 2
#> 5: 7177 NA 31103.27 31124.98 5 5
#> 6: 7178 NA 31103.27 31124.98 5 5
#> 7: 7179 NA 31103.27 31124.98 5 5
#> 8: 7180 NA 31103.27 31124.98 5 5
#> 9: 7181 31124.98 31124.98 31124.98 6 2
#> 10: 7182 NA 31124.98 31132.81 7 2
#> 11: 7183 31132.81 31132.81 31132.81 8 2
#> 12: 7184 NA 31132.81 31154.19 9 5
#> 13: 7185 NA 31132.81 31154.19 9 5
#> 14: 7186 NA 31132.81 31154.19 9 5
#> 15: 7187 NA 31132.81 31154.19 9 5
#> 16: 7188 31154.19 31154.19 31154.19 10 2
#> 17: 7189 NA 31154.19 31161.47 11 2
#> 18: 7190 31161.47 31161.47 31161.47 12 2
#> 19: 7191 NA 31161.47 31182.97 13 5
#> 20: 7192 NA 31161.47 31182.97 13 5
#> 21: 7193 NA 31161.47 31182.97 13 5
#> 22: 7194 NA 31161.47 31182.97 13 5
#> 23: 7195 31182.97 31182.97 31182.97 14 2
#> time dist dist_before dist_after rle missings
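To see what the rleid step above is doing, here is a tiny standalone illustration (consecutive NAs fall into the same run, so each block of missing values gets its own group id, just as in the rle column above):
rleid(c(1, NA, NA, 2, NA))
#> [1] 1 2 2 3 4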
# .SD[,.I] will get us the row number relative to the group it is in.
# For example, row 5 dist is calculated as
# dist_before + 1 * (dist_after - dist_before)/5
df[is.na(dist), dist := dist_before + .SD[,.I] *
(dist_after - dist_before)/(missings), by = rle]
df[]
#> time dist dist_before dist_after rle missings
#> 1: 7173 31091.33 31091.33 31091.33 1 2
#> 2: 7174 31095.67 31091.33 31100.00 2 2
#> 3: 7175 31100.00 31100.00 31100.00 3 2
#> 4: 7176 31103.27 31103.27 31103.27 4 2
#> 5: 7177 31107.61 31103.27 31124.98 5 5
#> 6: 7178 31111.95 31103.27 31124.98 5 5
#> 7: 7179 31116.30 31103.27 31124.98 5 5
#> 8: 7180 31120.64 31103.27 31124.98 5 5
#> 9: 7181 31124.98 31124.98 31124.98 6 2
#> 10: 7182 31128.90 31124.98 31132.81 7 2
#> 11: 7183 31132.81 31132.81 31132.81 8 2
#> 12: 7184 31137.09 31132.81 31154.19 9 5
#> 13: 7185 31141.36 31132.81 31154.19 9 5
#> 14: 7186 31145.64 31132.81 31154.19 9 5
#> 15: 7187 31149.91 31132.81 31154.19 9 5
#> 16: 7188 31154.19 31154.19 31154.19 10 2
#> 17: 7189 31157.83 31154.19 31161.47 11 2
#> 18: 7190 31161.47 31161.47 31161.47 12 2
#> 19: 7191 31165.77 31161.47 31182.97 13 5
#> 20: 7192 31170.07 31161.47 31182.97 13 5
#> 21: 7193 31174.37 31161.47 31182.97 13 5
#> 22: 7194 31178.67 31161.47 31182.97 13 5
#> 23: 7195 31182.97 31182.97 31182.97 14 2
#> time dist dist_before dist_after rle missings
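Once dist is filled in, you can drop the temporary helper columns in one call; a minimal cleanup, assuming you created all four of them as above:
# Remove the helper columns now that dist is interpolated
df[, c("dist_before", "dist_after", "rle", "missings") := NULL]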
You can use the approx function to do linear interpolation. For each group of NAs, get that subset of DT plus the rows before and after it. Then apply approx to this subset of the dist vector, with the n argument of approx equal to the number of rows in the subset, .N.
DT[, g := rleid(dist)]
DT[is.na(dist), dist := {
i <- .I[c(1, .N)] + c(-1, 1)
DT[i[1]:i[2], approx(dist, n = .N)$y[-c(1, .N)]]
}, by = g]
Or, without approx
DT[, g := rleid(dist)]
DT[is.na(dist), dist := {
i <- .I[c(1, .N)] + c(-1, 1)
DT[i[1]:i[2], dist[1] + 1:(.N - 2)*(dist[.N] - dist[1])/(.N - 1)]
}, by = g]
Edit: since this answer was accepted, I feel I should point out that other answers are faster, and the second part of @dww's answer is basically my first code block with the unnecessary grouping removed (so it is simpler and faster).
Using library(zoo)
DT[, dist := na.approx(dist)]
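As a side note, if the time steps were irregular you could interpolate against the actual time values rather than the row positions; a hedged sketch using na.approx's x argument (with this evenly spaced data it gives the same result):
# Interpolate dist along the time column instead of the row index
DT[, dist := na.approx(dist, x = time)]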
Alternatively, if you prefer to stick to base R functions rather than use another package, then you can do
DT[, dist := approx(.I, dist, .I)$y]