Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In R how to fill in numbers that do not appear in the sequence?

Tags:

r

seq

I have a data set that list the percentiles for a set of scores like this:

 > percentiles
    Score Percentile
1     231          0
2     385          1
3     403          2
4     413          3
5     418          4
6     424          5
7     429          6
8     434          7
9     437          8
10    441          9
11    443         10

I would like the "Score" column to run from 100 to 500. That is, I would like Scores 100 to 231 to be associated with a Percentile of 0, Scores 232 to 385 to be associated with a Percentile of 1, etc. Is there a simple way to fill in the values that do not appear in the sequence of "Score" values so it looks like the below data set?

> percentiles
    Score Percentile
1     100          0
2     101          0
3     102          0
4     103          0
5     104          0
6     105          0
7     106          0
8     107          0
9     108          0
10    109          0
--------------------
130   229          0
131   230          0
132   231          0
133   232          1
134   233          1
135   234          1
136   235          1
137   236          1
138   237          1
139   238          1
140   239          1


like image 515
Kate N Avatar asked Jan 01 '23 10:01

Kate N


2 Answers

If you convert percentiles to a data.table, you could do a rolling join with a new table of all scores 100:500. The rolling join with roll = -Inf gives a fill-backward behavior by itself, but still the 444:500 values are NA so a forward nafill is added at the end.

library(data.table)
setDT(percentiles)

percentiles[data.table(Score = 100:500), on = .(Score), roll = -Inf
            ][, Percentile := nafill(Percentile, 'locf')]


#      Score Percentile
#   1:   100          0
#   2:   101          0
#   3:   102          0
#   4:   103          0
#   5:   104          0
#  ---                 
# 397:   496         10
# 398:   497         10
# 399:   498         10
# 400:   499         10
# 401:   500         10
like image 82
IceCreamToucan Avatar answered May 16 '23 08:05

IceCreamToucan


You might think about this differently: instead of a data frame to fill, as a set of breaks for binning your scores. Use the scores as the breaks with -Inf tacked on to have the lower bound. If you need something different to happen for the scores above the highest break, add Inf to the end of the breaks, but you'll need to come up with an additional label.

library(dplyr)

dat <- data.frame(Score = 100:500) %>%
  mutate(Percentile = cut(Score, breaks = c(-Inf, percentiles$Score), 
                          labels = percentiles$Percentile, 
                          right = T, include.lowest = F)) 

Taking a look at a few of the breaking points:

slice(dat, c(129:135, 342:346))
#>    Score Percentile
#> 1    228          0
#> 2    229          0
#> 3    230          0
#> 4    231          0
#> 5    232          1
#> 6    233          1
#> 7    234          1
#> 8    441          9
#> 9    442         10
#> 10   443         10
#> 11   444       <NA>
#> 12   445       <NA>
like image 41
camille Avatar answered May 16 '23 08:05

camille