Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clean R data frame so that in a column no row value is bigger than 2 times next row value

I have a data frame exemplified by the following

dist <- c(1.1,1.0,10.0,5.0,2.1,12.2,3.3,3.4)
id <- rep("A",length(dist))
df<-cbind.data.frame(id,dist)

df

  id dist
1  A  1.1
2  A  1.0
3  A 10.0
4  A  5.0
5  A  2.1
6  A 12.2
7  A  3.3
8  A  3.4

I need to clean it up so no row values in the dist column is bigger than 2 times the next row value at any time. A cleaned up data frame would look like this:

  id dist
1  A  1.1
2  A  1.0
5  A  2.1
7  A  3.3
8  A  3.4

I have tried making a function with a for loop and if statement to clean it

cleaner <-  function (df,dist,times_larger) {

              for (i in 1:(nrow(df)-1)) {

                  if (df$dist[i] > df$dist[i+1]*times_larger){
                    df<-df[-i,]
                    break       
                  }
              }
              df
            }

Obviously if I dont break the loop it will create an error because the number of rows in df will change in the process. If I manually run the loop on df several times:

df<-cleaner(df,"dist",2)

it will clean up as I want.

I have also tried different function constructions and applying it to the data frame with apply, but without any luck.

Do any have a good suggestion of either how to repeat the function on the data frame until it does not change anymore, a better function structure or maybe a better way of cleaning?

Any suggestions are most appreciated

like image 660
Kristian Avatar asked Jan 29 '15 17:01

Kristian


2 Answers

You can shift your dist column one element left, multiply it by two, and compare with the original dist:

subset(df,dist < c(2*dist[-1],Inf))
#  id dist
#1  A  1.1
#2  A  1.0
#5  A  2.1
#7  A  3.3
#8  A  3.4
like image 104
Marat Talipov Avatar answered Sep 20 '22 15:09

Marat Talipov


You could try lead from dplyr

library(dplyr) #dplyr_0.4.0
filter(df, dist < 2 * lead(dist, default = Inf)) 
#    id dist
#1  A  1.1
#2  A  1.0
#3  A  2.1
#4  A  3.3
#5  A  3.4

Or using the similar method in data.table. A new function shift is introduced in the devel version of data.table. We can specify the type to lead. By default, it is lag and fill is NA. Modify the fill to 'Inf' (inspired from @Marat Talipov's post).

library(data.table) #data.table_1.9.5
setDT(df)[dist <2 *shift(dist,type='lead', fill=Inf)]
#   id dist
#1:  A  1.1
#2:  A  1.0
#3:  A  2.1
#4:  A  3.3
#5:  A  3.4

Update

If the value of 'dist' is equal to '2' times the next value, the above solutions removes that row. In such cases,

setDT(df)[dist <2 *(shift(dist,type='lead',
             fill=Inf)+.Machine$double.eps)]
#    id dist
#1:  A  1.1
#2:  A  1.0
#3:  A  2.1
#4:  A  3.3
#5:  A  3.4

Using a different example as commented by @Henrik.

df1 <- data.frame(dist= as.numeric(3:1))
setDT(df1)[dist <2 *(shift(dist,type='lead', 
            fill=Inf)+.Machine$double.eps)]
#    dist
#1:    3
#2:    2
#3:    1

Benchmarks

set.seed(49)
df <- data.frame(id='A', dist=rnorm(1e7,20))
df1 <- copy(df)
akrun1 <- function() {filter(df, dist < 2 * lead(dist,
                                 default = Inf)) }
akrun2 <- function() {setDT(df1)[dist <2 *shift(dist,type='lead',
                                     fill=Inf)]}
marat <- function() {subset(df,dist < c(2*dist[-1],Inf))}
Colonel <- function() {df[with(df, dist<2*c(dist[-1], tail(dist,1))),]}

library(microbenchmark)
microbenchmark(akrun1(), akrun2(), marat(), Colonel(), 
                                unit='relative', times=20L)
#Unit: relative
#    expr      min       lq     mean   median       uq      max neval  cld
# akrun1() 2.029087 1.990739 1.864697 1.965247 1.773722 1.727474    20  b  
# akrun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20  a   
# marat() 8.032147 8.137982 7.359821 7.937062 7.134686 5.837623     20  d
#Colonel() 7.094465 7.045000 6.473552 6.903460 6.197737 5.359575    20  c 
like image 30
akrun Avatar answered Sep 17 '22 15:09

akrun