How to compress/delete rows with conditions in R?

I have a data frame (see below). How do I compress/delete rows based on some conditions? Within each run of consecutive 1s in V1, I would like to keep only the row with the highest Volume.

E.g. V1 of df[2:5,] is all 1; the highest Volume among these rows is in df[4,], so delete df[c(2,3,5),]. For df[9:10,], the higher Volume of the two is df[10,], so delete df[9,], and so on. The result should look like the Example Picture below, where V1 alternates like 0101010101....

How can I achieve this without loops, i.e. in a vectorized way, so that it stays fast when dealing with millions of rows?

Update: I apply the same logic to V2, but there I only delete a row when its Volume is smaller; that's why row 13 is deleted in the Example Picture: the Volume of row 13 is less than the Volume of row 14.

Update (another question): I have tried akrun's method, where the maximum Volume is chosen from each group, but when two or more rows share the maximum, a run of consecutive 1s remains, as in row 9 below. How do I remove the duplicate row? I could use duplicated(df$Volume); is there another way? Thanks.

Update: Following up on akrun's attempt: that code also deletes rows where V1 == 0, which is not what I'm after. I only want to delete a row when V1 == 1 and its Volume is smaller than the maximum Volume of its group. I know this will leave some consecutive 0s in V1; removing the duplicated 0s will depend on Weight, which is why I need to delete only rows with V1 == 1 and cannot delete rows with V1 == 0.

#   Volume Weight V1 V2
#1  0.5367 0.5367  0  1
#4  1.1457 1.1413  1  0
#6  0.5694 0.5633  0  1
#7  1.2368 1.2343  1  0
#8  0.9662 0.9593  0  1
#9  1.4102 1.3923  1  0
#10 1.4102 1.3995  1  0
#11 1.1132 1.1069  0  1
#12 1.4535 1.3923  1  0
#14 1.1475 1.1447  0  1
#15 1.1790 1.1748  1  0
#18 1.1557 1.1552  0  1
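
As an aside (an illustrative sketch, not taken from the answers below): one way to guarantee a single row per run even when the maximum Volume is tied, as with rows 9 and 10 above, is which.max(), which returns only the first matching index. This assumes the data.table package and its rleid() run-id helper:

library(data.table)
# within each run of V1, keep only the first row attaining the maximum Volume
# when V1 == 1, and keep every row when V1 == 0
setDT(df)[, if (V1[1] == 1) .SD[which.max(Volume)] else .SD, by = rleid(V1)][]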

Data-frame df

    Volume Weight V1 V2 
 1: 0.5367 0.5367  0  1
 2: 0.8645 0.8508  1  0
 3: 0.8590 0.8585  1  0
 4: 1.1457 1.1413  1  0
 5: 0.8573 0.8568  1  0
 6: 0.5694 0.5633  0  1
 7: 1.2368 1.2343  1  0
 8: 0.9662 0.9593  0  1
 9: 1.3550 1.3412  1  0
10: 1.4102 1.3995  1  0
11: 1.1132 1.1069  0  1
12: 1.4535 1.3923  1  0
13: 1.0437 1.0344  0  1
14: 1.1475 1.1447  0  1
15: 1.1790 1.1748  1  0
16: 1.1749 1.1735  1  0
17: 1.1749 1.1731  1  0
18: 1.1557 1.1552  0  1
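
For anyone who wants to reproduce this, the data above as a runnable data frame:

df <- data.frame(
  Volume = c(0.5367, 0.8645, 0.8590, 1.1457, 0.8573, 0.5694, 1.2368, 0.9662, 1.3550,
             1.4102, 1.1132, 1.4535, 1.0437, 1.1475, 1.1790, 1.1749, 1.1749, 1.1557),
  Weight = c(0.5367, 0.8508, 0.8585, 1.1413, 0.8568, 0.5633, 1.2343, 0.9593, 1.3412,
             1.3995, 1.1069, 1.3923, 1.0344, 1.1447, 1.1748, 1.1735, 1.1731, 1.1552),
  V1 = c(0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0),
  V2 = c(1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1)
)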

Example Picture

    Volume Weight V1 V2 
 1: 0.5367 0.5367  0  1
 4: 1.1457 1.1413  1  0
 6: 0.5694 0.5633  0  1
 7: 1.2368 1.2343  1  0
 8: 0.9662 0.9593  0  1
10: 1.4102 1.3995  1  0
11: 1.1132 1.1069  0  1
12: 1.4535 1.3923  1  0
14: 1.1475 1.1447  0  1
15: 1.1790 1.1748  1  0
18: 1.1557 1.1552  0  1
asked Mar 09 '23 by Jimmy

2 Answers

You could use library(data.table):

# keep, within each run of V1 (rleid), the max-Volume row when V1 == 1 and every row when V1 == 0
setDT(df)[, .SD[(Volume == max(Volume) & V1 == 1) | V1 == 0], by = rleid(V1)][]

Edit:

Concerning the column dropping problem, here is the adapted trick from akrun:

setDT(df)[df[, .I[(Volume == max(Volume) & V1 == 1) | V1 == 0], rleid(V1)]$V1][]
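
A brief note on the $V1 in that line (my reading, not part of the original answer): the inner df[, .I[...], rleid(V1)] call returns a two-column data.table whose unnamed index column gets data.table's default name V1, so $V1 extracts the row numbers, not the original V1 column. The same idea written in two steps:

library(data.table)
# compute, per run of V1, the row numbers (.I) that satisfy the condition ...
idx <- setDT(df)[, .I[(Volume == max(Volume) & V1 == 1) | V1 == 0], by = rleid(V1)]$V1
# ... then subset the original table once, so no grouping column is added to the result
df[idx]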
answered Mar 11 '23 by Tonio Liebrand


We can use rle from base R:

# give each run of identical V1 values its own group id
grp <- inverse.rle(within.list(rle(df$V1), values <- seq_along(values)))
# keep only the rows whose Volume equals the maximum Volume of their run
df[with(df, ave(Volume, grp, FUN = max) == Volume), ]
#   Volume Weight V1 V2
#1  0.5367 0.5367  0  1
#4  1.1457 1.1413  1  0
#6  0.5694 0.5633  0  1
#7  1.2368 1.2343  1  0
#8  0.9662 0.9593  0  1
#10 1.4102 1.3995  1  0
#11 1.1132 1.1069  0  1
#12 1.4535 1.3923  1  0
#14 1.1475 1.1447  0  1
#15 1.1790 1.1748  1  0
#18 1.1557 1.1552  0  1

NOTE: We used a data.frame, not a data.table, as the input data.
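
If the within.list/inverse.rle construction above looks opaque, an equivalent way to build the same run id in base R is a cumulative sum over the change points of V1 (a sketch, not part of the original answer):

# start a new group whenever V1 differs from the previous row;
# cumsum over the change points yields one id per run of V1
grp <- cumsum(c(TRUE, df$V1[-1] != df$V1[-nrow(df)]))
# same filtering as above: keep rows whose Volume is the maximum of their run
df[with(df, ave(Volume, grp, FUN = max) == Volume), ]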

answered Mar 11 '23 by akrun