
Picking rows with specific column matching conditions

Tags: dataframe, r
I have a data.frame with 2291 rows and 4 columns. I want to pick the rows whose column 3 matches column 2 of the next row, continuing for as long as consecutive rows keep matching, and then start again at the next row where a match occurs.

I tried a for loop over 1:nrow(df), but it isn't accurate because (I think) i doesn't actually restart from the matched row.

My current code is like this:

test <- NULL 
x <- c()
y <- c()

for(i in 1:nrow(df)){
    if(df[i,3]==df[i+1,2]){
        x <- df[i,]
        y <- df[i+1,]
        i = i+1 #stuck at this
    }
    test <- rbind(test,x,y)
}

Sample data looks like this:

X  2670000  3750000    C
X  3830000  8680000   E3
X  8680000 10120000 E1-A
X 10120000 11130079    D
X 11170079 11810079   E3
X 11810079 12810079 E2-A
X 12810079 13530079   E3
X 13530079 14050079   E3
X 14050079 15330079    A
X 15330079 16810079 E2-A
X 16810079 17690079 E2-A

What I want is:

X  3830000  8680000   E3
X  8680000 10120000 E1-A
X 10120000 11130079    D

X 11170079 11810079   E3
X 11810079 12810079 E2-A
X 12810079 13530079   E3
X 13530079 14050079   E3
X 14050079 15330079    A
X 15330079 16810079 E2-A
X 16810079 17690079 E2-A

I'm actually interested in the column 4 values. Whenever df[i,3] is not equal to df[i+1,2], can the code be updated so that the column 4 values collected up to that point are stored in a vector?

For example: The result for this sample would be:

vector_1
"E3" "E1-A" "D"

vector_2
"E3" "E2-A" "E3" "E3" "A" "E2-A" "E2-A" 

What I get so far is:

X  3830000  8680000   E3
X  8680000 10120000 E1-A
X  8680000 10120000 E1-A
X 10120000 11130079    D
X  8680000 10120000 E1-A
X 10120000 11130079    D
X 11170079 11810079   E3
X 11810079 12810079 E2-A
X 11810079 12810079 E2-A
X 12810079 13530079   E3

Going from row 1 to the last row of my df, I want to keep adding column 4 values to a vector as long as column 3 of row i matches column 2 of row i+1. Once that condition breaks, I want to start a new vector the next time the condition is met and keep storing column 4 values there.

Thank you!

rishi asked Jan 07 '19 16:01


1 Answer

An easy way is to use the lead function from the dplyr package.

lead(x, n = 1L, default = NA, order_by = NULL, ...)
Find the "next" or "previous" values in a vector. Useful for comparing values ahead of or behind the current values.
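
To make the shift concrete, here is a rough base R sketch of the n = 1 case. shift_next is a hypothetical helper written for illustration, not a dplyr function, and it only approximates what lead does:

```r
# Hypothetical helper (not part of dplyr): move every element one position
# towards the front and pad the end with `default`.
shift_next <- function(x, default = NA) {
  c(x[-1], default)
}

a <- 1:5
b <- c(2, 999, 4, 5, 999)

nxt <- shift_next(a)       # the "next" value of a for every row
keep <- which(b == nxt)    # which() drops the NA produced for the last row
```

Here keep holds the positions where b equals the following value of a. Wrapping the comparison in which() is one way to handle the last row, which has no successor.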

This also allows you to avoid the for-loop entirely. Since you haven't named your columns in the question, I'll use another example:

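library(dplyr)
df <- data.frame(a = 1:5, b = c(2, 999, 4, 5, 999))

print(df) # In this example, we want to keep the 1st, 3rd, and 4th rows.
#   a   b
# 1 1   2
# 2 2 999
# 3 3   4
# 4 4   5
# 5 5 999

# Keep the rows where b equals a in the next row. The last row has no
# successor, so lead() pads it with default = FALSE, which is coerced to 0
# alongside the integers in a and therefore matches no b in this example.
matching_df <- df[df$b == dplyr::lead(df$a, 1, default = FALSE), ]
print(matching_df)
#   a b
# 1 1 2
# 3 3 4
# 4 4 5

# The complement: rows where b differs from a in the next row.
non_matching_df <- df[df$b != dplyr::lead(df$a, 1, default = FALSE), ]
print(non_matching_df)
#   a   b
# 2 2 999
# 5 5 999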
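
For the data in the question, the same comparison can also be used to split column 4 into one vector per run of matching rows. The following is a minimal sketch in base R (it compares each row with the previous one instead of calling lead). The column names chr, start, end and label are assumptions made for illustration, since the question does not name its columns:

```r
# Sample data from the question; the column names are assumed.
df <- data.frame(
  chr   = "X",
  start = c(2670000, 3830000, 8680000, 10120000, 11170079, 11810079,
            12810079, 13530079, 14050079, 15330079, 16810079),
  end   = c(3750000, 8680000, 10120000, 11130079, 11810079, 12810079,
            13530079, 14050079, 15330079, 16810079, 17690079),
  label = c("C", "E3", "E1-A", "D", "E3", "E2-A", "E3", "E3", "A",
            "E2-A", "E2-A"),
  stringsAsFactors = FALSE
)

n <- nrow(df)

# TRUE where a row's start equals the previous row's end.
# Row 1 has no predecessor, so it is never linked.
linked <- c(FALSE, df$start[-1] == df$end[-n])

# A new run begins at every row that is not linked to the row before it.
run_id <- cumsum(!linked)

# One character vector of column 4 values per run.
runs <- split(df$label, run_id)

# Drop single-row runs, i.e. rows that matched neither neighbour.
runs <- unname(runs[lengths(runs) > 1])
```

With this sample, runs[[1]] and runs[[2]] correspond to vector_1 and vector_2 from the question. The rows themselves can be recovered the same way with split(df, run_id).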

radiumhead answered Nov 03 '22 04:11