
R For loop delete range of rows from one string to a second string in a column

Tags: for-loop, r, subset

I am trying to delete sequences of rows from a data frame. Each sequence begins with a known string and ends with a known string, but the content and number of the intervening rows are unknown. I would like to apply this across the entire data frame.

For example, given the data frame below, I would like to remove the rows from every StringA through the following StringB (inclusive), but retain the rows after StringB up to the next occurrence of StringA. In this example, that means removing the rows containing StringA, unknownC, unknownD, unknownS and StringB, retaining unknownK and unknownR, then removing StringA, unknownU, unknownP and StringB, and retaining unknownT.

Column 1   Column 2
StringA           1
unknownC          9
unknownD         11
unknownS          5
StringB           7
unknownK          6
unknownR          1
StringA          76
unknownU          2
unknownP         41
StringB           3
unknownT          9

I tried df2 <- df[1:which(df[,1]=="StringA")-1,], which is not correct, but I am at a loss as to what other approach to try. Thank you in advance for any guidance.

asked May 27 '16 at 01:05 by SPZ


2 Answers

You can try something like this: use the Map function to pair each StringA position with the corresponding StringB position, and build the index of rows to remove from those ranges:

indexToRemove <- unlist(Map(`:`, which(df$`Column 1` == "StringA"), 
                                 which(df$`Column 1` == "StringB")))

df[-indexToRemove, ]
   Column 1 Column 2
6  unknownK        6
7  unknownR        1
12 unknownT        9
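
A minimal sketch of the same idea wrapped in a function, for readers who want some basic input checks. The helper name drop_blocks, its arguments, and the column name Column1 (without a space) are assumptions for illustration and are not part of the original answer. It checks only that the two markers occur equally often and that each start precedes its matching end, and it guards the case where no marker is found, since indexing with -integer(0) would return zero rows rather than the full data frame.

```r
# Toy data mirroring the question's example (Column1 is named without a
# space here purely for convenience; that is an assumption).
df <- data.frame(
  Column1 = c("StringA", "unknownC", "unknownD", "unknownS", "StringB",
              "unknownK", "unknownR", "StringA", "unknownU", "unknownP",
              "StringB", "unknownT"),
  Column2 = c(1, 9, 11, 5, 7, 6, 1, 76, 2, 41, 3, 9),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop every start..end block, inclusive.
# Assumes the markers are paired, with each start before its matching end.
drop_blocks <- function(data, col, start, end) {
  starts <- which(data[[col]] == start)
  ends   <- which(data[[col]] == end)
  # Basic sanity checks only; unpaired markers stop with an error.
  stopifnot(length(starts) == length(ends), all(starts < ends))
  # Map(`:`, starts, ends) builds one integer range per start/end pair.
  idx <- unlist(Map(`:`, starts, ends))
  # data[-integer(0), ] would select zero rows, so return early if no match.
  if (length(idx) == 0) return(data)
  data[-idx, , drop = FALSE]
}

result <- drop_blocks(df, "Column1", "StringA", "StringB")
```

If the markers may be unpaired or out of order, the check stops with an error instead of silently removing the wrong rows; the loop-based answer below tolerates such input.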

Data:

df <- structure(list(`Column 1` = structure(c(1L, 3L, 4L, 8L, 2L, 5L, 
7L, 1L, 10L, 6L, 2L, 9L), .Label = c("StringA", "StringB", "unknownC", 
"unknownD", "unknownK", "unknownP", "unknownR", "unknownS", "unknownT", 
"unknownU"), class = "factor"), `Column 2` = c(1L, 9L, 11L, 5L, 
7L, 6L, 1L, 76L, 2L, 41L, 3L, 9L)), .Names = c("Column 1", "Column 2"
), class = "data.frame", row.names = c(NA, -12L))

answered Nov 12 '22 at 23:11 by Psidom


You can use a for loop. Although this will be slower than the vectorised solutions posted, it is easy to adapt to related problems and is robust against unexpected input data.

Notes:

  1. This method is robust against oddities in the input data: it does not require StringA and StringB to be strictly alternating and paired, nor does it assume that StringA always occurs before StringB. Every time it encounters StringA, it deletes rows until it encounters StringB.
  2. On the downside, this method can be slow on very large data frames, because it grows a data frame inside the loop, and each rbind call copies the rows accumulated so far.

The code:

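# my.df is defined under "Some data" below; create it before running this.
keep.line <- TRUE
out.df <- data.frame()

for (i in seq_len(nrow(my.df))) {
  # StringA starts a block to delete (the StringA row itself is dropped)
  if (my.df$Column1[i] == "StringA") keep.line <- FALSE
  if (keep.line) out.df <- rbind(out.df, my.df[i, ])
  # StringB ends the block; the rows after it are kept again
  if (my.df$Column1[i] == "StringB") keep.line <- TRUE
}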

out.df
##    Column1    Column2
##    unknownK  0.3679608
##    unknownR -0.8867749
##    unknownT  1.6277386
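
The slowdown from growing a data frame can be avoided while keeping the same loop logic. The following is a sketch under that assumption, not the answer's original code: it records the keep/drop decision in a preallocated logical vector (named keep here, a hypothetical name) and subsets the data frame once after the loop. Column2 is filled with arbitrary integers instead of random numbers so that the example is reproducible.

```r
# Toy data in the shape used by this answer (Column2 values are arbitrary).
my.df <- data.frame(
  Column1 = c("StringA", "unknownC", "unknownD", "unknownS", "StringB",
              "unknownK", "unknownR", "StringA", "unknownU", "unknownP",
              "StringB", "unknownT"),
  Column2 = seq_len(12),
  stringsAsFactors = FALSE
)

keep <- logical(nrow(my.df))  # preallocated, all FALSE
keep.line <- TRUE

for (i in seq_len(nrow(my.df))) {
  # Same state machine as the loop above: StringA switches keeping off...
  if (my.df$Column1[i] == "StringA") keep.line <- FALSE
  keep[i] <- keep.line
  # ...and StringB switches it back on for the rows that follow.
  if (my.df$Column1[i] == "StringB") keep.line <- TRUE
}

# Subset once, after the loop, instead of calling rbind on every iteration.
out.df <- my.df[keep, , drop = FALSE]
```

Because the decision for each row is stored rather than applied immediately, the loop itself only writes to a fixed-length vector, and the data frame is copied a single time at the end.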

Some data:

Column1 <-c( 
"StringA" ,    
"unknownC",    
"unknownD",   
"unknownS",   
"StringB" ,   
"unknownK",   
"unknownR",   
"StringA" ,   
"unknownU",   
"unknownP",   
"StringB" ,   
"unknownT")

my.df <- data.frame(Column1, Column2 = rnorm(12), stringsAsFactors = FALSE)

answered Nov 13 '22 at 00:11 by dww