Find start and end of ranges where data is upper case

Tags: r, aggregate

I have a data.frame ystr:

    v1
1    a
2    B
3    B
4    C
5    d
6    a
7    B
8    D

I want to find the start and end of each group of letters in CAPS so my output would be:

    groupId startPos    endPos
1   1       2           4
2   2       7           8

I was able to do it with a for loop by looking at each element in order and comparing it to the one before as follows:

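# Result table, filled in one row per group
mygroups <- data.frame(groupId = integer(0), startPos = integer(0), endPos = integer(0))
currentGroupId <- 0
startCounter <- 0  # 1 while inside a run of upper-case letters

for (i in seq_len(nrow(ystr))) {
  if (grepl("[[:upper:]]", ystr[i, 1])) {
    if (startCounter == 0) {
      # First upper-case element of a new group: record its start
      currentGroupId <- currentGroupId + 1
      startCounter <- 1
      mygroups[currentGroupId, ] <- c(currentGroupId, i, 0)
    }
  } else if (startCounter == 1) {
    # First lower-case element after a group: the group ended one row earlier
    startCounter <- 0
    mygroups[currentGroupId, 3] <- i - 1
  }
}
# Close a group that runs through the last row
if (startCounter == 1) mygroups[currentGroupId, 3] <- nrow(ystr)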

Is there a simple way of doing this in R?

This might be similar to Mark start and end of groups but I could not figure out how it would apply in this case.

asked Dec 24 '22 11:12 by user2909302


2 Answers

You can do this by calculating the run-length encoding (rle) of the binary indicator for whether your data is upper case, as determined by whether the data is equal to itself when it's converted to upper case.

with(rle(d[,1] == toupper(d[,1])),
     data.frame(start=cumsum(lengths)[values]-lengths[values]+1,
                end=cumsum(lengths)[values]))
#   start end
# 1     2   4
# 2     7   8

You can find other examples of rle by looking at Stack Overflow answers that use this command.
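To see why the cumsum arithmetic works, it may help to look at the intermediate values one at a time. The following is only a sketch of the same idea spelled out step by step. The names is_upper, r, run_end, run_start and res are made up for illustration, and the groupId column is added to match the output requested in the question.

```r
# Same data as in this answer; stringsAsFactors = FALSE so toupper() sees plain strings
d <- data.frame(v1 = c("a", "B", "B", "C", "d", "a", "B", "D"),
                stringsAsFactors = FALSE)

# TRUE where the element equals its own upper-case version
is_upper <- d$v1 == toupper(d$v1)

# Run-length encoding: consecutive equal values collapse into (length, value) pairs
r <- rle(is_upper)
r$lengths  # 1 3 2 2
r$values   # FALSE TRUE FALSE TRUE

# The cumulative sum of the run lengths is the last position of every run
run_end <- cumsum(r$lengths)          # 1 4 6 8
# Stepping back by the run length (plus one) gives the first position
run_start <- run_end - r$lengths + 1  # 1 2 5 7

# Keep only the runs whose value is TRUE, i.e. the upper-case groups
res <- data.frame(groupId  = seq_len(sum(r$values)),
                  startPos = run_start[r$values],
                  endPos   = run_end[r$values])
res
```

Note that the toupper comparison also treats digits and punctuation as "upper case", since they are unchanged by toupper; that is fine for letter-only data like this.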

Data:

d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
answered Feb 01 '23 08:02 by josliber


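You can use the IRanges package (from Bioconductor). The idea is to treat each upper-case position as a range of width 1 and then merge adjacent ranges, which leaves one range per consecutive run.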

d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
d.idx <- which(d$v1 %in% LETTERS)
d.idx
# [1] 2 3 4 7 8

library(IRanges)
d.idx.ir <- IRanges(d.idx, d.idx)
reduce(d.idx.ir)
# IRanges of length 2
#     start end width
# [1]     2   4     3
# [2]     7   8     2
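If installing IRanges is not an option, the same "merge consecutive indices" idea can be approximated in base R. This is a different technique from reduce(), shown only as a rough sketch: a new run is assumed to start wherever the gap between neighbouring indices is not 1. The names run_id and ranges are hypothetical, and the code assumes at least one upper-case element exists.

```r
d <- data.frame(v1 = c("a", "B", "B", "C", "d", "a", "B", "D"),
                stringsAsFactors = FALSE)
d.idx <- which(d$v1 %in% LETTERS)  # 2 3 4 7 8

# A new run begins at the first index and wherever the step from the
# previous index is not exactly 1; cumsum turns those flags into run ids
run_id <- cumsum(c(TRUE, diff(d.idx) != 1))  # 1 1 1 2 2

# The smallest and largest index within each run are its start and end
ranges <- data.frame(start = as.vector(tapply(d.idx, run_id, min)),
                     end   = as.vector(tapply(d.idx, run_id, max)))
ranges
```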
answered Feb 01 '23 07:02 by Ven Yao