Find start and end of ranges where data is upper case

Question

I have a data.frame ystr:

I want to find the start and end of each group of letters in CAPS so my output would be:

  groupId startPos endPos
1       1        2      4
2       2        7      8

I was able to do it with a for loop by looking at each element in order and comparing it to the one before as follows:

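# Result table: one row per run of upper-case letters
mygroups <- data.frame(groupId = integer(0), startPos = integer(0), endPos = integer(0))
currentGroupId <- 0
startCounter <- 0  # 1 while inside an upper-case run, 0 otherwise

for (i in seq_len(nrow(ystr))) {
  if (grepl("[[:upper:]]", ystr[i, 1])) {
    if (startCounter == 0) {
      # First upper-case element of a new run: record its start position
      currentGroupId <- currentGroupId + 1
      startCounter <- 1
      mygroups[currentGroupId, ] <- c(currentGroupId, i, 0)
    }
  } else if (startCounter == 1) {
    # First lower-case element after a run: the run ended on the previous row
    startCounter <- 0
    mygroups[currentGroupId, 3] <- i - 1
  }
}
# Close a run that extends to the last row
if (startCounter == 1) mygroups[currentGroupId, 3] <- nrow(ystr)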

Is there a simple way of doing this in R?

This might be similar to Mark start and end of groups but I could not figure out how it would apply in this case.

josliber · Accepted Answer

You can do this by calculating the run-length encoding (rle) of the binary indicator for whether your data is upper case, as determined by whether the data is equal to itself when it's converted to upper case.

with(rle(d[,1] == toupper(d[,1])),
     data.frame(start=cumsum(lengths)[values]-lengths[values]+1,
                end=cumsum(lengths)[values]))
#   start end
# 1     2   4
# 2     7   8

You can find other examples of rle in use by searching Stack Overflow answers that mention it.

Data:

d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
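To see why the one-liner above works, it can help to look at the intermediate pieces. The following is only an unpacked sketch of the same rle approach; the names is_upper, runs, run_start, run_end and res are illustrative and not part of the original answer, and the groupId column is added here to match the output format requested in the question.

```r
# Same sample data as in the answer
d <- data.frame(v1 = c("a", "B", "B", "C", "d", "a", "B", "D"),
                stringsAsFactors = FALSE)

# TRUE where an element is unchanged by toupper(), i.e. already upper case
is_upper <- d$v1 == toupper(d$v1)   # FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE

# Run-length encoding: consecutive equal values collapsed into runs
runs <- rle(is_upper)
runs$lengths                        # 1 3 2 2
runs$values                         # FALSE TRUE FALSE TRUE

# The end of each run is the cumulative length; the start follows from it
run_end   <- cumsum(runs$lengths)          # 1 4 6 8
run_start <- run_end - runs$lengths + 1    # 1 2 5 7

# Keep only the runs whose value is TRUE (the upper-case runs)
res <- data.frame(groupId  = seq_len(sum(runs$values)),
                  startPos = run_start[runs$values],
                  endPos   = run_end[runs$values])
res
#   groupId startPos endPos
# 1       1        2      4
# 2       2        7      8
```

One caveat of the comparison with toupper(): elements that contain no letters at all (digits or punctuation, for example) are unchanged by toupper() and would therefore also count as upper case.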

Ven Yao · Answer

You can use the IRanges package. The task amounts to finding ranges of consecutive indices, which reduce() does by merging adjacent ranges.

d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
d.idx <- which(d$v1 %in% LETTERS)
d.idx
# [1] 2 3 4 7 8

library(IRanges)
d.idx.ir <- IRanges(d.idx, d.idx)
reduce(d.idx.ir)
# IRanges of length 2
#     start end width
# [1]     2   4     3
# [2]     7   8     2
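If installing IRanges is not an option, the same idea of merging consecutive indices can be sketched in base R. This is a different technique from the reduce() call above, not what the answer itself uses: it labels each index with a run number using diff() and cumsum(). The names idx, run_id and ranges are illustrative, and the sketch assumes that at least one element is upper case.

```r
d <- data.frame(v1 = c("a", "B", "B", "C", "d", "a", "B", "D"),
                stringsAsFactors = FALSE)

# Positions of the upper-case letters
idx <- which(d$v1 %in% LETTERS)            # 2 3 4 7 8

# A new run starts wherever the gap to the previous index is not 1
run_id <- cumsum(c(TRUE, diff(idx) != 1))  # 1 1 1 2 2

# The smallest and largest index within each run give its start and end
ranges <- data.frame(start = as.vector(tapply(idx, run_id, min)),
                     end   = as.vector(tapply(idx, run_id, max)))
ranges
#   start end
# 1     2   4
# 2     7   8
```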

Tags:

r

aggregate

user2909302
