I have a data.frame ystr:
v1
1 a
2 B
3 B
4 C
5 d
6 a
7 B
8 D
I want to find the start and end of each group of letters in CAPS so my output would be:
groupId startPos endPos
1 1 2 4
2 2 7 8
I was able to do it with a for loop by looking at each element in order and comparing it to the one before as follows:
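currentGroupId <- 0
startCounter <- 0  # 1 while inside a run of upper-case letters
mygroups <- data.frame(groupId = integer(0), startPos = integer(0), endPos = integer(0))
for (i in seq_len(nrow(ystr))) {
  if (grepl("[[:upper:]]", ystr[i, 1])) {
    if (startCounter == 0) {
      currentGroupId <- currentGroupId + 1
      startCounter <- 1
      mygroups[currentGroupId, ] <- c(currentGroupId, i, 0)
    }
  } else if (startCounter == 1) {
    startCounter <- 0
    mygroups[currentGroupId, 3] <- i - 1
  }
}
# Close a group that runs through the last row
if (startCounter == 1) mygroups[currentGroupId, 3] <- nrow(ystr)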
Is there a simple way of doing this in R?
This might be similar to the question "Mark start and end of groups", but I could not figure out how it would apply in this case.
You can do this by calculating the run-length encoding (rle) of a binary indicator for whether each entry is upper case, determined by checking whether the entry is equal to itself after conversion to upper case.
with(rle(d[,1] == toupper(d[,1])),
data.frame(start=cumsum(lengths)[values]-lengths[values]+1,
end=cumsum(lengths)[values]))
# start end
# 1 2 4
# 2 7 8
You can see other examples of the use of rle by looking at Stack Overflow answers using this command.
Data:
d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
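To make the one-liner above easier to follow, here is a step-by-step sketch of the same rle idea. The intermediate names (v, is_upper, runs, run_end, run_start, groups) are only illustrative, and the groupId column is added here to match the output format requested in the question.

```r
# Same data as in the question
d <- data.frame(v1 = c("a", "B", "B", "C", "d", "a", "B", "D"))
v <- as.character(d$v1)  # guard against v1 being a factor

# TRUE where the entry equals its own upper-case form
is_upper <- v == toupper(v)

# Run-length encoding: consecutive equal values collapse into runs
runs <- rle(is_upper)
# runs$lengths: 1 3 2 2
# runs$values:  FALSE TRUE FALSE TRUE

# The cumulative sum of the run lengths is the last position of each run
run_end   <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# Keep only the upper-case runs and number them
groups <- data.frame(
  groupId  = seq_len(sum(runs$values)),
  startPos = run_start[runs$values],
  endPos   = run_end[runs$values]
)
groups
#   groupId startPos endPos
# 1       1        2      4
# 2       2        7      8
```

Note that the comparison with toupper() also treats digits and punctuation as "upper case", since they are unchanged by the conversion; for data that contains only letters, as here, that makes no difference.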
You can use the IRanges package, which can find the consecutive ranges for you: build a width-1 range for each upper-case position, then let reduce() merge adjacent ranges.
d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
d.idx <- which(d$v1 %in% LETTERS)
d.idx
# [1] 2 3 4 7 8
library(IRanges)
d.idx.ir <- IRanges(d.idx, d.idx)
reduce(d.idx.ir)
# IRanges of length 2
# start end width
# [1] 2 4 3
# [2] 7 8 2
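If installing IRanges is not an option, the same "merge consecutive positions" idea can be sketched in base R. This is a substitute technique, not how IRanges works internally; the names idx, run_id, and ranges are illustrative, and the sketch assumes at least one upper-case entry exists.

```r
d <- data.frame(v1 = c("a", "B", "B", "C", "d", "a", "B", "D"))

# Positions of the upper-case entries
idx <- which(d$v1 %in% LETTERS)            # 2 3 4 7 8

# A new run starts wherever the gap to the previous position is not 1
run_id <- cumsum(c(TRUE, diff(idx) != 1))  # 1 1 1 2 2

# First and last position within each run
ranges <- data.frame(
  start = as.vector(tapply(idx, run_id, min)),
  end   = as.vector(tapply(idx, run_id, max))
)
ranges
#   start end
# 1     2   4
# 2     7   8
```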