
Isolate alphabetical strings within a larger string

Is there a way to isolate parts of a string that are in alphabetical order?

In other words, if you have a string like this: hjubcdepyvb

Could you pull out just the portion that is in alphabetical order, bcde?

I have thought about using the is.unsorted() function, but I'm not sure how to apply this to only a portion of a string.

tdm asked Mar 15 '17 20:03



3 Answers

Here's one way by converting to ASCII and back:

input <- "hjubcdepyvb"
spl_asc <- as.integer(charToRaw(input))       # Convert to ASCII
d1 <- diff(spl_asc) == 1                      # Find sequences
filt <- spl_asc[c(FALSE, d1) | c(d1, FALSE)]  # Only keep sequences (incl start and end)
rawToChar(as.raw(filt))                       # Convert back to character

#[1] "bcde"

Note that this concatenates every run of consecutive letters into a single string. For example, if the input is "abcxasdicfgaqwe", the output is "abcfg".
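
To see why the mask c(FALSE, d1) | c(d1, FALSE) keeps both ends of a run, it can help to look at the intermediate values. A minimal sketch on a shorter input; the string "xabcy" and the names `follows`, `leads` and `keep` are made up for illustration:

```r
input <- "xabcy"                        # hypothetical short input
codes <- as.integer(charToRaw(input))   # 120 97 98 99 121
d1 <- diff(codes) == 1                  # TRUE where the next character is the successor
follows <- c(FALSE, d1)                 # character directly follows the previous one
leads   <- c(d1, FALSE)                 # character is directly followed by its successor
keep <- follows | leads                 # in a run if either condition holds
result <- rawToChar(as.raw(codes[keep]))
print(result)
```

Shifting d1 one position to the right marks every character that continues a run, shifting it to the left marks every character that starts or extends one, and the OR of the two covers the whole run.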

If you want a separate string for each run, you could do the following:

input <- "abcxasdicfgaqwe"
spl_asc <- as.integer(charToRaw(input))
d1 <- diff(spl_asc) == 1
r <- rle(c(FALSE, d1) | c(d1, FALSE))                   # Find boundaries
cm <- cumsum(c(1, r$lengths))                           # Map these to string positions
substring(input, cm[-length(cm)], cm[-1] - 1)[r$values] # Extract matching strings

#[1] "abc" "fg"
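
One caveat, as far as I can tell: because the mask only records whether a character belongs to some run, two runs that touch (for example "abc" immediately followed by "ab") come out as one string. Below is a sketch of an alternative that starts a new group wherever the step between neighbours is not exactly 1. The function name `split_runs` is made up for illustration, and like the ASCII version above it treats any consecutive code points as a run, not only letters.

```r
split_runs <- function(s) {
  codes <- utf8ToInt(s)                        # one integer code point per character
  if (length(codes) < 2) return(character(0))  # a run needs at least two characters
  # Start a new group at every position whose step from the previous character is not +1
  grp <- cumsum(c(TRUE, diff(codes) != 1))
  # Turn each group of code points back into a string
  runs <- unname(vapply(split(codes, grp), intToUtf8, character(1)))
  runs[nchar(runs) > 1]                        # drop isolated characters
}

print(split_runs("abcxasdicfgaqwe"))
print(split_runs("abcab"))
```

On "abcab" this should give "abc" and "ab" as separate elements rather than a single "abcab".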

Finally, here is a way to do it with a regular expression:

input <- c("abcxasdicfgaqwe", "xufasiuxaboqdasdij", "abcikmcapnoploDEFgnm",
           "acfhgik")
(rg <- paste0("(", paste0(c(letters[-26], LETTERS[-26]),
                           "(?=", c(letters[-1], LETTERS[-1]), ")", collapse = "|"), ")+."))

#[1] "(a(?=b)|b(?=c)|c(?=d)|d(?=e)|e(?=f)|f(?=g)|g(?=h)|h(?=i)|i(?=j)|j(?=k)|
#k(?=l)|l(?=m)|m(?=n)|n(?=o)|o(?=p)|p(?=q)|q(?=r)|r(?=s)|s(?=t)|t(?=u)|u(?=v)|
#v(?=w)|w(?=x)|x(?=y)|y(?=z)|A(?=B)|B(?=C)|C(?=D)|D(?=E)|E(?=F)|F(?=G)|G(?=H)|
#H(?=I)|I(?=J)|J(?=K)|K(?=L)|L(?=M)|M(?=N)|N(?=O)|O(?=P)|P(?=Q)|Q(?=R)|R(?=S)|
#S(?=T)|T(?=U)|U(?=V)|V(?=W)|W(?=X)|X(?=Y)|Y(?=Z))+."

regmatches(input, gregexpr(rg, input, perl = TRUE))
#[[1]]
#[1] "abc" "fg" 
#
#[[2]]
#[1] "ab" "ij"
#
#[[3]]
#[1] "abc" "nop" "DEF"
#
#[[4]]
#character(0)

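This regular expression matches runs of consecutive letters that are either all upper case or all lower case (a run cannot mix cases). Each alternative matches one letter only if it is followed by its successor, and the trailing . picks up the last letter of the run. As demonstrated, it works on character vectors and returns a list with one vector of matches per input string. If a string has no match, its entry is character(0).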
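
The full pattern is hard to read, so here is the same construction restricted to the letters a to e. This is only an illustrative sketch; the names `small`, `rg_small` and `matches` and the test string are made up.

```r
small <- letters[1:5]                          # "a" "b" "c" "d" "e"
# Each alternative matches a letter only when its successor comes next (lookahead)
rg_small <- paste0("(", paste0(small[-5], "(?=", small[-1], ")", collapse = "|"), ")+.")
print(rg_small)

# The trailing "." consumes the final letter of each run
matches <- regmatches("xabcdyde", gregexpr(rg_small, "xabcdyde", perl = TRUE))[[1]]
print(matches)
```

Here the lookaheads consume "abc" and the final . adds "d", then the second match does the same for "de".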

Nick Kennedy answered Nov 01 '22 19:11


Using factor integer conversion:

input <- "hjubcdepyvb"
d1 <- diff(as.integer(factor(unlist(strsplit(input, "")), levels = letters))) == 1
filt <- c(FALSE, d1) | c(d1, FALSE)
paste(unlist(strsplit(input, ""))[filt], collapse = "")
# [1] "bcde"
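
One thing to watch with this approach, as far as I can tell: any character that is not in `letters` (upper case, digits, punctuation) becomes NA in the factor conversion, and the NA can leak into the mask and the pasted result. A hedged sketch of one way to guard against that, using a made-up mixed-case input:

```r
input <- "abCde"                                    # hypothetical mixed-case input
chars <- strsplit(input, "")[[1]]
pos <- as.integer(factor(chars, levels = letters))  # 1 2 NA 4 5: "C" is not in letters
d1 <- diff(pos) == 1                                # comparisons involving NA yield NA
d1[is.na(d1)] <- FALSE                              # treat unknown steps as not consecutive
filt <- c(FALSE, d1) | c(d1, FALSE)
result <- paste(chars[filt], collapse = "")
print(result)
```

Replacing the NA entries with FALSE simply means that a character outside `letters` never counts as part of a run.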

zx8754 answered Nov 01 '22 20:11


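myf = function(x){
    x = unlist(strsplit(x, ""))      # split into single characters
    ind = charmatch(x, letters)      # alphabet position (lower-case letters only)
    d = c(0, diff(ind))
    d[d != 1] = 0                    # 1 where a character directly follows its predecessor
    # Also flag the first character of each run (a 0 immediately followed by a 1)
    d = d + c(d[-length(d)] == 0 & d[-1] == 1, FALSE)
    # Group the flagged positions by run, then paste each group back together
    d = split(seq_along(d)[d != 0], with(rle(d), rep(seq_along(values), lengths))[d != 0])
    return(sapply(d, function(a) paste(x[a], collapse = "")))
}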

myf(x = "hjubcdepyvblltpqrs")
#     2      4 
#"bcde" "pqrs" 

d.b answered Nov 01 '22 18:11