R: How to parse a vector of values with three rows, then the parsed output should be a list corresponding to the row

Tags: string, parsing, r

I have a character vector of three values: "S0027A-E", "S0028A-D", "S0029A-C"; hence:

input_string <- as.vector(c("S0027A-E", "S0028A-D", "S0029A-C"))

The parsed output must be a list with one entry per value in the input vector, each entry holding the comma-separated expanded strings, such that:

input_string parsed_strings
"S0027A-E" "S0027A", "S0027B", "S0027C", "S0027D", "S0027E"
"S0028A-D" "S0028A", "S0028B", "S0028C", "S0028D"
"S0029A-C" "S0029A", "S0029B", "S0029C"

I have written a first version of the parsing script, but its output is a single flat vector of 12 elements: "S0027A" "S0027B" "S0027C" "S0027D" "S0027E" "S0028A" "S0028B" "S0028C" "S0028D" "S0029A" "S0029B" "S0029C". Everything ends up in one vector instead of one entry per input value as shown in the table.

# length of input_string
len_string = length(input_string)

# Extract the prefix, start, and end letters
library(stringr)

parsed_strings <- as.character()

for (i in 1:len_string){
  prefix <- str_extract(input_string[[i]][1], "^[A-Z]\\d{4}")
  range_part <- str_extract(input_string[[i]][1], "[A-Z]-[A-Z]$")
  start_letter <- substr(range_part, 1, 1)
  end_letter <- substr(range_part, 3, 3)
  output <- paste0(prefix, LETTERS[match(start_letter, LETTERS):match(end_letter, LETTERS)])
  parsed_strings <- c(parsed_strings, output)
}

The output must be as shown in the table, so I would greatly appreciate any advice on how to fix my code. Thanks in advance!

Wilfredo de Vera asked Sep 02 '25 15:09


2 Answers

You can try

lapply(
    strsplit(input_string, split = "(?<=\\d)(?=\\D)|-", perl = TRUE),
    \(x) {
        paste0(x[1], LETTERS[LETTERS >= x[2] & LETTERS <= x[3]])
    }
)

which gives

[[1]]
[1] "S0027A" "S0027B" "S0027C" "S0027D" "S0027E"

[[2]]
[1] "S0028A" "S0028B" "S0028C" "S0028D"

[[3]]
[1] "S0029A" "S0029B" "S0029C"
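
To make the intermediate step visible, here is a minimal sketch that first stores the result of the split and then builds the letter range by position with match() rather than by string comparison. The names parts and parsed are introduced here for illustration only. The match() variant is an optional substitution: comparing strings with >= and <= follows the collation order of the current locale, which is alphabetical for A to Z in most locales but not guaranteed in all of them.

```r
input_string <- c("S0027A-E", "S0028A-D", "S0029A-C")

# Split between the last digit and the first letter, and at the hyphen.
# Each element becomes c(prefix, start letter, end letter),
# e.g. c("S0027", "A", "E") for the first input.
parts <- strsplit(input_string, split = "(?<=\\d)(?=\\D)|-", perl = TRUE)

parsed <- lapply(parts, function(x) {
  # Positions of the start and end letters within LETTERS
  idx <- match(x[2], LETTERS):match(x[3], LETTERS)
  paste0(x[1], LETTERS[idx])
})

parsed
```

The result is the same list of three character vectors as above.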

If you want the output presented in a data frame, you can try

within(
    data.frame(input_string),
    parsed_strings <- lapply(
        strsplit(input_string, split = "(?<=\\d)(?=\\D)|-", perl = TRUE),
        \(x) {
            paste0(x[1], LETTERS[LETTERS >= x[2] & LETTERS <= x[3]])
        }
    )
)

which shows

  input_string                         parsed_strings
1     S0027A-E S0027A, S0027B, S0027C, S0027D, S0027E
2     S0028A-D         S0028A, S0028B, S0028C, S0028D
3     S0029A-C                 S0029A, S0029B, S0029C
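
For comparison, the loop from the question can be kept almost unchanged. The sketch below assumes the flattening comes from c(), which concatenates each new vector onto one flat vector; assigning with [[i]] into a pre-allocated list keeps one entry per input instead. It uses base R's sub() in place of stringr::str_extract(), a substitution made here only to keep the example free of package dependencies.

```r
input_string <- c("S0027A-E", "S0028A-D", "S0029A-C")

# Pre-allocate a list with one slot per input value
parsed_strings <- vector("list", length(input_string))

for (i in seq_along(input_string)) {
  # Prefix: one capital letter followed by four digits
  prefix <- sub("^([A-Z]\\d{4}).*$", "\\1", input_string[i])
  # Range part: the trailing "X-Y"
  range_part <- sub("^.*([A-Z]-[A-Z])$", "\\1", input_string[i])
  start_letter <- substr(range_part, 1, 1)
  end_letter <- substr(range_part, 3, 3)
  # [[i]] stores the whole vector in slot i; c() would splice it into one flat vector
  parsed_strings[[i]] <- paste0(
    prefix,
    LETTERS[match(start_letter, LETTERS):match(end_letter, LETTERS)]
  )
}

# Label each entry with the input it came from
names(parsed_strings) <- input_string
parsed_strings
```

seq_along() is used instead of 1:len_string so that the loop body is skipped when the input is empty.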
ThomasIsCoding answered Sep 04 '25 05:09


A data.table approach

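Prerequisites: a function seqChar() that returns the sequence of letters between a start and an end character. The prefix is extracted inside the grouped call, since a prefix vector of length three defined outside it would be recycled across the letters and mix prefixes from different rows.

input_string <- c("S0027A-E", "S0028A-D", "S0029A-C") # Data

# Sequence of letters from a to b, e.g. seqChar("A", "C") gives "A" "B" "C"
seqChar <- function(a, b) { lett <- LETTERS
                            lett[which(lett == a):which(lett == b)] }

library(data.table)

# by = input_string makes input_string a single value inside j,
# so the inner sub() yields the one prefix that belongs to the current row
data.table(input_string)[, .(parsed_strings = 
  lapply(strsplit(sub(".*(\\D)-", "\\1", input_string), ""), \(x) 
    paste0(sub("(.*)\\D-\\D$", "\\1", input_string), seqChar(x[1], x[2])))), by = input_string]

output

   input_string                     parsed_strings
         <char>                             <list>
1:     S0027A-E S0027A,S0027B,S0027C,S0027D,S0027E
2:     S0028A-D        S0028A,S0028B,S0028C,S0028D
3:     S0029A-C               S0029A,S0029B,S0029C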
Andre Wildberg answered Sep 04 '25 05:09