Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to split all strings in a column AND include prefix in all the new data

Tags:

regex

r

I have a dataset that has data that was entered in one column that should have been put into three separate columns. I'm hoping to separate it into three columns based on a backslash, but each split must include a character prefix which is only found on the first part of the string, and a character suffix which is only found at the end.

So something like "PC211/212.5(C)/664F" has the prefix "PC", and the suffix "F". The prefix is always 2 letters, and the suffix is always 1, and they are always characters. The prefix is always followed by a numeric code, and the suffix is always preceded by either a number or an end parentheses

My data is in a very large dataframe, so I'd like to be able to call it out by column. Here's a reproducible example of a tiny subset of the data I'm working with:

df <- data.frame("code" = c("PC211/212.5(C)/664F", "VC23152(A)/23550F", "PC459/460(B)M", "PC187(A)/664F"), stringsAsFactors = FALSE)

I'd like it to return something like:

df_id_like <- data.frame("code" = c("PC211/212.5(C)/664F", "VC23152(A)/23550F", "PC459/460(B)M", "PC187(A)/664F"), 
"code_1" = c("PC211F", "VC23152(A)F", "PC459M", "PC187F"), 
"code_2" = c("212.5(C)F", "VC23550F", "PC460(B)M", "PC664F"), 
"code_3" = c("PC664F", NA, NA, NA), 
stringsAsFactors = FALSE)

I think the solution may need regex, but am totally open to a solution that doesn't!

like image 564
cskn Avatar asked May 13 '19 01:05

cskn


People also ask

How do I split a column into values in R?

To split a column into multiple columns in the R Language, We use the str_split_fixed() function of the stringr package library. The str_split_fixed() function splits up a string into a fixed number of pieces.

How do I split a string by a certain length in Excel?

Select Home > Split Column > By Number of Characters. The Split a column by Number of Characters dialog box appears. In the Number of characters textbox, enter the number of characters used to split the text column.

How do I split a string into characters in R?

Splitting Strings in R programming – strsplit() method strsplit() method in R Programming Language is used to split the string by using a delimiter.


1 Answers

A tidyr option using separate

library(dplyr)
library(tidyr)
df %>% separate(code, paste0("code_", 1:3), sep = "/", fill = "right", remove = F)
#                 code     code_1   code_2 code_3
#1 PC211/212.5(C)/664F      PC211 212.5(C)   664F
#2   VC23152(A)/23550F VC23152(A)   23550F   <NA>
#3       PC459/460(B)M      PC459  460(B)M   <NA>
#4       PC187(A)/664F   PC187(A)     664F   <NA>

Note that your expected output doesn't seem to match your input data. For example, for row 1 your expected output for code_3 gives "PC664F", whereas the relevant input string is "664F". code_2 for the same row has "212.5(C)F" whereas the input string is "212.5(C)". I assume these are errors.


Update

Thanks to @andrew_reece's comment I (think I) now understand your question. Here is an option

df %>%
    rowid_to_column("row") %>%
    separate(code, c("prefix", "main", "suffix"), sep = c(2, -1), remove = F) %>%
    separate(main, into = paste0("code_", 1:3), sep = "/", fill = "right") %>%
    gather(key, entry, starts_with("code_")) %>%
    filter(!is.na(entry)) %>%
    unite(entry, prefix, entry, suffix, sep = "") %>%
    spread(key, entry) %>%
    select(-row)

Explanation: We first separate the prefixes and suffixes from code, then separate the individual components from the main code part. We reshape from wide to long, remove NA entries, and concatenate each code component with the prefix and suffix before reshaping back from long to wide.

This reproduces your expected output except for code_2 in row 1.


An alternative

As an alternative approach, it might be more useful to store the pre- & suffixed codes in a list column, rather than storing them in a wide format with additional columns code_1, code_2 and so on. This has the advantage that you wouldn't have to hard-code the number of codes you have in column code; the following approach will work for any number of codes in code and only assumes that

  1. the first 2 characters of code define the prefix
  2. the last character of code is the suffix.

df %>%
    separate(code, c("prefix", "main", "suffix"), sep = c(2, -1), remove = F) %>%
    transmute(
        code,
        codes_as_list = pmap(
            list(prefix, str_split(main, "/"), suffix),
            function(x, y, z) paste0(x, y, z)))
#                 code               codes_as_list
#1 PC211/212.5(C)/664F PC211F, PC212.5(C)F, PC664F
#2   VC23152(A)/23550F       VC23152(A)F, VC23550F
#3       PC459/460(B)M           PC459M, PC460(B)M
#4       PC187(A)/664F           PC187(A)F, PC664F

Note that codes_as_list is now a list column with the properly pre/suffixed codes, making it easy to operate on the elements with the purrr::map machinery.

like image 84
Maurits Evers Avatar answered Nov 15 '22 01:11

Maurits Evers