How to split all strings in a column AND include prefix in all the new data

Tags:

I have a dataset that has data that was entered in one column that should have been put into three separate columns. I'm hoping to separate it into three columns based on a backslash, but each split must include a character prefix which is only found on the first part of the string, and a character suffix which is only found at the end.

So something like "PC211/212.5(C)/664F" has the prefix "PC", and the suffix "F". The prefix is always 2 letters, and the suffix is always 1, and they are always characters. The prefix is always followed by a numeric code, and the suffix is always preceded by either a number or an end parentheses

My data is in a very large dataframe, so I'd like to be able to call it out by column. Here's a reproducible example of a tiny subset of the data I'm working with:

df <- data.frame("code" = c("PC211/212.5(C)/664F", "VC23152(A)/23550F", "PC459/460(B)M", "PC187(A)/664F"), stringsAsFactors = FALSE)

I'd like it to return something like:

df_id_like <- data.frame("code" = c("PC211/212.5(C)/664F", "VC23152(A)/23550F", "PC459/460(B)M", "PC187(A)/664F"), 
"code_1" = c("PC211F", "VC23152(A)F", "PC459M", "PC187F"), 
"code_2" = c("212.5(C)F", "VC23550F", "PC460(B)M", "PC664F"), 
"code_3" = c("PC664F", NA, NA, NA), 
stringsAsFactors = FALSE)

I think the solution may need regex, but am totally open to a solution that doesn't!

564

asked May 13 '19 01:05

cskn

1 Answers

A tidyr option using separate

library(dplyr)
library(tidyr)
df %>% separate(code, paste0("code_", 1:3), sep = "/", fill = "right", remove = F)
#                 code     code_1   code_2 code_3
#1 PC211/212.5(C)/664F      PC211 212.5(C)   664F
#2   VC23152(A)/23550F VC23152(A)   23550F   <NA>
#3       PC459/460(B)M      PC459  460(B)M   <NA>
#4       PC187(A)/664F   PC187(A)     664F   <NA>

Note that your expected output doesn't seem to match your input data. For example, for row 1 your expected output for code_3 gives "PC664F", whereas the relevant input string is "664F". code_2 for the same row has "212.5(C)F" whereas the input string is "212.5(C)". I assume these are errors.

Update

Thanks to @andrew_reece's comment I (think I) now understand your question. Here is an option

df %>%
    rowid_to_column("row") %>%
    separate(code, c("prefix", "main", "suffix"), sep = c(2, -1), remove = F) %>%
    separate(main, into = paste0("code_", 1:3), sep = "/", fill = "right") %>%
    gather(key, entry, starts_with("code_")) %>%
    filter(!is.na(entry)) %>%
    unite(entry, prefix, entry, suffix, sep = "") %>%
    spread(key, entry) %>%
    select(-row)

Explanation: We first separate the prefixes and suffixes from code, then separate the individual components from the main code part. We reshape from wide to long, remove NA entries, and concatenate each code component with the prefix and suffix before reshaping back from long to wide.

This reproduces your expected output except for code_2 in row 1.

An alternative

As an alternative approach, it might be more useful to store the pre- & suffixed codes in a list column, rather than storing them in a wide format with additional columns code_1, code_2 and so on. This has the advantage that you wouldn't have to hard-code the number of codes you have in column code; the following approach will work for any number of codes in code and only assumes that

the first 2 characters of code define the prefix
the last character of code is the suffix.

df %>%
    separate(code, c("prefix", "main", "suffix"), sep = c(2, -1), remove = F) %>%
    transmute(
        code,
        codes_as_list = pmap(
            list(prefix, str_split(main, "/"), suffix),
            function(x, y, z) paste0(x, y, z)))
#                 code               codes_as_list
#1 PC211/212.5(C)/664F PC211F, PC212.5(C)F, PC664F
#2   VC23152(A)/23550F       VC23152(A)F, VC23550F
#3       PC459/460(B)M           PC459M, PC460(B)M
#4       PC187(A)/664F           PC187(A)F, PC664F

Note that codes_as_list is now a list column with the properly pre/suffixed codes, making it easy to operate on the elements with the purrr::map machinery.

answered Nov 15 '22 01:11

Maurits Evers

Related questions
                            
                                Can R read html-encoded emoji characters?
                            
                                How to conditionally replace values in r data frame using if/then statement
                            
                                Converting a number into time (0,5 of an hour = 00:30:00)
                            
                                R bookdown - custom title page
                            
                                Add titles to ggplots created with map()
                            
                                Set transparency/saturation of palette in ggplot
                            
                                Creating a named vector using dplyr
                            
                                Size legend of sf object won't show correct symbols
                            
                                Stacked barplot with colour gradients for each bar
                            
                                Error in osmar::get_osm() downloading OSM data fails: SYSTEM or PUBLIC, the URI is missing
                            
                                Singularity in backsolve at level 0, block 1 in LME model
                            
                                RDS file size difference between ggplot2 objects created inside vs. outside function
                            
                                Split and re-concatenate a string
                            
                                Retrieve Census tract from Coordinates [closed]
                            
                                dplyr lag with n from column values
                            
                                Center leaflet in a rmarkdown document
                            
                                Fixing the order of a Sankey flow graph in R / networkD3 package
                            
                                How to convert the result of xtabs() into dataframe in R? [duplicate]
                            
                                name character vectors with same name of list
                            
                                How to make in R matrix of intersections and unions over categories?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With