I have a dataset that has data that was entered in one column that should have been put into three separate columns. I'm hoping to separate it into three columns based on a backslash, but each split must include a character prefix which is only found on the first part of the string, and a character suffix which is only found at the end.
So something like "PC211/212.5(C)/664F" has the prefix "PC", and the suffix "F". The prefix is always 2 letters, and the suffix is always 1, and they are always characters. The prefix is always followed by a numeric code, and the suffix is always preceded by either a number or an end parentheses
My data is in a very large dataframe, so I'd like to be able to call it out by column. Here's a reproducible example of a tiny subset of the data I'm working with:
df <- data.frame("code" = c("PC211/212.5(C)/664F", "VC23152(A)/23550F", "PC459/460(B)M", "PC187(A)/664F"), stringsAsFactors = FALSE)
I'd like it to return something like:
df_id_like <- data.frame("code" = c("PC211/212.5(C)/664F", "VC23152(A)/23550F", "PC459/460(B)M", "PC187(A)/664F"),
"code_1" = c("PC211F", "VC23152(A)F", "PC459M", "PC187F"),
"code_2" = c("212.5(C)F", "VC23550F", "PC460(B)M", "PC664F"),
"code_3" = c("PC664F", NA, NA, NA),
stringsAsFactors = FALSE)
I think the solution may need regex, but am totally open to a solution that doesn't!
To split a column into multiple columns in the R Language, We use the str_split_fixed() function of the stringr package library. The str_split_fixed() function splits up a string into a fixed number of pieces.
Select Home > Split Column > By Number of Characters. The Split a column by Number of Characters dialog box appears. In the Number of characters textbox, enter the number of characters used to split the text column.
Splitting Strings in R programming – strsplit() method strsplit() method in R Programming Language is used to split the string by using a delimiter.
A tidyr
option using separate
library(dplyr)
library(tidyr)
df %>% separate(code, paste0("code_", 1:3), sep = "/", fill = "right", remove = F)
# code code_1 code_2 code_3
#1 PC211/212.5(C)/664F PC211 212.5(C) 664F
#2 VC23152(A)/23550F VC23152(A) 23550F <NA>
#3 PC459/460(B)M PC459 460(B)M <NA>
#4 PC187(A)/664F PC187(A) 664F <NA>
Note that your expected output doesn't seem to match your input data. For example, for row 1 your expected output for code_3
gives "PC664F"
, whereas the relevant input string is "664F"
. code_2
for the same row has "212.5(C)F"
whereas the input string is "212.5(C)"
. I assume these are errors.
Thanks to @andrew_reece's comment I (think I) now understand your question. Here is an option
df %>%
rowid_to_column("row") %>%
separate(code, c("prefix", "main", "suffix"), sep = c(2, -1), remove = F) %>%
separate(main, into = paste0("code_", 1:3), sep = "/", fill = "right") %>%
gather(key, entry, starts_with("code_")) %>%
filter(!is.na(entry)) %>%
unite(entry, prefix, entry, suffix, sep = "") %>%
spread(key, entry) %>%
select(-row)
Explanation: We first separate
the prefixes and suffixes from code
, then separate
the individual components from the main code
part. We reshape from wide to long, remove NA
entries, and concatenate each code
component with the prefix
and suffix
before reshaping back from long to wide.
This reproduces your expected output except for code_2
in row 1.
As an alternative approach, it might be more useful to store the pre- & suffixed codes in a list
column, rather than storing them in a wide format with additional columns code_1
, code_2
and so on. This has the advantage that you wouldn't have to hard-code the number of codes you have in column code
; the following approach will work for any number of codes in code
and only assumes that
code
define the prefix
code
is the suffix
.df %>%
separate(code, c("prefix", "main", "suffix"), sep = c(2, -1), remove = F) %>%
transmute(
code,
codes_as_list = pmap(
list(prefix, str_split(main, "/"), suffix),
function(x, y, z) paste0(x, y, z)))
# code codes_as_list
#1 PC211/212.5(C)/664F PC211F, PC212.5(C)F, PC664F
#2 VC23152(A)/23550F VC23152(A)F, VC23550F
#3 PC459/460(B)M PC459M, PC460(B)M
#4 PC187(A)/664F PC187(A)F, PC664F
Note that codes_as_list
is now a list
column with the properly pre/suffixed codes, making it easy to operate on the elements with the purrr::map
machinery.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With