I am working on HCUP data and this has range of values in one single column that needs to be split into multiple columns. Below is the HCUP data frame for reference :
code label
61000-61003 excision of CNS
0169T-0169T ventricular shunt
The desired output should be :
code label
61000 excision of CNS
61001 excision of CNS
61002 excision of CNS
61003 excision of CNS
0169T ventricular shunt
My approach to this problem is using the package splitstackshape and using this code
library(data.table)
library(splitstackshape)
cSplit(hcup, "code", "-")[, list(code = code_1:code_2, by = label)]
This approach leads to memory issues. Is there a better approach to this problem?
Some comments :
To split a column into multiple columns in the R Language, we use the separator() function of the dplyr package library. The separate() function separates a character column into multiple columns with a regular expression or numeric locations.
Split column by delimiter into multiple columnsApply the pandas series str. split() function on the “Address” column and pass the delimiter (comma in this case) on which you want to split the column. Also, make sure to pass True to the expand parameter.
Click in a cell, or select multiple cells that you want to split. Under Table Tools, on the Layout tab, in the Merge group, click Split Cells. Enter the number of columns or rows that you want to split the selected cells into.
Here's a solution using dplyr
and all.is.numeric
from Hmisc
:
library(dplyr)
library(Hmisc)
library(tidyr)
dat %>% separate(code, into=c("code1", "code2")) %>%
rowwise %>%
mutate(lists = ifelse(all.is.numeric(c(code1, code2)),
list(as.character(seq(from = as.numeric(code1), to = as.numeric(code2)))),
list(code1))) %>%
unnest(lists) %>%
select(code = lists, label)
Source: local data frame [5 x 2]
code label
(chr) (fctr)
1 61000 excision of CNS
2 61001 excision of CNS
3 61002 excision of CNS
4 61003 excision of CNS
5 0169T ventricular shunt
An edit to fix ranges with character values. Brings down the simplicity a little:
dff %>% mutate(row = row_number()) %>%
separate(code, into=c("code1", "code2")) %>%
group_by(row) %>%
summarise(lists = if(all.is.numeric(c(code1, code2)))
{list(str_pad(as.character(
seq(from = as.numeric(code1), to = as.numeric(code2))),
nchar(code1), pad="0"))}
else if(grepl("^[0-9]", code1))
{list(str_pad(paste0(as.character(
seq(from = extract_numeric(code1), to = extract_numeric(code2))),
strsplit(code1, "[0-9]+")[[1]][2]),
nchar(code1), pad = "0"))}
else
{list(paste0(
strsplit(code1, "[0-9]+")[[1]],
str_pad(as.character(
seq(from = extract_numeric(code1), to = extract_numeric(code2))),
nchar(gsub("[^0-9]", "", code1)), pad="0")))},
label = first(label)) %>%
unnest(lists) %>%
select(-row)
Source: local data frame [15 x 2]
label lists
(chr) (chr)
1 excision of CNS 61000
2 excision of CNS 61001
3 excision of CNS 61002
4 ventricular shunt 0169T
5 ventricular shunt 0170T
6 ventricular shunt 0171T
7 excision of CNS 01000
8 excision of CNS 01001
9 excision of CNS 01002
10 some procedure A2543
11 some procedure A2544
12 some procedure A2545
13 some procedure A0543
14 some procedure A0544
15 some procedure A0545
data:
dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002",
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt",
"excision of CNS", "some procedure", "some procedure")), .Names = c("code",
"label"), row.names = c(NA, 5L), class = "data.frame")
Original Answer: See below for update.
First, I made your example data a little more challenging by adding the first row to the bottom.
dff <- structure(list(code = c("61000-61003", "0169T-0169T", "61000-61003"
), label = c("excision of CNS", "ventricular shunt", "excision of CNS"
)), .Names = c("code", "label"), row.names = c(NA, 3L), class = "data.frame")
dff
# code label
# 1 61000-61003 excision of CNS
# 2 0169T-0169T ventricular shunt
# 3 61000-61003 excision of CNS
We can use the sequence operator :
to get the sequences for the code
column, wrapping with tryCatch()
so we can avoid an error on, and save the values that cannot be sequenced. First we split the values by the dash mark -
then run it through lapply()
.
xx <- lapply(
strsplit(dff$code, "-", fixed = TRUE),
function(x) tryCatch(x[1]:x[2], warning = function(w) x)
)
data.frame(code = unlist(xx), label = rep(dff$label, lengths(xx)))
# code label
# 1 61000 excision of CNS
# 2 61001 excision of CNS
# 3 61002 excision of CNS
# 4 61003 excision of CNS
# 5 0169T ventricular shunt
# 6 0169T ventricular shunt
# 7 61000 excision of CNS
# 8 61001 excision of CNS
# 9 61002 excision of CNS
# 10 61003 excision of CNS
We're trying to apply the sequence operator :
to each element from strsplit()
, and if taking x[1]:x[2]
is not possible then this returns just the values for those elements and proceeds with the sequence x[1]:x[2]
otherwise. Then we just replicate the values of the label
column based on the resulting lengths in xx
to get the new label
column.
Update: Here is what I've come up with in response to your edit. Replace xx
above with
xx <- lapply(strsplit(dff$code, "-", TRUE), function(x) {
s <- stringi::stri_locate_first_regex(x, "[A-Z]")
nc <- nchar(x)[1L]
fmt <- function(n) paste0("%0", n, "d")
if(!all(is.na(s))) {
ss <- s[1,1]
fmt <- fmt(nc-1)
if(ss == 1L) {
xx <- substr(x, 2, nc)
paste0(substr(x, 1, 1), sprintf(fmt, xx[1]:xx[2]))
} else {
xx <- substr(x, 1, ss-1)
paste0(sprintf(fmt, xx[1]:xx[2]), substr(x, nc, nc))
}
} else {
sprintf(fmt(nc), x[1]:x[2])
}
})
Yep, it's complicated. Now if we take the following data frame df2
as a test case
df2 <- structure(list(code = c("61000-61003", "0169T-0174T", "61000-61003",
"T0169-T0174"), label = c("excision of CNS", "ventricular shunt",
"excision of CNS", "ventricular shunt")), .Names = c("code",
"label"), row.names = c(NA, 4L), class = "data.frame")
and run the xx
code from above on it, we can get the following result.
data.frame(code = unlist(xx), label = rep(df2$label, lengths(xx)))
# code label
# 1 61000 excision of CNS
# 2 61001 excision of CNS
# 3 61002 excision of CNS
# 4 61003 excision of CNS
# 5 0169T ventricular shunt
# 6 0170T ventricular shunt
# 7 0171T ventricular shunt
# 8 0172T ventricular shunt
# 9 0173T ventricular shunt
# 10 0174T ventricular shunt
# 11 61000 excision of CNS
# 12 61001 excision of CNS
# 13 61002 excision of CNS
# 14 61003 excision of CNS
# 15 T0169 ventricular shunt
# 16 T0170 ventricular shunt
# 17 T0171 ventricular shunt
# 18 T0172 ventricular shunt
# 19 T0173 ventricular shunt
# 20 T0174 ventricular shunt
Create a sequencing rule for such codes:
seq_code <- function(from,to){
ext = function(x, part) gsub("([^0-9]?)([0-9]*)([^0-9]?)", paste0("\\",part), x)
pre = unique(sapply(list(from,to), ext, part = 1 ))
suf = unique(sapply(list(from,to), ext, part = 3 ))
if (length(pre) > 1 | length(suf) > 1){
return("NO!")
}
num = do.call(seq, lapply(list(from,to), function(x) as.integer(ext(x, part = 2))))
len = nchar(from)-nchar(pre)-nchar(suf)
paste0(pre, sprintf(paste0("%0",len,"d"), num), suf)
}
With @jeremycg's example:
setDT(dff)[,.(
label = label[1],
code = do.call(seq_code, tstrsplit(code,'-'))
), by=.(row=seq(nrow(dff)))]
which gives
row label code
1: 1 excision of CNS 61000
2: 1 excision of CNS 61001
3: 1 excision of CNS 61002
4: 2 ventricular shunt 0169T
5: 2 ventricular shunt 0170T
6: 2 ventricular shunt 0171T
7: 3 excision of CNS 01000
8: 3 excision of CNS 01001
9: 3 excision of CNS 01002
10: 4 some procedure A2543
11: 4 some procedure A2544
12: 4 some procedure A2545
13: 5 some procedure A0543
14: 5 some procedure A0544
15: 5 some procedure A0545
Data copied from @jeremycg's answer:
dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002",
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt",
"excision of CNS", "some procedure", "some procedure")), .Names = c("code",
"label"), row.names = c(NA, 5L), class = "data.frame")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With