I am working on HCUP data and this has range of values in one single column that needs to be split into multiple columns. Below is the HCUP data frame for reference : <pre class="prettyprint"><code>code label 61000-61003 excision of CNS 0169T-0169T ventricular shunt </code></pre> The desired output should be : <pre class="prettyprint"><code>code label 61000 excision of CNS 61001 excision of CNS 61002 excision of CNS 61003 excision of CNS 0169T ventricular shunt </code></pre> My approach to this problem is using the package splitstackshape and using this code <pre class="prettyprint"><code>library(data.table) library(splitstackshape) cSplit(hcup, "code", "-")[, list(code = code_1:code_2, by = label)] </code></pre> This approach leads to memory issues. Is there a better approach to this problem? Some comments : <ul> <li>The data has many letters apart from "T".</li> <li>The letter can be either in the front or at the very end but not in between two numbers. </li> <li>There is no change of letter from "T" to "U" in one single range</li> </ul>

Original Answer: See below for update. First, I made your example data a little more challenging by adding the first row to the bottom. <pre class="prettyprint"><code>dff <- structure(list(code = c("61000-61003", "0169T-0169T", "61000-61003" ), label = c("excision of CNS", "ventricular shunt", "excision of CNS" )), .Names = c("code", "label"), row.names = c(NA, 3L), class = "data.frame") dff # code label # 1 61000-61003 excision of CNS # 2 0169T-0169T ventricular shunt # 3 61000-61003 excision of CNS </code></pre> We can use the sequence operator <code>:</code> to get the sequences for the <code>code</code> column, wrapping with <code>tryCatch()</code> so we can avoid an error on, and save the values that cannot be sequenced. First we split the values by the dash mark <code>-</code> then run it through <code>lapply()</code>. <pre class="prettyprint"><code>xx <- lapply( strsplit(dff$code, "-", fixed = TRUE), function(x) tryCatch(x[1]:x[2], warning = function(w) x) ) data.frame(code = unlist(xx), label = rep(dff$label, lengths(xx))) # code label # 1 61000 excision of CNS # 2 61001 excision of CNS # 3 61002 excision of CNS # 4 61003 excision of CNS # 5 0169T ventricular shunt # 6 0169T ventricular shunt # 7 61000 excision of CNS # 8 61001 excision of CNS # 9 61002 excision of CNS # 10 61003 excision of CNS </code></pre> We're trying to apply the sequence operator <code>:</code> to each element from <code>strsplit()</code>, and if taking <code>x[1]:x[2]</code> is not possible then this returns just the values for those elements and proceeds with the sequence <code>x[1]:x[2]</code> otherwise. Then we just replicate the values of the <code>label</code> column based on the resulting lengths in <code>xx</code> to get the new <code>label</code> column. <hr> Update: Here is what I've come up with in response to your edit. Replace <code>xx</code> above with <pre class="prettyprint"><code>xx <- lapply(strsplit(dff$code, "-", TRUE), function(x) { s <- stringi::stri_locate_first_regex(x, "[A-Z]") nc <- nchar(x)[1L] fmt <- function(n) paste0("%0", n, "d") if(!all(is.na(s))) { ss <- s[1,1] fmt <- fmt(nc-1) if(ss == 1L) { xx <- substr(x, 2, nc) paste0(substr(x, 1, 1), sprintf(fmt, xx[1]:xx[2])) } else { xx <- substr(x, 1, ss-1) paste0(sprintf(fmt, xx[1]:xx[2]), substr(x, nc, nc)) } } else { sprintf(fmt(nc), x[1]:x[2]) } }) </code></pre> Yep, it's complicated. Now if we take the following data frame <code>df2</code> as a test case <pre class="prettyprint"><code>df2 <- structure(list(code = c("61000-61003", "0169T-0174T", "61000-61003", "T0169-T0174"), label = c("excision of CNS", "ventricular shunt", "excision of CNS", "ventricular shunt")), .Names = c("code", "label"), row.names = c(NA, 4L), class = "data.frame") </code></pre> and run the <code>xx</code> code from above on it, we can get the following result. <pre class="prettyprint"><code>data.frame(code = unlist(xx), label = rep(df2$label, lengths(xx))) # code label # 1 61000 excision of CNS # 2 61001 excision of CNS # 3 61002 excision of CNS # 4 61003 excision of CNS # 5 0169T ventricular shunt # 6 0170T ventricular shunt # 7 0171T ventricular shunt # 8 0172T ventricular shunt # 9 0173T ventricular shunt # 10 0174T ventricular shunt # 11 61000 excision of CNS # 12 61001 excision of CNS # 13 61002 excision of CNS # 14 61003 excision of CNS # 15 T0169 ventricular shunt # 16 T0170 ventricular shunt # 17 T0171 ventricular shunt # 18 T0172 ventricular shunt # 19 T0173 ventricular shunt # 20 T0174 ventricular shunt </code></pre>

Splitting a single column into multiple observation using R

Tags:

r

data.table

data-cleaning

medical

splitstackshape

I am working on HCUP data and this has range of values in one single column that needs to be split into multiple columns. Below is the HCUP data frame for reference :

code            label
61000-61003     excision of CNS
0169T-0169T     ventricular shunt

The desired output should be :

code            label
61000           excision of CNS
61001           excision of CNS
61002           excision of CNS
61003           excision of CNS
0169T           ventricular shunt

My approach to this problem is using the package splitstackshape and using this code

library(data.table)
library(splitstackshape)

cSplit(hcup, "code", "-")[, list(code = code_1:code_2, by = label)]

This approach leads to memory issues. Is there a better approach to this problem?

Some comments :

The data has many letters apart from "T".
The letter can be either in the front or at the very end but not in between two numbers.
There is no change of letter from "T" to "U" in one single range

808

asked Oct 13 '15 21:10

x1carbon

3 Answers

Here's a solution using dplyr and all.is.numeric from Hmisc:

library(dplyr)
library(Hmisc)
library(tidyr)
dat %>% separate(code, into=c("code1", "code2")) %>%
        rowwise %>%
        mutate(lists = ifelse(all.is.numeric(c(code1, code2)),
                         list(as.character(seq(from = as.numeric(code1), to = as.numeric(code2)))),
                         list(code1))) %>%
        unnest(lists) %>%
        select(code = lists, label)

Source: local data frame [5 x 2]

   code             label
  (chr)            (fctr)
1 61000   excision of CNS
2 61001   excision of CNS
3 61002   excision of CNS
4 61003   excision of CNS
5 0169T ventricular shunt

An edit to fix ranges with character values. Brings down the simplicity a little:

dff %>% mutate(row = row_number()) %>%
        separate(code, into=c("code1", "code2")) %>%
        group_by(row) %>%
        summarise(lists = if(all.is.numeric(c(code1, code2)))
                              {list(str_pad(as.character(
                                   seq(from = as.numeric(code1), to = as.numeric(code2))),
                                         nchar(code1), pad="0"))}
                          else if(grepl("^[0-9]", code1))
                              {list(str_pad(paste0(as.character(
                                   seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                                      strsplit(code1, "[0-9]+")[[1]][2]),
                                         nchar(code1), pad = "0"))}
                          else
                              {list(paste0(
                                      strsplit(code1, "[0-9]+")[[1]],
                                      str_pad(as.character(
                                    seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                                         nchar(gsub("[^0-9]", "", code1)), pad="0")))},
                   label = first(label)) %>%
        unnest(lists) %>%
        select(-row)
Source: local data frame [15 x 2]

               label lists
               (chr) (chr)
1    excision of CNS 61000
2    excision of CNS 61001
3    excision of CNS 61002
4  ventricular shunt 0169T
5  ventricular shunt 0170T
6  ventricular shunt 0171T
7    excision of CNS 01000
8    excision of CNS 01001
9    excision of CNS 01002
10    some procedure A2543
11    some procedure A2544
12    some procedure A2545
13    some procedure A0543
14    some procedure A0544
15    some procedure A0545

data:

dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002", 
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt", 
"excision of CNS", "some procedure", "some procedure")), .Names = c("code", 
"label"), row.names = c(NA, 5L), class = "data.frame")

answered Sep 19 '22 15:09

jeremycg

Original Answer: See below for update.

First, I made your example data a little more challenging by adding the first row to the bottom.

dff <- structure(list(code = c("61000-61003", "0169T-0169T", "61000-61003"
), label = c("excision of CNS", "ventricular shunt", "excision of CNS"
)), .Names = c("code", "label"), row.names = c(NA, 3L), class = "data.frame")

dff
#          code             label
# 1 61000-61003   excision of CNS
# 2 0169T-0169T ventricular shunt
# 3 61000-61003   excision of CNS

We can use the sequence operator : to get the sequences for the code column, wrapping with tryCatch() so we can avoid an error on, and save the values that cannot be sequenced. First we split the values by the dash mark - then run it through lapply().

xx <- lapply(
    strsplit(dff$code, "-", fixed = TRUE), 
    function(x) tryCatch(x[1]:x[2], warning = function(w) x)
)
data.frame(code = unlist(xx), label = rep(dff$label, lengths(xx)))
#     code             label
# 1  61000   excision of CNS
# 2  61001   excision of CNS
# 3  61002   excision of CNS
# 4  61003   excision of CNS
# 5  0169T ventricular shunt
# 6  0169T ventricular shunt
# 7  61000   excision of CNS
# 8  61001   excision of CNS
# 9  61002   excision of CNS
# 10 61003   excision of CNS

We're trying to apply the sequence operator : to each element from strsplit(), and if taking x[1]:x[2] is not possible then this returns just the values for those elements and proceeds with the sequence x[1]:x[2] otherwise. Then we just replicate the values of the label column based on the resulting lengths in xx to get the new label column.

Update: Here is what I've come up with in response to your edit. Replace xx above with

xx <- lapply(strsplit(dff$code, "-", TRUE), function(x) {
    s <- stringi::stri_locate_first_regex(x, "[A-Z]")
    nc <- nchar(x)[1L]
    fmt <- function(n) paste0("%0", n, "d")
    if(!all(is.na(s))) {
        ss <- s[1,1]
        fmt <- fmt(nc-1)
        if(ss == 1L) {
            xx <- substr(x, 2, nc)
            paste0(substr(x, 1, 1), sprintf(fmt, xx[1]:xx[2]))
        } else {
            xx <- substr(x, 1, ss-1)
            paste0(sprintf(fmt, xx[1]:xx[2]), substr(x, nc, nc))
        }
    } else {
        sprintf(fmt(nc), x[1]:x[2])
    }
})

Yep, it's complicated. Now if we take the following data frame df2 as a test case

df2 <- structure(list(code = c("61000-61003", "0169T-0174T", "61000-61003", 
"T0169-T0174"), label = c("excision of CNS", "ventricular shunt", 
"excision of CNS", "ventricular shunt")), .Names = c("code", 
"label"), row.names = c(NA, 4L), class = "data.frame")

and run the xx code from above on it, we can get the following result.

data.frame(code = unlist(xx), label = rep(df2$label, lengths(xx)))
#     code             label
# 1  61000   excision of CNS
# 2  61001   excision of CNS
# 3  61002   excision of CNS
# 4  61003   excision of CNS
# 5  0169T ventricular shunt
# 6  0170T ventricular shunt
# 7  0171T ventricular shunt
# 8  0172T ventricular shunt
# 9  0173T ventricular shunt
# 10 0174T ventricular shunt
# 11 61000   excision of CNS
# 12 61001   excision of CNS
# 13 61002   excision of CNS
# 14 61003   excision of CNS
# 15 T0169 ventricular shunt
# 16 T0170 ventricular shunt
# 17 T0171 ventricular shunt
# 18 T0172 ventricular shunt
# 19 T0173 ventricular shunt
# 20 T0174 ventricular shunt

answered Sep 20 '22 15:09

Rich Scriven

Create a sequencing rule for such codes:

seq_code <- function(from,to){

    ext = function(x, part) gsub("([^0-9]?)([0-9]*)([^0-9]?)", paste0("\\",part), x)

    pre = unique(sapply(list(from,to), ext, part = 1 ))
    suf = unique(sapply(list(from,to), ext, part = 3 ))

    if (length(pre) > 1 | length(suf) > 1){
        return("NO!")
    }

    num = do.call(seq, lapply(list(from,to), function(x) as.integer(ext(x, part = 2))))
    len = nchar(from)-nchar(pre)-nchar(suf)

    paste0(pre, sprintf(paste0("%0",len,"d"), num), suf)

}

With @jeremycg's example:

setDT(dff)[,.(
  label = label[1], 
  code  = do.call(seq_code, tstrsplit(code,'-'))
), by=.(row=seq(nrow(dff)))]

which gives

    row             label  code
 1:   1   excision of CNS 61000
 2:   1   excision of CNS 61001
 3:   1   excision of CNS 61002
 4:   2 ventricular shunt 0169T
 5:   2 ventricular shunt 0170T
 6:   2 ventricular shunt 0171T
 7:   3   excision of CNS 01000
 8:   3   excision of CNS 01001
 9:   3   excision of CNS 01002
10:   4    some procedure A2543
11:   4    some procedure A2544
12:   4    some procedure A2545
13:   5    some procedure A0543
14:   5    some procedure A0544
15:   5    some procedure A0545

Data copied from @jeremycg's answer:

dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002", 
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt", 
"excision of CNS", "some procedure", "some procedure")), .Names = c("code", 
"label"), row.names = c(NA, 5L), class = "data.frame")

answered Sep 19 '22 15:09

Frank

Related questions
                            
                                In R, Merge two data frames, fill down the blanks
                            
                                split string with regex
                            
                                Why doesn't the plyr package use my parallel backend?
                            
                                Simple method of counting non-NAs in column of data String [duplicate]
                            
                                Subset variables in data frame based on column type
                            
                                R calculate the standard error using bootstrap
                            
                                Passing large matrices to RcppArmadillo function without creating copy (advanced constructors)
                            
                                Efficient method to subset drop rows with NA values in R
                            
                                Count the number of pattern matches in a string
                            
                                How can I extract factor loadings from lavaan?
                            
                                mean( ,na.rm=TRUE) still returns NA
                            
                                Replace text that appears at the end of a string
                            
                                Use string as filter in dplyr?
                            
                                How to build a crossword-like plot for a boolean matrix
                            
                                R: find vector in list of vectors
                            
                                Search for and remove outliers from a dataframe grouped by a variable
                            
                                R-ranking values of a column by grouping, conditional to another variable
                            
                                Basic - T-Test -> Grouping Factor Must have Exactly 2 Levels
                            
                                data.table replacing a value by NA [duplicate]
                            
                                Most Efficient way to create a symmetric matrix

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With