Splitting strings from integers in R

Tags:

statistics

I've recently come across an interesting problem while trying to create a custom database.

My rows are in the form:

 183746IGH
 105928759UBS

and so on (basically an integer concatenated with a string, both of arbitrary length). I'm trying to put the whole number in column 1 and everything else (the letters) in column 2. How can this be done? I've been trying with strsplit, but it doesn't seem to offer this kind of functionality.

Thank you for any help.


asked May 01 '15 22:05

sdgaw erzswer

2 Answers

Other options include tstrsplit from the devel version of data.table

library(data.table) # v1.9.5+
setDT(df)[,tstrsplit(V1,'(?<=\\d)(?=\\D)', perl=TRUE, type.convert=TRUE)]
#        V1      V2
#1:   131341    adad
#2:    45365  adadar
#3:      425 cavsbsb
#4: 46567567 daadvsv

If there are elements where the 'non-numeric' part appears first and the 'numeric' part last, we can use a more general regex pattern that splits at either kind of boundary:

 setDT(df)[,tstrsplit(V1, "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)",
                  perl = TRUE)]
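
For comparison, here is a minimal base-R sketch of the same idea (not part of the original answer). Instead of splitting at the boundaries, it extracts every run of digits or non-digits, so it works for either order. The sample vector `x` is hypothetical.

```r
# Hypothetical sample data: digits first in one element, letters first in the other
x <- c("183746IGH", "UBS105928759")

# gregexpr() locates every run of digits (\\d+) or non-digits (\\D+);
# regmatches() extracts those runs, giving one character vector per element
parts <- regmatches(x, gregexpr("\\d+|\\D+", x))
parts
```

The result is a list with one character vector per input element, so elements with a different number of runs do not need to be padded.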

Or using extract from tidyr

library(tidyr)
extract(df, V1, into=c('V1', 'V2'), '(\\d+)(\\D+)', convert=TRUE)
#        V1      V2
#1   131341    adad
#2    45365  adadar
#3      425 cavsbsb
#4 46567567 daadvsv

If you need the original column as well,

 extract(df, V1, into=c('V2', 'V3'), '(\\d+)(\\D+)',
                               convert=TRUE, remove=FALSE)
 #               V1       V2      V3
 #1      131341adad   131341    adad
 #2     45365adadar    45365  adadar
 #3      425cavsbsb      425 cavsbsb
 #4 46567567daadvsv 46567567 daadvsv
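
If tidyr is not available, the same two-group pattern can probably be applied with `strcapture()` from the utils package (assuming R >= 3.4.0). This is only a sketch: the column names V2 and V3 are chosen to mirror the output above, and the input vector is copied from the data section.

```r
x <- c("131341adad", "45365adadar", "425cavsbsb", "46567567daadvsv")

# proto is a zero-row template: its column names and classes define the output
res <- strcapture("(\\d+)(\\D+)", x,
                  proto = data.frame(V2 = integer(), V3 = character(),
                                     stringsAsFactors = FALSE),
                  perl = TRUE)

# Put the original strings back in front, similar to remove = FALSE
out <- data.frame(V1 = x, res, stringsAsFactors = FALSE)
out
```

Because the prototype declares V2 as integer, the digits are converted during extraction rather than in a separate step.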

For the data.table approach, we can use := to create the new columns so that the existing columns remain in the output, i.e.

setDT(df)[,paste0('V',2:3):=tstrsplit(V1,'(?<=\\d)(?=\\D)',
                     perl=TRUE, type.convert=TRUE)]
#               V1       V2      V3
#1:      131341adad   131341    adad
#2:     45365adadar    45365  adadar
#3:      425cavsbsb      425 cavsbsb
#4: 46567567daadvsv 46567567 daadvsv

NOTE: Both solutions have an option to convert the class of the split columns (type.convert in tstrsplit, convert in extract).
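
To see what that conversion does, here is a small sketch using `utils::type.convert()` directly. As far as I know, both options call it with `as.is = TRUE`, but treat that as an assumption and check the package documentation. The vector `num_chr` is hypothetical.

```r
# Pieces as they come out of a string split are always character
num_chr <- c("131341", "45365", "425", "46567567")
class(num_chr)

# type.convert() picks a suitable class for the values; as.is = TRUE
# keeps text that is not numeric as character instead of factor
num <- type.convert(num_chr, as.is = TRUE)
class(num)
```

Without the conversion, the number column stays character and sorts alphabetically rather than numerically.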

data

df <- data.frame(V1 = c("131341adad", "45365adadar", "425cavsbsb", 
               "46567567daadvsv"))


answered Oct 20 '22 22:10

akrun

And another way with base-R and regular expressions:

all <- c(' 183746IGH','105928759UBS')

# keep the digits by removing the letters
numeric <- sapply(all, function(x) sub('[[:alpha:]]+','', x))

# keep the letters by removing the digits
alphabetic <- sapply(all, function(x) sub('[[:digit:]]+','', x))

> data.frame(all, alphabetic, numeric)
                      all alphabetic   numeric
 183746IGH      183746IGH        IGH    183746
105928759UBS 105928759UBS        UBS 105928759

Or as per @rawr's comment below:

> read.table(text = gsub('(\\d)(\\D)', '\\1 \\2', all))
         V1  V2
1    183746 IGH
2 105928759 UBS
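
To make that one-liner easier to follow, this sketch splits it into its two steps: `gsub()` inserts a space at each digit-to-letter boundary, and `read.table()` then parses the result as whitespace-separated columns. The names `x`, `spaced` and `res` are only for illustration.

```r
x <- c(" 183746IGH", "105928759UBS")

# (\\d)(\\D) matches a digit followed by a non-digit; the replacement
# puts a space between the two captured characters
spaced <- gsub("(\\d)(\\D)", "\\1 \\2", x)
spaced

# read.table() splits on whitespace and ignores the leading blank
res <- read.table(text = spaced, stringsAsFactors = FALSE)
str(res)
```

Note that every digit-to-letter boundary becomes a column break, so a value with more than one such boundary would produce extra fields.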

Or a vectorised version of the above with a function:

get_alphanum <- function(x, type) {
  type <- switch(type,
                 alpha = '[[:digit:]]+',
                 digit = '[[:alpha:]]+')
  sub(type,'', x)
}

get_alphanum <- Vectorize(get_alphanum)

which can be applied directly to a vector:

> get_alphanum(all, type='alpha')
   183746IGH 105928759UBS 
      " IGH"        "UBS" 
> get_alphanum(all, type='digit')
   183746IGH 105928759UBS 
   " 183746"  "105928759"

which can also be used to create a data.frame:

> data.frame(all, 
             alpha=get_alphanum(all, type='alpha') ,
             numeric=get_alphanum(all, type='digit'))
                      all alpha   numeric
 183746IGH      183746IGH   IGH    183746
105928759UBS 105928759UBS   UBS 105928759
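
One detail in the output above: the first element was typed with a leading space, so it carries through as " IGH" and " 183746". Also, `sub()` is already vectorised over its input, so the `Vectorize()` wrapper is optional. A minimal sketch that trims the whitespace first and converts the digits to numbers (the names `x`, `x_trim` and `out` are hypothetical):

```r
x <- c(" 183746IGH", "105928759UBS")

# Strip leading and trailing whitespace before splitting
x_trim <- trimws(x)

out <- data.frame(
  all     = x_trim,
  alpha   = sub("[[:digit:]]+", "", x_trim),             # drop the digits
  numeric = as.numeric(sub("[[:alpha:]]+", "", x_trim)), # drop the letters
  stringsAsFactors = FALSE
)
out
```

As with the functions above, `sub()` only removes the first run it finds, which is enough when each value is one number followed by one block of letters.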

answered Oct 20 '22 22:10

LyzandeR
