Splitting strings from integers in R

Tags: r, statistics

I've recently come across an interesting problem while trying to create a custom database.

My rows have the form:

 183746IGH
 105928759UBS

and so on: an integer concatenated with a string, both of arbitrary length. I'd like to put the whole number in column 1 and everything else (the letters) in column 2. How can this be done? I've tried strsplit, but it doesn't seem to offer this kind of functionality.

Thank you for any help.

asked May 01 '15 22:05 by sdgaw erzswer



2 Answers

One option is tstrsplit from data.table (v1.9.5+, the devel version at the time of writing):

library(data.table)#v1.9.5+
setDT(df)[,tstrsplit(V1,'(?<=\\d)(?=\\D)', perl=TRUE, type.convert=TRUE)]
#        V1      V2
#1:   131341    adad
#2:    45365  adadar
#3:      425 cavsbsb
#4: 46567567 daadvsv

If there are elements where the non-numeric part appears first and the numeric part last, we can use a slightly more general regex pattern:

 setDT(df)[,tstrsplit(V1, "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)",
                  perl = TRUE)]
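
To see what the boundary pattern does on its own, here is a minimal base R sketch that passes the same regex to strsplit with perl = TRUE (tstrsplit is documented as a transposed strsplit). The vector `mixed` is made-up illustration data, not taken from the question.

```r
# Hypothetical inputs: digits first, letters first, and alternating runs
mixed <- c("131341adad", "adad131341", "ab12cd")

# Zero-width split points: a digit followed by a non-digit, or the reverse
boundary <- "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)"

pieces <- strsplit(mixed, boundary, perl = TRUE)
pieces[[1]]  # "131341" "adad"
pieces[[2]]  # "adad"   "131341"
pieces[[3]]  # "ab" "12" "cd"
```

Because the split happens at every digit/non-digit boundary, inputs with more than two runs produce more than two pieces, so the number of resulting columns depends on the data.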

Or using extract from tidyr

library(tidyr)
extract(df, V1, into=c('V1', 'V2'), '(\\d+)(\\D+)', convert=TRUE)
#        V1      V2
#1   131341    adad
#2    45365  adadar
#3      425 cavsbsb
#4 46567567 daadvsv

If you need the original column as well,

 extract(df, V1, into=c('V2', 'V3'), '(\\d+)(\\D+)',
                               convert=TRUE, remove=FALSE)
 #               V1       V2      V3
 #1      131341adad   131341    adad
 #2     45365adadar    45365  adadar
 #3      425cavsbsb      425 cavsbsb
 #4 46567567daadvsv 46567567 daadvsv
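
The pattern '(\\d+)(\\D+)' passed to extract has two capture groups, one per output column. As a rough illustration of that idea (not tidyr's actual implementation), the same groups can be pulled out in base R with regexec and regmatches. The names `m`, `groups` and `out` are only for this sketch.

```r
x <- c("131341adad", "45365adadar")

# regexec() records the full match plus each parenthesised group
m <- regmatches(x, regexec("(\\d+)(\\D+)", x))
m[[1]]  # "131341adad" "131341" "adad"

# Drop the full match and bind the two groups into columns
groups <- do.call(rbind, lapply(m, function(g) g[-1]))
out <- data.frame(V1 = as.integer(groups[, 1]),
                  V2 = groups[, 2],
                  stringsAsFactors = FALSE)
out
```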

With data.table, we can use := to create the new columns so that the existing column remains in the output:

setDT(df)[,paste0('V',2:3):=tstrsplit(V1,'(?<=\\d)(?=\\D)',
                     perl=TRUE, type.convert=TRUE)]
#               V1       V2      V3
#1:      131341adad   131341    adad
#2:     45365adadar    45365  adadar
#3:      425cavsbsb      425 cavsbsb
#4: 46567567daadvsv 46567567 daadvsv

NOTE: Both solutions can convert the class of the split columns (type.convert in tstrsplit, convert in extract).
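
As far as I know, both arguments delegate to utils::type.convert, so the effect can be previewed on a plain character vector. A small sketch, using the numeric pieces from the example data:

```r
parts <- c("131341", "45365", "425", "46567567")
class(parts)      # "character"

# as.is = TRUE keeps non-numeric strings as character instead of factor
converted <- type.convert(parts, as.is = TRUE)
class(converted)  # "integer"

# The letter column stays character
class(type.convert(c("adad", "adadar"), as.is = TRUE))  # "character"
```

Without the conversion, the number column would remain character and would sort and compare as text.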

data

df <- data.frame(V1 = c("131341adad", "45365adadar", "425cavsbsb", 
               "46567567daadvsv"))
answered Oct 20 '22 22:10 by akrun


Another way, using base R and regular expressions:

all <- c(' 183746IGH','105928759UBS')

# Remove the letters to keep the digits
numeric <- sapply(all, function(x) sub('[[:alpha:]]+','', x))

# Remove the digits to keep the letters
alphabetic <- sapply(all, function(x) sub('[[:digit:]]+','', x))

> data.frame(all, alphabetic, numeric)
                      all alphabetic   numeric
 183746IGH      183746IGH        IGH    183746
105928759UBS 105928759UBS        UBS 105928759

Or as per @rawr's comment below:

> read.table(text = gsub('(\\d)(\\D)', '\\1 \\2', all))
         V1  V2
1    183746 IGH
2 105928759 UBS
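
The one-liner works in two steps: gsub inserts a space wherever a digit is directly followed by a non-digit, and read.table then parses the result as whitespace-separated fields. A sketch that exposes the intermediate string (`spaced` and `parsed` are names chosen for illustration):

```r
x <- c(" 183746IGH", "105928759UBS")

# Step 1: put a space between the last digit and the first letter
spaced <- gsub("(\\d)(\\D)", "\\1 \\2", x)
spaced  # " 183746 IGH"   "105928759 UBS"

# Step 2: parse as whitespace-delimited text; column types are guessed
parsed <- read.table(text = spaced, stringsAsFactors = FALSE)
str(parsed)
```

Leading whitespace is ignored by read.table, which is why the stray space in the first element does no harm. One caveat: with the default na.strings, a letter part that is exactly "NA" would be read as missing.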

Or wrap the same substitutions in a function and vectorise it with Vectorize:

get_alphanum <- function(x, type) {
  type <- switch(type,
                 alpha = '[[:digit:]]+',
                 digit = '[[:alpha:]]+')
  sub(type,'', x)
}

get_alphanum <- Vectorize(get_alphanum)

The function can then be applied directly to a vector:

> get_alphanum(all, type='alpha')
   183746IGH 105928759UBS 
      " IGH"        "UBS" 
> get_alphanum(all, type='digit')
   183746IGH 105928759UBS 
   " 183746"  "105928759" 
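
Note that sub is already vectorised over its input, so Vectorize mainly adds two things here: the input strings as names on the result, and the ability to vary type per element. A small sketch comparing the two; `strip_class` is a hypothetical copy of the function above under a different name.

```r
strip_class <- function(x, type) {
  pattern <- switch(type,
                    alpha = "[[:digit:]]+",
                    digit = "[[:alpha:]]+")
  sub(pattern, "", x)
}

x <- c(" 183746IGH", "105928759UBS")

plain   <- strip_class(x, type = "alpha")             # unnamed result
wrapped <- Vectorize(strip_class)(x, type = "alpha")  # named by input

identical(plain, unname(wrapped))  # TRUE

# A different type per element only works with the Vectorize() wrapper
Vectorize(strip_class)(x, type = c("alpha", "digit"))
```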

It can also be used to create a data.frame:

> data.frame(all, 
             alpha=get_alphanum(all, type='alpha') ,
             numeric=get_alphanum(all, type='digit'))
                      all alpha   numeric
 183746IGH      183746IGH   IGH    183746
105928759UBS 105928759UBS   UBS 105928759
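
The first input string carries a leading space, which survives in the extracted pieces (" IGH", " 183746"), and the number column is still character. A possible clean-up, sketched with trimws and as.integer; the object names `raw`, `clean` and `parsed` are illustrative only.

```r
raw <- c(" 183746IGH", "105928759UBS")

# Strip surrounding whitespace before splitting
clean <- trimws(raw)

parsed <- data.frame(
  all     = clean,
  alpha   = sub("[[:digit:]]+", "", clean),              # letters only
  numeric = as.integer(sub("[[:alpha:]]+", "", clean)),  # digits as integer
  stringsAsFactors = FALSE
)
parsed
```

as.integer is fine for the values shown; numbers above the integer range (about 2.1e9) would need as.numeric instead.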
answered Oct 20 '22 22:10 by LyzandeR