I've recently come across an interesting problem while trying to create a custom database.
My rows are in the form:
183746IGH
105928759UBS
and so on (basically an integer concatenated with a string, both of fairly arbitrary length). What I'm trying to do is put the whole number in column 1 and everything else (the letters) in column 2. How can this be done? I've been trying with strsplit, but it doesn't seem to offer this kind of functionality.
Thank you for any help.
To split a number into digits in R, we can use the strsplit function, converting the number with as.character and then converting the output with as.numeric.
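As a minimal sketch of that digit-splitting idea (the variable names `n` and `digits` are only illustrative), splitting on the empty string yields one element per character, which can then be converted back to numbers:

```r
n <- 183746

# as.character() turns the number into "183746"; splitting on "" gives
# one string per digit, and as.numeric() converts each back to a number
digits <- as.numeric(strsplit(as.character(n), "")[[1]])
digits
```

Note that this splits a number into single digits, which is a different task from separating the numeric prefix from the letters as asked in the question.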
To split a string into a list of integers in Python: use the str.split() method to split the string into a list of strings, then use the map() function to convert each string into an integer.
Other options include tstrsplit from the devel version of data.table
library(data.table)#v1.9.5+
setDT(df)[,tstrsplit(V1,'(?<=\\d)(?=\\D)', perl=TRUE, type.convert=TRUE)]
# V1 V2
#1: 131341 adad
#2: 45365 adadar
#3: 425 cavsbsb
#4: 46567567 daadvsv
If there are elements where the 'non-numeric' part appears first and the 'numeric' part last, then we can use a slightly more general regex pattern that splits at every boundary between a digit and a non-digit, in either direction:
setDT(df)[,tstrsplit(V1, "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)",
perl = TRUE)]
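The same "split at every digit/non-digit boundary" idea can also be sketched in base R without lookarounds, by extracting the runs rather than splitting between them. This is an alternative technique, not the data.table approach above, and the sample vector `x` is made up for illustration:

```r
# Illustrative input: digits first, letters first, and a mixed case
x <- c("131341adad", "adad131341", "12ab34")

# Each match is a maximal run of digits or a maximal run of non-digits;
# gregexpr() finds all runs and regmatches() extracts them
parts <- regmatches(x, gregexpr("[0-9]+|[^0-9]+", x))
parts
```

The result is a list with one character vector per input element, so elements with more than two runs (such as "12ab34") simply produce more pieces.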
Or using extract from tidyr
library(tidyr)
extract(df, V1, into=c('V1', 'V2'), '(\\d+)(\\D+)', convert=TRUE)
# V1 V2
#1 131341 adad
#2 45365 adadar
#3 425 cavsbsb
#4 46567567 daadvsv
If you need the original column as well,
extract(df, V1, into=c('V2', 'V3'), '(\\d+)(\\D+)',
convert=TRUE, remove=FALSE)
# V1 V2 V3
#1 131341adad 131341 adad
#2 45365adadar 45365 adadar
#3 425cavsbsb 425 cavsbsb
#4 46567567daadvsv 46567567 daadvsv
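If adding a package is not an option, the same capture-group pattern can probably be used in base R as well. The following is a sketch assuming R 3.4.0 or later, where utils::strcapture is available; the column names `num` and `txt` are chosen here for illustration only:

```r
df <- data.frame(V1 = c("131341adad", "45365adadar", "425cavsbsb",
                        "46567567daadvsv"),
                 stringsAsFactors = FALSE)

# The prototype data frame declares the name and type of each captured group
proto <- data.frame(num = integer(), txt = character(),
                    stringsAsFactors = FALSE)

# First group: leading digits; second group: the non-digits that follow
res <- strcapture("(\\d+)(\\D+)", df$V1, proto = proto)
res
```

Binding the result back with cbind(df, res) would keep the original column alongside the new ones.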
For the data.table approach, we can use := to create the new columns so that the existing columns remain in the output, i.e.
setDT(df)[,paste0('V',2:3):=tstrsplit(V1,'(?<=\\d)(?=\\D)',
perl=TRUE, type.convert=TRUE)]
# V1 V2 V3
#1: 131341adad 131341 adad
#2: 45365adadar 45365 adadar
#3: 425cavsbsb 425 cavsbsb
#4: 46567567daadvsv 46567567 daadvsv
NOTE: Both solutions have an option to convert the class of the split columns (type.convert in tstrsplit, convert in extract).
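To illustrate what that conversion does, here is a small sketch using base R's utils::type.convert directly on already-split pieces. The vectors `num_part` and `txt_part` are hypothetical stand-ins for the split columns, and whether the packages call this exact function internally is not something the answer states:

```r
# Hypothetical split results, still stored as character strings
num_part <- c("131341", "45365", "425", "46567567")
txt_part <- c("adad", "adadar", "cavsbsb", "daadvsv")

# as.is = TRUE keeps text as character instead of converting it to factor
converted_num <- type.convert(num_part, as.is = TRUE)
converted_txt <- type.convert(txt_part, as.is = TRUE)

class(converted_num)
class(converted_txt)
```

Without conversion, both columns would remain character, so numeric operations on the first column would need an explicit as.integer or as.numeric.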
# data used in the examples above
df <- data.frame(V1 = c("131341adad", "45365adadar", "425cavsbsb",
"46567567daadvsv"))
And another way with base-R and regular expressions:
all <- c(' 183746IGH','105928759UBS')
numeric <- sapply(all, function(x) sub('[[:alpha:]]+','', x))
alphabetic <- sapply(all, function(x) sub('[[:digit:]]+','', x))
> data.frame(all,alphabetic,numeric)
all alphabetic numeric
183746IGH 183746IGH IGH 183746
105928759UBS 105928759UBS UBS 105928759
Or as per @rawr's comment below:
> read.table(text = gsub('(\\d)(\\D)', '\\1 \\2', all))
V1 V2
1 183746 IGH
2 105928759 UBS
Or a vectorised version of the above with a function:
get_alphanum <- function(x, type) {
type <- switch(type,
alpha = '[[:digit:]]+',
digit = '[[:alpha:]]+')
sub(type,'', x)
}
get_alphanum <- Vectorize(get_alphanum)
Which gives a result applied directly on a vector!
> get_alphanum(all, type='alpha')
183746IGH 105928759UBS
" IGH" "UBS"
> get_alphanum(all, type='digit')
183746IGH 105928759UBS
" 183746" "105928759"
which can also be used to create a data.frame:
> data.frame(all,
alpha=get_alphanum(all, type='alpha') ,
numeric=get_alphanum(all, type='digit'))
all alpha numeric
183746IGH 183746IGH IGH 183746
105928759UBS 105928759UBS UBS 105928759
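Since sub is itself vectorised over its input, a variant without Vectorize may be sufficient. The sketch below uses a hypothetical helper name, `split_alphanum`, and a made-up input vector `ids` (named that way to avoid masking base R's all function); match.arg is used so that an unsupported `type` raises an error instead of silently falling through:

```r
# Hypothetical helper; not part of the original answer
split_alphanum <- function(x, type = c("alpha", "digit")) {
  type <- match.arg(type)
  pattern <- switch(type,
                    alpha = "[[:digit:]]+",  # drop the digits, keep the letters
                    digit = "[[:alpha:]]+")  # drop the letters, keep the digits
  # sub() already works element-wise on a character vector
  sub(pattern, "", x)
}

ids <- c("183746IGH", "105928759UBS")

result <- data.frame(all     = ids,
                     alpha   = split_alphanum(ids, type = "alpha"),
                     numeric = as.numeric(split_alphanum(ids, type = "digit")),
                     stringsAsFactors = FALSE)
result
```

Unlike the Vectorize version, the returned vectors carry no names, and the numeric column is converted explicitly so it can be used in arithmetic.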