I'd like to separate column values using tidyr::separate and a regular expression, but I am new to regular expressions.
df <- data.frame(A=c("enc0","enc10","enc25","enc100","harab0","harab25","harab100","requi0","requi25","requi100"), stringsAsFactors=FALSE)
This is what I've tried
library(tidyr)
df %>%
separate(A, c("name","value"), sep="[a-z]+")
Bad Output
name value
1 0
2 10
3 25
4 100
5 0
# etc
How do I keep the name column as well?
To split a column into multiple columns in R, use the separate() function from the tidyr package. separate() splits a character column into multiple columns using either a regular expression or numeric positions.
The base R split() function does something different: it divides a vector or data frame into groups defined by a factor. Use unsplit() to reverse that operation and recover the original vector or data frame.
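To make the difference from separate() concrete, here is a minimal sketch of split() and unsplit() on the question's data. The grouping key grp is a hypothetical helper introduced only for this illustration; it is not part of the original question.

```r
# Same sample data as in the question
df <- data.frame(A = c("enc0", "enc10", "enc25", "enc100",
                       "harab0", "harab25", "harab100",
                       "requi0", "requi25", "requi100"),
                 stringsAsFactors = FALSE)

# Hypothetical grouping key: the letters left after dropping trailing digits
grp <- sub("[0-9]+$", "", df$A)

# split() groups the values by key; it does not split the strings themselves
pieces <- split(df$A, grp)

# unsplit() puts the grouped values back in their original order
restored <- unsplit(pieces, grp)
```

So split() is useful for grouping rows, but it will not turn "enc10" into "enc" and "10"; that is what separate() or extract() is for.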
You may use a (?<=[a-z])(?=[0-9]) lookaround-based regex with tidyr::separate:
> tidyr::separate(df, A, into = c("name", "value"), "(?<=[a-z])(?=[0-9])")
name value
1 enc 0
2 enc 10
3 enc 25
4 enc 100
5 harab 0
6 harab 25
7 harab 100
8 requi 0
9 requi 25
10 requi 100
The (?<=[a-z])(?=[0-9]) pattern matches a location in the string right between a lowercase ASCII letter ((?<=[a-z])) and a digit ((?=[0-9])). The (?<=...) is a positive lookbehind that requires the presence of its pattern immediately to the left of the current location, and (?=...) is a positive lookahead that requires the presence of its pattern immediately to the right of the current location. Because lookarounds consume no characters, the letters and digits are kept intact when splitting.
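The zero-width nature of the match can be checked in base R. This is a minimal sketch that assumes every string is lowercase letters followed by digits, as in the question. The vector x and the variables pos and len are illustrative names only.

```r
x <- c("enc0", "harab25", "requi100")

# perl = TRUE is required because lookarounds are PCRE syntax
m <- regexpr("(?<=[a-z])(?=[0-9])", x, perl = TRUE)

pos <- as.integer(m)            # index of the first digit in each string
len <- attr(m, "match.length")  # zero everywhere: the match consumes no characters

# Cut each string at the matched location
name  <- substr(x, 1L, pos - 1L)
value <- substr(x, pos, nchar(x))
```

Since the match has length zero, nothing is removed at the split point, which is why separate() keeps both the letters and the digits.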
Alternatively, you may use extract:
extract(df, A, into = c("name", "value"), "^([a-z]+)(\\d+)$")
Output:
name value
1 enc 0
2 enc 10
3 enc 25
4 enc 100
5 harab 0
6 harab 25
7 harab 100
8 requi 0
9 requi 25
10 requi 100
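In the output above, both columns come back as character. If value should be numeric, one option is the convert argument of tidyr::extract. The sketch below uses a shortened copy of the question's data and the illustrative name res.

```r
library(tidyr)

df <- data.frame(A = c("enc0", "enc10", "harab25", "requi100"),
                 stringsAsFactors = FALSE)

# convert = TRUE runs type.convert() on the new columns,
# so the all-digit "value" column becomes integer
res <- tidyr::extract(df, A, into = c("name", "value"),
                      regex = "^([a-z]+)(\\d+)$", convert = TRUE)

str(res)
```

separate() accepts the same convert argument, so this applies to the lookaround approach as well.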
The ^([a-z]+)(\\d+)$ pattern matches:
^ - start of string
([a-z]+) - Capturing group 1 (column name): one or more lowercase ASCII letters
(\\d+) - Capturing group 2 (column value): one or more digits
$ - end of string.
For a bare R version without a lookaround-based regex, define the regular expression first:
> re <- "[a-zA-Z][0-9]"
Then use two substr() calls to return the two components: everything up to and including the first character of the match (the last letter), and everything after it.
> with(df,
data.frame(name=substr(A, 1L, regexpr(re, A)),
value=substr(A, regexpr(re, A) + 1L, 1000L))
)
name value
1 enc 0
2 enc 10
3 enc 25
4 enc 100
5 harab 0
6 harab 25
7 harab 100
8 requi 0
9 requi 25
10 requi 100
The regex here looks for the pattern "any alpha" [a-zA-Z] followed by "any numeric" [0-9]. I believe this is what the reshape command does if the sep argument is specified as "".
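To see why the substr() boundaries work, it helps to look at what regexpr() returns. This is a minimal sketch on three of the question's values. It uses nchar(A) as the end position instead of the 1000L constant, which is a substitution made here for illustration and not the answer's original code.

```r
A  <- c("enc0", "harab25", "requi100")
re <- "[a-zA-Z][0-9]"

# regexpr() returns where the two-character match starts,
# i.e. the index of the last letter before the digits
pos <- as.integer(regexpr(re, A))

name  <- substr(A, 1L, pos)             # up to and including the last letter
value <- substr(A, pos + 1L, nchar(A))  # everything after it
```

Using nchar(A) avoids relying on the strings being shorter than an arbitrary upper bound.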
If you really want to do it with separate (I don't see the point), you can add one more step: insert a delimiter between the two parts first (using the same regex as @WiktorStribiżew), then split on it:
library(dplyr) # provides mutate()
df %>%
mutate(A = gsub('^([a-z]+)(\\d+)$', '\\1_\\2', A)) %>%
separate(A, into = c('name', 'value'), sep = '_')