I am trying to separate numbers and characters in a column of strings. So far I have been using tidyr::separate
for doing this, but am encountering errors for "unusual" cases.
Suppose I have the following data
df <- data.frame(c1 = c("5.5K", "2M", "3.1", "M"))
And I want to obtain a data frame with columns
data.frame(c2 = c("5.5", "2", "3.1", NA),
c3 = c("K", "M", NA, "M"))
So far I have been using tidyr::separate
df %>%
separate(c1, into =c("c2", "c3"), sep = "(?<=[0-9])(?=[A-Za-z])")
But this only works for the first three cases. I realize this is because the lookbehind (?<=[0-9]) requires a digit immediately before the split point, so a string with no digits, such as "M", is never split. How would one modify this code to handle the cases where the number is missing before the letters? I have been trying to use the extract
function too, but without success.
Edit: I suppose one solution is to break this up into
df$c2 <- as.numeric(str_extract(df$c1, "[0-9.]+"))
df$c3 <- str_extract(df$c1, "[A-Za-z]+")
But I was curious whether there were other ways to handle it.
In Python, use the re.split() method to split a string into text and numbers, e.g. my_list = re.split(r'(\d+)', my_str).
The split() method splits a string into an array of substrings. The split() method returns the new array. The split() method does not change the original string. If (" ") is used as separator, the string is split between words.
The split() method splits a String into multiple Strings given the delimiter that separates them. The returned object is an array which contains the split Strings. We can also pass a limit to the number of elements in the returned array.
extract(df, c1, into =c("c2", "c3"), "([\\.\\d]*)([a-zA-Z]*)")
# c2 c3
# 1 5.5 K
# 2 2 M
# 3 3.1
# 4 M
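Note that extract leaves empty strings rather than NA where a group matches nothing. A minimal sketch of one way to get the NA values the question asks for, assuming dplyr is also available (the name out is only for illustration):

```r
library(tidyr)
library(dplyr)

df <- data.frame(c1 = c("5.5K", "2M", "3.1", "M"))

out <- df %>%
  # group 1: optional run of digits and dots; group 2: optional run of letters
  extract(c1, into = c("c2", "c3"), regex = "([\\d.]*)([A-Za-z]*)") %>%
  # turn the empty strings left by unmatched groups into NA
  mutate(across(c(c2, c3), ~ na_if(.x, "")))

print(out)
```

If a numeric column is wanted, as.numeric(out$c2) can be applied afterwards.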
You can use separate simply in this way, though there may be a more elegant method:
df %>% separate(c1, into =c("c2", "c3"), sep = "(?=[A-Za-z])")
# c2 c3
# 1 5.5 K
# 2 2 M
# 3 3.1 <NA>
# 4 M
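For comparison, the same split can be sketched in base R without tidyr. This assumes every value has the shape "optional number followed by optional letters"; the names num, unit and res are only for illustration.

```r
df <- data.frame(c1 = c("5.5K", "2M", "3.1", "M"))

# strip the trailing letters to keep the numeric part
num <- sub("[A-Za-z]*$", "", df$c1)
# strip the leading digits and dots to keep the letter part
unit <- sub("^[0-9.]*", "", df$c1)

# an empty result means that part was missing, so mark it as NA
num[num == ""] <- NA
unit[unit == ""] <- NA

res <- data.frame(c2 = num, c3 = unit)
print(res)
```

Both columns stay character here, matching the desired output in the question.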