
Splitting strings into number and string (with missings)

I am trying to separate numbers and characters in a column of strings. So far I have been using tidyr::separate for doing this, but am encountering errors for "unusual" cases.

Suppose I have the following data

df <- data.frame(c1 = c("5.5K", "2M", "3.1", "M"))

And I want to obtain a data frame with columns

data.frame(c2 = c("5.5", "2", "3.1", NA),
           c3 = c("K", "M", NA, "M"))

So far I have been using tidyr::separate

library(tidyr)  # provides separate() and re-exports %>%

df %>%
  separate(c1, into = c("c2", "c3"), sep = "(?<=[0-9])(?=[A-Za-z])")

But this only works for the first three cases. I realize this is because the lookbehind (?<=...) and the lookahead (?=...) both have to match, so a string with no digit before the letters is never split. How would one modify this code to handle the cases where the number is missing before the letters? I have also been trying the extract function, but without success.
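
A small base R check illustrates the point about the lookarounds. This is only a sketch: grepl() with perl = TRUE stands in for the matching that separate() does, and the names both_sides and letter_only are made up for the illustration.

```r
df <- data.frame(c1 = c("5.5K", "2M", "3.1", "M"), stringsAsFactors = FALSE)

# Is there a position with a digit before it AND a letter after it?
both_sides <- grepl("(?<=[0-9])(?=[A-Za-z])", df$c1, perl = TRUE)

# Is there a position with a letter after it (no condition on what precedes)?
letter_only <- grepl("(?=[A-Za-z])", df$c1, perl = TRUE)

print(data.frame(c1 = df$c1, both_sides, letter_only))
```

For "M" the first pattern finds no split point at all, while the lookahead-only pattern does.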

Edit: I suppose one solution is to break this up into

library(stringr)

df$col2 <- as.numeric(str_extract(df$c1, "[0-9.]+"))  # digits and decimal point
df$col3 <- str_extract(df$c1, "[A-Za-z]+")            # letters only

But I was curious whether there were other ways to handle it.

user11151932 asked Apr 16 '19 03:04




1 Answer

extract(df, c1, into =c("c2", "c3"), "([\\.\\d]*)([a-zA-Z]*)")
#    c2 c3
# 1 5.5  K
# 2   2  M
# 3 3.1   
# 4      M

You can also use separate with a lookahead only, though there is probably a more elegant method.

df %>% separate(c1, into =c("c2", "c3"), sep = "(?=[A-Za-z])")
#    c2   c3
# 1 5.5    K
# 2   2    M
# 3 3.1 <NA>
# 4        M
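
Both outputs above leave an empty string where a piece is missing, whereas the question asks for NA. Here is a minimal base R sketch of one way to get there, assuming the same two-group idea as the extract() call. The names pattern and out are illustrative only, and sub() is used in place of tidyr so that no packages are needed.

```r
df <- data.frame(c1 = c("5.5K", "2M", "3.1", "M"), stringsAsFactors = FALSE)

# One anchored pattern with two optional capture groups:
# group 1 = digits and dots, group 2 = letters.
pattern <- "^([0-9.]*)([A-Za-z]*)$"

c2 <- sub(pattern, "\\1", df$c1)  # keep only the numeric part
c3 <- sub(pattern, "\\2", df$c1)  # keep only the letter part

# A group that matched nothing comes back as "", so turn those into NA.
c2[c2 == ""] <- NA
c3[c3 == ""] <- NA

out <- data.frame(c2 = c2, c3 = c3, stringsAsFactors = FALSE)
print(out)
```

If dplyr is loaded, the empty-string step could instead be written with na_if(), for example na_if(c2, "").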
VicaYang answered Oct 13 '22 20:10
