
Extract numbers from strings including '|'

I have data where some of the items are numbers separated by "|", like:

head(mintimes)
[1] "3121|3151" "1171"      "1351|1381" "1050"      ""          "122" 
head(minvalues)
[1] 14    10    11    31 Inf    22

What I would like to do is extract all the times and match them to the minvalues, to end up with something like:

times    values
3121     14
3151     14
1171     10
1351     11
1381     11
1050     31
122      22

I've tried to strsplit(mintimes, "|") and I've tried str_extract(mintimes, "[0-9]+") but they don't seem to work. Any ideas?

ThatGuy asked Jun 17 '14 00:06


4 Answers

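| is a regular expression metacharacter (alternation). An unescaped "|" therefore matches the empty string, so strsplit(mintimes, "|") breaks each string into single characters instead of splitting at the bar. When used literally, these special characters need to be escaped either with [] or with \\ (or you could use fixed = TRUE in some functions). So your call to strsplit() should be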

strsplit(mintimes, "[|]")

or

strsplit(mintimes, "\\|")

or

strsplit(mintimes, "|", fixed = TRUE)
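
To see the difference for yourself, here is a small base R sketch. The vector `x` and the names `by_class`, `by_escape` and `by_fixed` are made up for illustration; they are not from the question.

```r
x <- c("3121|3151", "1171")

# Unescaped "|" matches the empty string, so the string is cut
# between every character and the bar itself is kept as a piece.
strsplit(x[1], "|")[[1]]
# [1] "3" "1" "2" "1" "|" "3" "1" "5" "1"

# The three literal forms should all give the same list.
by_class  <- strsplit(x, "[|]")
by_escape <- strsplit(x, "\\|")
by_fixed  <- strsplit(x, "|", fixed = TRUE)
identical(by_class, by_escape) && identical(by_escape, by_fixed)
# [1] TRUE
```

If the separator is always a plain string, fixed = TRUE is probably the least error-prone choice, since nothing in the pattern gets interpreted.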

Regarding your other try with stringr functions: str_extract() returns only the first match in each string, whereas str_extract_all() returns every match, so it seems to do the trick.

library(stringr)
str_extract_all(mintimes, "[0-9]+")
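
If you would rather not load stringr, a rough base R counterpart is regmatches() with gregexpr(). This is a sketch, not a drop-in replacement; the names `all_matches` and `first_only` are just for illustration.

```r
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")

# gregexpr() finds every run of digits; regmatches() pulls them out
# as a list with one character vector per input string.
all_matches <- regmatches(mintimes, gregexpr("[0-9]+", mintimes))
unlist(all_matches)
# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"

# regexpr() stops at the first match per string, which is why
# extracting only once loses "3151" and "1381".
first_only <- regmatches(mintimes, regexpr("[0-9]+", mintimes))
first_only
# [1] "3121" "1171" "1351" "1050" "122"
```

Note that the empty string contributes a zero-length element to `all_matches`, so the list still has one entry per input.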

To get your desired result,

mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)
# Split at the literal bar; the empty string gives a zero-length vector
s <- strsplit(mintimes, "[|]")
# Repeat each value once per extracted time (zero times for "")
data.frame(times = as.numeric(unlist(s)),
           values = rep(minvalues, sapply(s, length)))
#   times values
# 1  3121     14
# 2  3151     14
# 3  1171     10
# 4  1351     11
# 5  1381     11
# 6  1050     31
# 7   122     22
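
In case it is not obvious why the empty string and its Inf value disappear, here is the intermediate step spelled out. It is a small sketch on the same data; `n` and `values` are illustrative names.

```r
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)

s <- strsplit(mintimes, "|", fixed = TRUE)

# Number of pieces per input; "" splits into nothing, so its count is 0.
n <- vapply(s, length, integer(1))
n
# [1] 2 1 2 1 0 1

# rep() with a vector of counts repeats each value that many times,
# so the Inf paired with "" is repeated zero times and drops out.
values <- rep(minvalues, n)
values
# [1] 14 14 10 11 11 31 22
```

Because the counts come from the same list that unlist() flattens, the two columns should always line up.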

Rich Scriven answered Oct 11 '22 02:10


By default strsplit splits using a regular expression and "|" is a special character in the regular expression syntax. You can either escape it

strsplit(mintimes,"\\|")

or just set fixed=TRUE so the pattern is treated as a literal string rather than a regular expression

strsplit(mintimes,"|", fixed=TRUE)

MrFlick answered Oct 11 '22 02:10


I have written a function called cSplit that is useful for these types of things. You can get it from my Gist: https://gist.github.com/mrdwab/11380733

Usage would be (with data.table loaded and the function from the Gist sourced):

cSplit(data.table(mintimes, minvalues), "mintimes", "|", "long")
#    mintimes minvalues
# 1:     3121        14
# 2:     3151        14
# 3:     1171        10
# 4:     1351        11
# 5:     1381        11
# 6:     1050        31
# 7:      122        22

It also has a "wide" setting, in case that would be at all useful to you:

cSplit(data.table(mintimes, minvalues), "mintimes", "|", "wide")
#    minvalues mintimes_1 mintimes_2
# 1:        14       3121       3151
# 2:        10       1171         NA
# 3:        11       1351       1381
# 4:        31       1050         NA
# 5:       Inf         NA         NA
# 6:        22        122         NA

Note: The output is a data.table.
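
For comparison, the "wide" layout can be approximated in base R by padding each split with NA. To be clear, this is only a sketch of the idea and not how cSplit works internally; `width`, `padded` and `wide` are illustrative names, and the result is a plain data.frame.

```r
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)

s <- strsplit(mintimes, "|", fixed = TRUE)

# Widest row decides how many mintimes_* columns are needed.
width <- max(vapply(s, length, integer(1)))

# Indexing past the end of a vector yields NA, which pads short rows.
padded <- do.call(rbind, lapply(s, function(p) as.numeric(p)[seq_len(width)]))
colnames(padded) <- paste0("mintimes_", seq_len(width))

wide <- data.frame(minvalues, padded)
wide
```

Unlike the long form, every input row is kept here, including the one with the empty string.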

A5C1D2H2I1M1N2O1R2T1 answered Oct 11 '22 01:10


As others have mentioned, you need to escape the | to include it literally in a regular expression. As always, we can skin this cat many ways, and here's one way to do it with stringr:

x <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")

library(stringr)
unlist(str_extract_all(x, "\\d+"))

# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"

This won't work as expected if you have any decimal points in a character string of numbers, because \\d+ matches digits only and would split a value at the decimal point. The following (which says to match anything but |) might be safer:

unlist(str_extract_all(x, '[^|]+'))

# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122" 

Either way, you might want to wrap the result in as.numeric.
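
Here is a quick illustration of the decimal issue, written with base R so it runs without stringr. The input `y` is invented for the example (the question's data has no decimals), and `digits_only` and `not_pipe` are illustrative names.

```r
# Hypothetical input containing a decimal value
y <- c("12.5|3", "7")

# Digits-only pattern: the decimal point ends the match, so 12.5
# comes back as two separate pieces.
digits_only <- unlist(regmatches(y, gregexpr("\\d+", y)))
digits_only
# [1] "12" "5"  "3"  "7"

# "Anything but |" keeps each field whole.
not_pipe <- unlist(regmatches(y, gregexpr("[^|]+", y)))
not_pipe
# [1] "12.5" "3"    "7"

as.numeric(not_pipe)
# [1] 12.5  3.0  7.0
```

The second pattern would also let through stray non-numeric text, which as.numeric() turns into NA with a warning, so it is worth checking the result for NA.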

jbaums answered Oct 11 '22 03:10