Using regular expressions in R to grab numbers from a string

Tags: regex, r

Regular expressions are something I've always struggled with and never spent enough time learning. In this case, I have an R character vector of strings with baseball data in this format:

hit_vector = c("",
  "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>.",
  "Ball was hit at <b>67 mph</b>.",
  "",
  "Ball left the bat at <b>107 mph</b> and traveled a distance of <b>412 feet</b>.",
  "Batted ball speed <b>71 mph</b>.",
  "Ball left the bat at <b>94 mph</b> and traveled a distance of <b>287 feet</b>.",
  "", "",
  "Batted ball speed <b>64 mph</b>.")

> hit_vector
 [1] ""                                                                                                       
 [2] "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>."
 [3] "Ball was hit at <b>67 mph</b>."                                                                         
 [4] ""                                                                                                       
 [5] "Ball left the bat at <b>107 mph</b> and traveled a distance of <b>412 feet</b>."                        
 [6] "Batted ball speed <b>71 mph</b>."                                                                       
 [7] "Ball left the bat at <b>94 mph</b> and traveled a distance of <b>287 feet</b>."                         
 [8] ""                                                                                                       
 [9] ""                                                                                                       
[10] "Batted ball speed <b>64 mph</b>."  

I am trying to create a dataframe with 10 rows that looks like this:

hit_dataframe
    speed   distance   degrees
1.     NA         NA        NA
2.    104        381        38
3.     67         NA        NA
4.     NA         NA        NA
5.    107        412        NA
6.     71         NA        NA
7.     94        287        NA
8.     NA         NA        NA
9.     NA         NA        NA
10.    64         NA        NA

The full hit_vector is much longer, but all of its strings seem to follow this format.

Edit: It looks like the following helps to identify some of the info, but these lines aren't working perfectly (the third line returns all FALSE, which isn't right):

grepl("[0-9]{1,3} mph", hit_vector)
grepl("[0-9]{1,3} feet", hit_vector)
grepl("[0-9]{1,3} degrees", hit_vector)

Edit2: I'm not sure how many digits each stat will be. For example mph could be over 100 (3 digits) and also less than 10 (1 digit).
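
On the digit count, a minimal sketch: the `+` quantifier matches one or more digits, so the pattern does not need a fixed width such as `{1,3}`. The `speeds` strings below are made up for illustration and are not taken from the data above.

```r
# Illustrative strings (assumed, not from hit_vector) with 1-, 2- and 3-digit speeds
speeds <- c("<b>7 mph</b>", "<b>67 mph</b>", "<b>104 mph</b>", "")

# "[0-9]+" means "one or more digits", whatever the count
has_speed <- grepl("[0-9]+ mph", speeds)
print(has_speed)
```

The empty string is the only element that should come back FALSE.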

Canovice asked Mar 26 '18 21:03

1 Answer

Using base R:

read.table(text = gsub("\\D+", " ", hit_vector), fill = TRUE, blank.lines.skip = FALSE)

    V1  V2 V3
1   NA  NA NA
2  104 381 38
3   67  NA NA
4   NA  NA NA
5  107 412 NA
6   71  NA NA
7   94 287 NA
8   NA  NA NA
9   NA  NA NA
10  64  NA NA

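Here, gsub() replaces every run of non-digit characters (\\D+) with a single space. read.table() then reads the result with fill = TRUE, so short rows are padded with NA, and with blank.lines.skip = FALSE, so the empty strings are kept as rows.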
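
To make the two steps visible, here is a small sketch that runs them separately on the first three strings from the question. The names `sample_hits`, `digits_only` and `parsed` are introduced here for illustration only.

```r
# First three elements of the question's hit_vector
sample_hits <- c(
  "",
  "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>.",
  "Ball was hit at <b>67 mph</b>."
)

# Step 1: collapse every run of non-digit characters into a single space,
# giving "", " 104 381 38 " and " 67 "
digits_only <- gsub("\\D+", " ", sample_hits)

# Step 2: read the space-separated numbers; fill = TRUE pads short rows with NA,
# blank.lines.skip = FALSE keeps the empty string as a row
parsed <- read.table(text = digits_only, fill = TRUE, blank.lines.skip = FALSE)
print(parsed)
```

Note that the columns are assigned purely by position, so this only lines up when every string lists its measurements in the same order starting with speed.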

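To take the comment below into account (a string may report a distance without a speed, as in the extra element added here), the numbers can no longer be matched to columns by position, so each number has to be kept together with its unit: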

hit_vector1=c(hit_vector,"traveled a distance of <b>412 feet</b>.")

# Keep each number together with its unit, e.g. " 104 mph 381 feet 38 degrees</b>."
a = gsub(".*?(\\d+).*?(mph|feet|degree).*?", " \\1 \\2", hit_vector1)

# Remove the trailing "</b>."
b = sub("</b>\\.", "", a)

# For every unit, append "NA <unit>" to each element that does not mention it.
# A logical index is used so this still works when no element contains the unit.
for (unit in c("mph", "feet", "degrees")) {
  missing_unit = !grepl(unit, b)
  b[missing_unit] = paste(b[missing_unit], NA, unit)
}

## Break the line after each measurement and read in a table format
e=gsub("([a-z])\\s","\\1\n",b)
unstack(read.table(text=e))
   degrees feet mph
1       NA   NA  NA
2       38  381 104
3       NA   NA  67
4       NA   NA  NA
5       NA  412 107
6       NA   NA  71
7       NA  287  94
8       NA   NA  NA
9       NA   NA  NA
10      NA   NA  64
11      NA  412  NA
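
An alternative sketch, under the assumption that each number is immediately followed by a space and its unit: look up each unit by name with `regexpr()` and `regmatches()`, which gives named columns directly. `extract_stat` and `hits` are hypothetical names introduced here, not part of the approach above.

```r
# Hypothetical helper: return the integer that directly precedes `unit`
# in each string, or NA when the unit does not appear.
extract_stat <- function(x, unit) {
  pattern <- paste0("[0-9]+(?= ", unit, ")")  # digits followed by " <unit>"
  m <- regexpr(pattern, x, perl = TRUE)       # -1 where there is no match
  out <- rep(NA_integer_, length(x))
  out[m != -1] <- as.integer(regmatches(x, m))  # regmatches returns matches only
  out
}

# A few strings in the question's format, plus the distance-only case
hits <- c(
  "",
  "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>.",
  "Ball was hit at <b>67 mph</b>.",
  "traveled a distance of <b>412 feet</b>."
)

hit_dataframe <- data.frame(
  speed    = extract_stat(hits, "mph"),
  distance = extract_stat(hits, "feet"),
  degrees  = extract_stat(hits, "degrees")
)
print(hit_dataframe)
```

Because each column is matched on its own unit, a string with only a distance fills the distance column and leaves speed as NA.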

KU99 answered Oct 23 '22 06:10