Using regular expressions in R to grab numbers from a string

Tags:

regex

r

So regular expressions are something that I've always struggled a bit with / never spent the due time learning. In this case, I have an R vector of strings with baseball data in this format:

hit_vector = c("", "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>.", 
"Ball was hit at <b>67 mph</b>.", "", "Ball left the bat at <b>107 mph</b> and traveled a distance of <b>412 feet</b>.", 
"Batted ball speed <b>71 mph</b>.", "Ball left the bat at <b>94 mph</b> and traveled a distance of <b>287 feet</b>.", 
"", "", "Batted ball speed <b>64 mph</b>.")  

> hit_vector
 [1] ""                                                                                                       
 [2] "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>."
 [3] "Ball was hit at <b>67 mph</b>."                                                                         
 [4] ""                                                                                                       
 [5] "Ball left the bat at <b>107 mph</b> and traveled a distance of <b>412 feet</b>."                        
 [6] "Batted ball speed <b>71 mph</b>."                                                                       
 [7] "Ball left the bat at <b>94 mph</b> and traveled a distance of <b>287 feet</b>."                         
 [8] ""                                                                                                       
 [9] ""                                                                                                       
[10] "Batted ball speed <b>64 mph</b>."

I am trying to create a dataframe with 10 rows that looks like this:

hit_dataframe
    speed   distance   degrees
1.     NA         NA        NA
2.    104        381        38
3.     67         NA        NA
4.     NA         NA        NA
5.    107        412        NA
6.     71         NA        NA
7.     94        287        NA
8.     NA         NA        NA
9.     NA         NA        NA
10.    64         NA        NA

The entire hit_vector is much much longer, but it seems that they all follow this naming convention.

Edit: It looks like the following helps to identify some of the info, but these lines aren't working perfectly (the third line returns all FALSE, which isn't right):

grepl("[0-9]{1,3} mph", hit_vector)
grepl("[0-9]{1,3} feet", hit_vector)
grepl("[0-9]{1,3} degrees", hit_vector)

Edit2: I'm not sure how many digits each stat will be. For example mph could be over 100 (3 digits) and also less than 10 (1 digit).


asked Mar 26 '18 21:03

Canovice

1 Answer

Using base R:

read.table(text = gsub("\\D+", " ", hit_vector), fill = TRUE, blank.lines.skip = FALSE)

    V1  V2 V3
1   NA  NA NA
2  104 381 38
3   67  NA NA
4   NA  NA NA
5  107 412 NA
6   71  NA NA
7   94 287 NA
8   NA  NA NA
9   NA  NA NA
10  64  NA NA

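Here, gsub replaces every run of non-digit characters (\\D+) with a single space, so only the numbers remain. read.table then parses them with fill = TRUE, which pads short rows with NA, and blank.lines.skip = FALSE, which keeps the empty strings as all-NA rows.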
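To make the two steps easier to follow, here is a minimal sketch that prints the intermediate result. The three-element vector x is an assumed sample in the question's format, and digits_only and res are names introduced only for this illustration.

```r
# Assumed sample input: an empty string plus two strings in the question's format
x <- c("",
       "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>.",
       "Ball was hit at <b>67 mph</b>.")

# Step 1: every run of non-digit characters collapses to a single space
digits_only <- gsub("\\D+", " ", x)
print(digits_only)

# Step 2: fill = TRUE pads short rows with NA;
# blank.lines.skip = FALSE keeps the empty string as a row of its own
res <- read.table(text = digits_only, fill = TRUE, blank.lines.skip = FALSE)
print(res)
```

Note that the columns are purely positional: a string that reports only a distance would put that number in the first column, which is the case the second approach below is meant to handle.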

To account for the comment you made below, each number has to stay attached to its unit, and the data then has to be rearranged:

hit_vector1 <- c(hit_vector, "traveled a distance of <b>412 feet</b>.")

# Keep each number together with its unit of measurement.
a <- gsub(".*?(\\d+).*?(mph|feet|degree).*?", " \\1 \\2", hit_vector1)

# Remove the trailing "</b>." left after the last measurement.
b <- sub("<[/]b>.", "", a)

# For every element that lacks a measurement, append "NA <unit>".
# Logical indexing also works when no element at all contains the unit,
# where a negated grep() index would silently select nothing.
for (unit in c("mph", "feet", "degrees")) {
  missing_unit <- !grepl(unit, b)
  b[missing_unit] <- paste(b[missing_unit], NA, unit)
}

# Put each "value unit" pair on its own line, read the lines as a table,
# and unstack the values by unit.
e <- gsub("([a-z])\\s", "\\1\n", b)
unstack(read.table(text = e))
   degrees feet mph
1       NA   NA  NA
2       38  381 104
3       NA   NA  67
4       NA   NA  NA
5       NA  412 107
6       NA   NA  71
7       NA  287  94
8       NA   NA  NA
9       NA   NA  NA
10      NA   NA  64
11      NA  412  NA
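
The same idea, pairing each number with its unit, can also be sketched one unit at a time with base R's regexpr and regmatches. This is an alternative sketch, not the method used in the answer above: extract_stat and hit_sample are hypothetical names introduced for illustration, and the column names are taken from the desired output in the question.

```r
# Hypothetical helper (not from the original answer): return the number that
# directly precedes " <unit>" in each string, or NA when the unit is absent.
extract_stat <- function(x, unit) {
  pattern <- paste0("\\d+(?= ", unit, ")")   # digits followed by a space and the unit
  m <- regexpr(pattern, x, perl = TRUE)      # -1 where a string has no match
  found <- as.vector(m) != -1
  out <- rep(NA_integer_, length(x))
  out[found] <- as.integer(regmatches(x, m)) # regmatches() returns the matches only
  out
}

# Assumed sample input in the question's format, including a distance-only string
hit_sample <- c("",
                "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>.",
                "Ball was hit at <b>67 mph</b>.",
                "traveled a distance of <b>412 feet</b>.")

hit_dataframe <- data.frame(
  speed    = extract_stat(hit_sample, "mph"),
  distance = extract_stat(hit_sample, "feet"),
  degrees  = extract_stat(hit_sample, "degrees")
)
print(hit_dataframe)
```

Because each column is matched by its own unit, the number of digits does not matter and a missing measurement simply stays NA in that column.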


answered Oct 23 '22 06:10

KU99
