Regular expressions are something I've always struggled with and never spent enough time learning. In this case, I have an R vector of strings with baseball data in this format:
hit_vector = c("", "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>.",
"Ball was hit at <b>67 mph</b>.", "", "Ball left the bat at <b>107 mph</b> and traveled a distance of <b>412 feet</b>.",
"Batted ball speed <b>71 mph</b>.", "Ball left the bat at <b>94 mph</b> and traveled a distance of <b>287 feet</b>.",
"", "", "Batted ball speed <b>64 mph</b>.")
> hit_vector
[1] ""
[2] "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>."
[3] "Ball was hit at <b>67 mph</b>."
[4] ""
[5] "Ball left the bat at <b>107 mph</b> and traveled a distance of <b>412 feet</b>."
[6] "Batted ball speed <b>71 mph</b>."
[7] "Ball left the bat at <b>94 mph</b> and traveled a distance of <b>287 feet</b>."
[8] ""
[9] ""
[10] "Batted ball speed <b>64 mph</b>."
I am trying to create a dataframe with 10 rows that looks like this:
hit_dataframe
speed distance degrees
1. NA NA NA
2. 104 381 38
3. 67 NA NA
4. NA NA NA
5. 107 412 NA
6. 71 NA NA
7. 94 287 NA
8. NA NA NA
9. NA NA NA
10. 64 NA NA
The entire hit_vector is much much longer, but it seems that they all follow this naming convention.
Edit: It looks like the following helps to identify some of the info, but these lines aren't working perfectly (the third line returns all FALSE, which isn't right):
grepl("[0-9]{1,3} mph", hit_vector)
grepl("[0-9]{1,3} feet", hit_vector)
grepl("[0-9]{1,3} degrees", hit_vector)
Edit2: I'm not sure how many digits each stat will be. For example mph could be over 100 (3 digits) and also less than 10 (1 digit).
Using base R:
read.table(text = gsub("\\D+", " ", hit_vector), fill = TRUE, blank.lines.skip = FALSE)
V1 V2 V3
1 NA NA NA
2 104 381 38
3 67 NA NA
4 NA NA NA
5 107 412 NA
6 71 NA NA
7 94 287 NA
8 NA NA NA
9 NA NA NA
10 64 NA NA
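Here, gsub() replaces every run of non-digit characters (\\D+) with a single space, so only the numbers remain.
read.table() then reads the result with fill = TRUE, so rows with fewer than three numbers are padded with NA,
and with blank.lines.skip = FALSE, so the empty strings are kept as all-NA rows instead of being dropped.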
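To see what read.table() actually receives, it may help to run the two steps separately. This is only a sketch of the same approach on a shortened copy of the vector; the intermediate name `cleaned` and the column names passed to col.names are my own choices for illustration. Note that the columns are filled purely by position, so this assumes the numbers always appear in the order mph, feet, degrees.

```r
# Shortened copy of the vector from the question
hit_vector <- c(
  "",
  "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>.",
  "Ball was hit at <b>67 mph</b>."
)

# Step 1: collapse every run of non-digits into one space
cleaned <- gsub("\\D+", " ", hit_vector)
print(cleaned)

# Step 2: read the space-separated numbers; col.names is an assumption,
# chosen to match the layout asked for in the question
hit_dataframe <- read.table(
  text = cleaned,
  fill = TRUE,                # pad short rows with NA
  blank.lines.skip = FALSE,   # keep "" as an all-NA row
  col.names = c("speed", "distance", "degrees")
)
print(hit_dataframe)
```

Naming the columns up front also fixes the number of columns at three, rather than leaving read.table() to infer it from the first few lines.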
To take the comment you made below into consideration, we need to rearrange the data so that each number is matched to its unit rather than to its position in the string:
hit_vector1 <- c(hit_vector, "traveled a distance of <b>412 feet</b>.")
# Keep each number together with its unit of measurement.
a <- gsub(".*?(\\d+).*?(mph|feet|degree).*?", " \\1 \\2", hit_vector1)
# Remove the trailing "</b>" and the character that follows it.
b <- sub("<[/]b>.", "", a)
# For any element that lacks a measurement, append an NA for that unit.
fun <- function(x) { y <- -grep(x, b); b <<- replace(b, y, paste(b[y], NA, x)) }
invisible(sapply(c("mph", "feet", "degrees"), fun))
# Break the line after each measurement and read the result as a table.
e <- gsub("([a-z])\\s", "\\1\n", b)
unstack(read.table(text = e))
degrees feet mph
1 NA NA NA
2 38 381 104
3 NA NA 67
4 NA NA NA
5 NA 412 107
6 NA NA 71
7 NA 287 94
8 NA NA NA
9 NA NA NA
10 NA NA 64
11 NA 412 NA
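If the string juggling above feels hard to follow, a different way to get the same unit-aware result is to pull out each statistic on its own with regexpr() and regmatches(). This is a sketch and not the method used above: `extract_stat` is a hypothetical helper name, the example uses a shortened vector, and it assumes every value is a whole number followed by one space and its unit.

```r
# Shortened example, including a string that reports only a distance
hit_vector1 <- c(
  "",
  "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>.",
  "Ball was hit at <b>67 mph</b>.",
  "traveled a distance of <b>412 feet</b>."
)

# Hypothetical helper: for each string, return the integer that directly
# precedes `unit`, or NA when the unit does not appear
extract_stat <- function(x, unit) {
  # Lookahead needs perl = TRUE; [0-9]+ allows any number of digits
  pattern <- paste0("[0-9]+(?= ", unit, ")")
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_integer_, length(x))
  # regmatches() returns values only for the elements that matched
  out[m != -1] <- as.integer(regmatches(x, m))
  out
}

hit_dataframe <- data.frame(
  speed    = extract_stat(hit_vector1, "mph"),
  distance = extract_stat(hit_vector1, "feet"),
  degrees  = extract_stat(hit_vector1, "degrees")
)
print(hit_dataframe)
```

Because each column is looked up by its unit, a string that mentions only feet puts its number in the distance column, and `[0-9]+` does not care whether a value has one digit or three.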