
R equivalent of PHP's preg_match_all

Tags:

regex

r

matrix

I am looking for an R equivalent to PHP's preg_match_all function.

Objective:

  • Search a single string (not a vector of several strings) for a regexp pattern
  • Return a matrix of matches

Example:

Assume the following flat string without delimitation.

"This is a sample string written like a paragraph. In this string two sets of information exist. Each set contains two variables. We want to extract the sets and variables within those sets. Each information set is formatted the same way. The first set is Title: Sir; Last Name: John; and the second set is Title: Mr.; Last Name: Smith."

Using a regular expression pattern similar to

"Title: ([^;]*?); Last Name: ([^;.]*?)"

I would like to produce the following matrix from the above string:

     [,1]  [,2]
[1,] Sir   John
[2,] Mr.   Smith

I have successfully accomplished this in PHP on a remote server using the preg_match_all function; however, the text files I am accessing are relatively large (not huge, but slow to upload anyway). Doing this locally in R would save a significant amount of time.

I have read up on the use of grep and related functions in R, but every example I have found searches for patterns across a vector, and I have been unable to generate the matrix described above.

I have also played with the stringr package but again I have not been successful generating a matrix.

This seems like a common task to me so I am sure someone smarter than me has found a solution before.

AWaddington asked Dec 03 '25 01:12


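1 Answer

Consider the following option using regmatches with gregexpr. The \K in each alternative discards the text matched so far, so only the value after each label is returned, and (?i) makes the match case-insensitive: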

x <- 'This is a sample string written like a paragraph. In this string two sets of information exist. Each set contains two variables. We want to extract the sets and variables within those sets. Each information set is formatted the same way. The first set is Title: Sir; Last Name: John; and the second set is Title: Mr.; Last Name: Smith.'
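# \\K drops the "Title: " / "Last Name: " label from each match
m <- regmatches(x, gregexpr('(?i)Title: \\K[^;]+|Last Name: \\K[^;.]+', x, perl=TRUE))
# unlist() keeps the matches in order of appearance; fill row by row to pair them up
matrix(unlist(m), ncol=2, byrow=TRUE)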

Output:

     [,1]  [,2]   
[1,] "Sir" "John" 
[2,] "Mr." "Smith"
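
The approach above relies on the two fields always alternating. If a behaviour closer to preg_match_all with capture groups is wanted, the following is a minimal base R sketch, not a definitive implementation. It assumes a reasonably recent R in which regexec accepts perl = TRUE, and it uses greedy groups instead of the lazy ones in the question, since a trailing lazy group such as ([^;.]*?) can match the empty string. The variable names (pattern, full, groups, result) are illustrative only.

```r
# Relevant sentence from the sample text in the question.
x <- "The first set is Title: Sir; Last Name: John; and the second set is Title: Mr.; Last Name: Smith."

# One capture group per field; greedy so the last group is not left empty.
pattern <- "Title: ([^;]+); Last Name: ([^;.]+)"

# Step 1: find every full match in the single input string.
full <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# Step 2: re-match each full match to recover its capture groups.
# Each list element is c(full match, group 1, group 2).
groups <- regmatches(full, regexec(pattern, full, perl = TRUE))

# Step 3: stack the matches into rows and drop column 1 (the full match).
result <- do.call(rbind, groups)[, -1, drop = FALSE]
print(result)
```

Each row of result is one match and each column is one capture group, so a set with a missing field simply fails to match rather than shifting later values into the wrong column.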

hwnd answered Dec 05 '25 16:12


