Extract websites links from a text in R

Tags:

r

I have multiple texts that may each contain references to one or more web links. For example:

 text1 = "s@1212a as www.abcd.com asasa11"

How do I extract:

   "www.abcd.com"

from this text in R? In other words, I am looking to extract patterns that start with www and end with .com.

449

asked Mar 22 '13 20:03

user1848018

2 Answers

regmatches. This approach uses regexpr/gregexpr and regmatches. I expanded the test data to include more examples.

text1 <- c("s@1212a www.abcd.com www.cats.com", 
           "www.boo.com", 
           "asdf",
           "blargwww.test.comasdf")

# Regular expressions take some practice.
# check out ?regex or the wikipedia page on regular expressions
# for more info on creating them yourself.
pattern <- "www\\..*?\\.com"
# Get information about where the pattern matches text1
m <- gregexpr(pattern, text1)
# Extract the matches from text1
regmatches(text1, m)

Which gives

> regmatches(text1, m) ##
[[1]]
[1] "www.abcd.com" "www.cats.com"

[[2]]
[1] "www.boo.com"

[[3]]
character(0)

[[4]]
[1] "www.test.com"

Notice that it returns a list. This is because we used gregexpr, which allows for multiple matches in each string. If you want a vector, you can just use unlist on the result. If we know there is at most one match per string, we could use regexpr instead:

> m <- regexpr(pattern, text1)
> regmatches(text1, m)
[1] "www.abcd.com" "www.boo.com"  "www.test.com"

Notice, however, that this returns all results as a single vector and only returns the first match from each string (note that www.cats.com isn't in the results). Strings with no match, such as "asdf", are dropped entirely. On the whole, though, I think either of these two methods is preferable to the gsub method, because gsub returns the entire input when no match is found. For example, take a look:

> gsub(text1, pattern=".*(www\\..*?\\.com).*", replace="\\1")
[1] "www.abcd.com" "www.boo.com"  "asdf"         "www.test.com"

And that's even after modifying the pattern to be a little more robust. We still get 'asdf' in the results even though it clearly doesn't match the pattern.
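As a rough sketch of the two base R approaches above, assuming the same text1 and pattern: the first call flattens the gregexpr result with unlist, and the second keeps one entry per input string by filling non-matches with NA. The names all_links and first_link are made up for illustration.

```r
text1 <- c("s@1212a www.abcd.com www.cats.com",
           "www.boo.com",
           "asdf",
           "blargwww.test.comasdf")
pattern <- "www\\..*?\\.com"

# Every match from every string, flattened into one character vector
all_links <- unlist(regmatches(text1, gregexpr(pattern, text1)))

# First match per string, keeping one slot per input.
# regexpr() returns -1 where nothing matched, so those slots stay NA.
m <- regexpr(pattern, text1)
first_link <- rep(NA_character_, length(text1))
first_link[m != -1] <- regmatches(text1, m)

print(all_links)
print(first_link)
```

Keeping the NA placeholders is one possible design choice: it makes the result line up with the input, which is convenient if the links are going into a data frame column next to the original text.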

Shameless silly self promotion: regmatches was introduced with R 2.14, so if you're stuck with an earlier version of R you might be out of luck, unless you're able to install the future2.14 package from my github repo, which provides some support for functions introduced in 2.14 to earlier versions of R.

strapplyc. An alternative which gives the same result as ## above is:

library(gsubfn)
strapplyc(text1, pattern)

The regular expression. Here is some explanation of how to decipher the regular expression:

pattern <- "www\\..*?\\.com"

Explanation:

www matches the www portion

\\. We need to escape an actual 'dot' character using \\ because a plain . represents "any character" in regular expressions.

.*? The . represents any character, the * tells it to match 0 or more times, and the ? following the * tells it not to be greedy. Otherwise "asdf www.cats.com www.dogs.com asdf" would match all of "www.cats.com www.dogs.com" as a single match instead of recognizing that there are two matches in there.

\\. Once again we need to escape an actual dot character

com This part matches the ending 'com' that we want to match

Putting it all together it says: start with www. then match any characters until you reach the first ".com"
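To see the effect of the ? in practice, here is a small sketch that runs a greedy and a non-greedy version of the pattern on the example string from the explanation. The variable names greedy and lazy are just illustrative.

```r
x <- "asdf www.cats.com www.dogs.com asdf"

greedy <- "www\\..*\\.com"   # .* grabs as much as it can
lazy   <- "www\\..*?\\.com"  # .*? stops at the first ".com"

greedy_hits <- regmatches(x, gregexpr(greedy, x))[[1]]
lazy_hits   <- regmatches(x, gregexpr(lazy, x))[[1]]

print(greedy_hits)  # one long match spanning both links
print(lazy_hits)    # two separate matches
```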

answered Oct 23 '22 10:10

Dason

Check out the gsub function:

x = "s@1212a as www.abcd.com asasa11"
gsub(x=x, pattern=".*(www.*com).*", replace="\\1")

The basic idea is to surround the text you want to retain in parentheses, then replace the entire line with it. The replace parameter of gsub, "\\1", refers to what was found in the parentheses.
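One thing to watch for with this approach is that a string with no link comes back unchanged. A minimal sketch of one way to guard against that, assuming you are happy to get NA for non-matching strings (the second test string and the names replaced, has_link, and links are added here for illustration):

```r
x <- c("s@1212a as www.abcd.com asasa11", "asdf")

# Same pattern as above: the parentheses capture the part to keep
capture <- ".*(www.*com).*"
replaced <- sub(capture, "\\1", x)   # "asdf" passes through untouched

# Only trust the replacement where the pattern actually occurs
has_link <- grepl("www.*com", x)
links <- ifelse(has_link, replaced, NA_character_)

print(replaced)
print(links)
```

Here sub is used rather than gsub because the pattern already consumes the whole string, so there is only one replacement to make per element.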

answered Oct 23 '22 10:10

kith
