Extract websites links from a text in R

Tags:

r

I have multiple texts, each of which may contain references to one or more web links. For example:

 text1 = "s@1212a as www.abcd.com asasa11"

How do I extract:

   "www.abcd.com" 

from this text in R? In other words, I am looking to extract patterns that start with www and end with .com.

user1848018 asked Mar 22 '13 20:03



2 Answers

regmatches. This approach uses regexpr/gregexpr and regmatches. I expanded the test data to include more examples.

text1 <- c("s@1212a www.abcd.com www.cats.com", 
           "www.boo.com", 
           "asdf",
           "blargwww.test.comasdf")

# Regular expressions take some practice.
# check out ?regex or the wikipedia page on regular expressions
# for more info on creating them yourself.
pattern <- "www\\..*?\\.com"
# Get information about where the pattern matches text1
m <- gregexpr(pattern, text1)
# Extract the matches from text1
regmatches(text1, m)

Which gives

> regmatches(text1, m) ##
[[1]]
[1] "www.abcd.com" "www.cats.com"

[[2]]
[1] "www.boo.com"

[[3]]
character(0)

[[4]]
[1] "www.test.com"

Notice that it returns a list, because we used gregexpr, which allows for multiple matches in each string. If we want a vector, we can just use unlist on the result. If we know there is at most one match per string, we could use regexpr instead:

> m <- regexpr(pattern, text1)
> regmatches(text1, m)
[1] "www.abcd.com" "www.boo.com"  "www.test.com"

Notice, however, that this returns all results as a single vector and only returns the first match from each string (note that www.cats.com isn't in the results). On the whole, though, I think either of these two methods is preferable to the gsub method, because gsub returns the entire input unchanged when no match is found. For example, take a look:

> gsub(text1, pattern=".*(www\\..*?\\.com).*", replace="\\1")
[1] "www.abcd.com" "www.boo.com"  "asdf"         "www.test.com"

And that's even after modifying the pattern to be a little more robust. We still get 'asdf' in the results even though it clearly doesn't match the pattern.
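As a minimal sketch of the two points above (not code from the original answer): unlist flattens the gregexpr result into one vector, and if you want the regexpr result to stay aligned with the input, you can fill the non-matching positions with NA yourself. The names all_links and first_link are made up for illustration.

```r
text1 <- c("s@1212a www.abcd.com www.cats.com",
           "www.boo.com",
           "asdf",
           "blargwww.test.comasdf")
pattern <- "www\\..*?\\.com"

# All matches from all strings, flattened into one character vector
all_links <- unlist(regmatches(text1, gregexpr(pattern, text1)))

# First match per string, kept aligned with text1:
# regexpr() returns -1 for elements without a match, so start with NA
# everywhere and fill in only the positions that did match.
m <- regexpr(pattern, text1)
first_link <- rep(NA_character_, length(text1))
first_link[m != -1] <- regmatches(text1, m)

print(all_links)
print(first_link)
```

With this layout, "asdf" shows up as NA rather than being silently dropped or returned unchanged.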

Shameless silly self promotion: regmatches was introduced in R 2.14, so if you're stuck with an earlier version of R you might be out of luck, unless you're able to install the future2.14 package from my GitHub repo, which provides some support in earlier versions of R for functions introduced in 2.14.

strapplyc. An alternative which gives the same result as ## above is:

library(gsubfn)
strapplyc(text1, pattern)

The regular expression. Here is some explanation of how to decipher the regular expression:

pattern <- "www\\..*?\\.com"

Explanation:

www matches the www portion

\\. We need to escape an actual 'dot' character using \\ because a plain . represents "any character" in regular expressions.

.*? The . represents any character, the * tells it to match 0 or more times, and the ? following the * tells it not to be greedy. Otherwise "asdf www.cats.com www.dogs.com asdf" would match all of "www.cats.com www.dogs.com" as a single match instead of recognizing that there are two matches in there.

\\. Once again we need to escape an actual dot character

com This part matches the ending 'com' that we want to match

Putting it all together it says: start with www. then match any characters until you reach the first ".com"
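To see what the ? changes, here is a small sketch that runs the greedy and non-greedy versions on the example string from the explanation. The variable names are illustrative, and perl = TRUE is an assumption made here only so both patterns go through the same PCRE engine; the pattern in the answer is used without it.

```r
x <- "asdf www.cats.com www.dogs.com asdf"

greedy <- "www\\..*\\.com"   # .* grabs as much as it can
lazy   <- "www\\..*?\\.com"  # .*? stops at the first ".com"

# Extract every match of each pattern from x
greedy_hits <- regmatches(x, gregexpr(greedy, x, perl = TRUE))[[1]]
lazy_hits   <- regmatches(x, gregexpr(lazy, x, perl = TRUE))[[1]]

print(greedy_hits)  # one long match spanning both links
print(lazy_hits)    # two separate matches
```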

Dason answered Oct 23 '22 10:10


Check out the gsub function:

x = "s@1212a as www.abcd.com asasa11"
gsub(x=x, pattern=".*(www.*com).*", replace="\\1")

The basic idea is to surround the text you want to retain in parentheses, then replace the entire line with it. The "\\1" in the replace parameter of gsub refers to whatever was matched inside the parentheses.
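One caveat that may be worth checking on your own data: in the pattern above the dots are unescaped and .* is greedy, so the captured group can run past the link if "com" appears again later in the string. The sketch below compares it with a stricter variant that escapes the dots and uses a non-greedy quantifier. The input x2 and the names loose and strict are made up for illustration, and perl = TRUE is an assumption chosen to get predictable PCRE behaviour.

```r
# Hypothetical input where "com" occurs again after the link
x2 <- "see www.abcd.com for more comments"

# Pattern as in the answer: unescaped dots, greedy .*
loose <- sub(".*(www.*com).*", "\\1", x2, perl = TRUE)

# Stricter variant: escaped dots, non-greedy .*? inside the group
strict <- sub(".*(www\\..*?\\.com).*", "\\1", x2, perl = TRUE)

print(loose)   # runs on to the "com" in "comments"
print(strict)  # stops at the first ".com"
```

Either way, a string with no link at all comes back unchanged, so a check with grepl beforehand is a reasonable guard.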

kith answered Oct 23 '22 10:10