
How to extract keywords from Google search results page URL?

Tags: regex, url, r
One of the variables in my dataset contains URLs of Google search results pages. I want to extract the search keywords from those URLs.

An example dataset:

keyw <- structure(list(user = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1", "p2"), class = "factor"),
                   url = structure(c(3L, 5L, 4L, 1L, 2L, 6L), .Label = c("https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw", "https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw#safe=off&q=five+short+fingers+", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+a+chair", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+handshake", "https://www.youtube.com/watch?v=6HOallAdtDI"), class = "factor")), 
              .Names = c("user", "url"), class = "data.frame", row.names = c(NA, -6L))

So far I was able to extract the search keyword parts from the URLs with:

library(stringr)
keyw$words <- sapply(str_extract_all(keyw$url, 'q=([^&#]*)'), paste, collapse = ",")

However, this still doesn't give me the desired result. The above code gives the following result:

> keyw$words
[1] "q=high+five"                           
[2] "q=high+five,q=high+five+with+handshake"
[3] "q=high+five,q=high+five+with+a+chair"  
[4] "q=five+fingers"                        
[5] "q=five+fingers,q=five+short+fingers+"  
[6] ""                                      

There are three problems with this output:

  1. I only need the words as a string. Instead of q=high+five, I need high,five.
  2. As rows 2, 3 & 5 show, the URL sometimes contains two parts with search keywords. As the first part is merely a reference to the previous search, I only need the second search query.
  3. When the URL is not a Google search page URL, it should return an NA.

The desired result should be:

> keyw$words
[1] "high,five"                           
[2] "high,five,with,handshake"
[3] "high,five,with,a,chair"  
[4] "five,fingers"                        
[5] "five,short,fingers"
[6] NA

How do I solve this?

Jaap asked May 29 '15 13:05



1 Answer

Another update after comment (looks too complex but it's the best I can achieve at this point :)):

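keyw$words <- sapply(
  str_extract_all(
    # keep only Google URLs; str_extract() gives NA for the rest
    str_extract(keyw$url, "https?:[/]{2}[^/]*google.*[/].*"),
    '(?<=q=|[+])([^$+#&]+)(?!.*q=)'
  ),
  # NA when nothing matched or the URL was filtered out
  function(x) if (!length(x) || all(is.na(x))) NA else paste(x, collapse = ",")
)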
> keyw$words
[1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"   "five,fingers"            
[5] "five,short,fingers"       NA             

The change is in the input to str_extract_all: instead of the full vector, it now receives a "filtered" one, in which only the URLs matching a regex are kept (str_extract returns NA for the others). Any regex can go there, to match more or less precisely what you wish.

Here the regex is:

  • http matches http literally
  • s? matches 0 or 1 s
  • [/]{2} matches exactly two slashes (using a character class avoids the ugly \\/ construction and keeps things more readable)
  • [^/]* matches any number of non-slash characters
  • google.*[/] matches google literally, followed by anything up to the last /
  • .* finally matches whatever follows the last slash, if anything

Replace * with + wherever you want to ensure there's a parameter (+ requires the preceding element to be present at least once).
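
As a minimal sketch of the filtering step described above, the snippet below applies the same regex with str_extract to two URLs. The names urls and google_only and the shortened Google URL are made up for illustration; the YouTube URL is the one from the question's data.

```r
library(stringr)

# Illustrative input: a shortened Google URL (assumption) and the
# YouTube URL from the question's dataset
urls <- c(
  "https://www.google.nl/search?q=high+five&ie=utf-8",
  "https://www.youtube.com/watch?v=6HOallAdtDI"
)

# The URL is returned only when "google" appears in the host part;
# otherwise str_extract() yields NA
google_only <- str_extract(urls, "https?:[/]{2}[^/]*google.*[/].*")
google_only
```

Note that the filter only checks that google appears somewhere in the host, so a stricter pattern may be needed if hosts that merely contain that string must be excluded.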


Update heavily inspired by @BrodieG: this returns NA if there's no match, but it will still match any site that has q= in its parameters.

Still with the same method:

> keyw$words <- sapply(str_extract_all(keyw$url,'(?:(?<=q=|\\+)([^$+#&]+)(?!.*q=))'),function(x) if(!length(x)) NA else paste(x,collapse=","))
> keyw$words
[1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
[4] "five,fingers"             "five,short,fingers"       NA         
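
To illustrate the caveat about non-Google sites, here is a small sketch that runs the same pattern on a made-up URL. The host www.example.com and the names other and words are assumptions for illustration only.

```r
library(stringr)

# Hypothetical non-Google URL that happens to carry a q= parameter
other <- "https://www.example.com/find?q=red+shoes"

# Same pattern as in the update: it has no notion of the host,
# so the words after q= are extracted all the same
pattern <- "(?:(?<=q=|\\+)([^$+#&]+)(?!.*q=))"
words <- str_extract_all(other, pattern)[[1]]
paste(words, collapse = ",")
```

This is why the later version of the answer filters the URLs on the host before extracting the words.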

Regex demo

The lookbehind (?<=...) ensures there is q= or + immediately before the word, and the negative lookahead (?!...) ensures that no further q= can be found before the end of the line.

The character class disallows the + sign, so the match stops at each word.
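
The effect of the two lookarounds can be seen on a single URL. The sketch below compares the pattern with and without the negative lookahead; the URL is a shortened variant of one from the question, and the variable names are made up for illustration.

```r
library(stringr)

# Shortened version of a URL from the question, with two q= parts
url <- "https://www.google.nl/search?q=high+five&ie=utf-8#safe=off&q=high+five+with+a+chair"

# Lookbehind only: every word preceded by q= or + is returned,
# including the words of the earlier search
all_words <- str_extract_all(url, "(?<=q=|[+])([^$+#&]+)")[[1]]

# Adding the negative lookahead rejects any word that is still
# followed by another q=, which leaves only the last query
last_words <- str_extract_all(url, "(?<=q=|[+])([^$+#&]+)(?!.*q=)")[[1]]

paste(all_words, collapse = ",")
paste(last_words, collapse = ",")
```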

Tensibai answered Sep 21 '22 03:09
