
How to extract keywords from Google search results page URL?





One of the variables in my dataset contains URLs of Google search results pages. I want to extract the search keywords from those URLs.

An example dataset:

keyw <- structure(list(user = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1", "p2"), class = "factor"),
                   url = structure(c(3L, 5L, 4L, 1L, 2L, 6L), .Label = c("https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw", "https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw#safe=off&q=five+short+fingers+", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+a+chair", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+handshake", "https://www.youtube.com/watch?v=6HOallAdtDI"), class = "factor")), 
              .Names = c("user", "url"), class = "data.frame", row.names = c(NA, -6L))

So far I was able to extract the search keyword parts from the URLs with:

library(stringr)
keyw$words <- sapply(str_extract_all(keyw$url, 'q=([^&#]*)'), paste, collapse=",")

However, this still doesn't give me the desired result. The above code gives the following result:

> keyw$words
[1] "q=high+five"                           
[2] "q=high+five,q=high+five+with+handshake"
[3] "q=high+five,q=high+five+with+a+chair"  
[4] "q=five+fingers"                        
[5] "q=five+fingers,q=five+short+fingers+"  
[6] ""                                      

There are three problems with this output:

  1. I only need the words as a string. Instead of q=high+five, I need high,five.
  2. As rows 2, 3 & 5 show, the URL sometimes contains two parts with search keywords. As the first part is merely a reference to the previous search, I only need the second search query.
  3. When the URL is not a Google search page URL, it should return an NA.

The desired result should be:

> keyw$words
[1] "high,five"                           
[2] "high,five,with,handshake"
[3] "high,five,with,a,chair"  
[4] "five,fingers"                        
[5] "five,short,fingers"
[6] NA

How do I solve this?

Jaap asked May 29 '15 13:05



1 Answer

Another update after a comment (it looks overly complex, but it's the best I can achieve at this point :)):

keyw$words <- sapply(str_extract_all(str_extract(keyw$url,"https?:[/]{2}[^/]*google.*[/].*"),'(?<=q=|[+])([^$+#&]+)(?!.*q=)'),function(x) if(!length(x)) NA else paste(x,collapse=","))
> keyw$words
[1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"   "five,fingers"            
[5] "five,short,fingers"       NA             

The change is a filter on the input to str_extract_all: instead of the full vector, it now receives a "filtered" one, in which str_extract has kept only the URLs that match a regex (non-matching entries become NA). Any regex can go there, to match more or less precisely what you want.

Here the regex is:

  • http literally http
  • s? 0 or 1 s
  • : a literal colon
  • [/]{2} exactly two slashes (using a character class avoids the ugly \\/ construction and keeps things more readable)
  • [^/]* any number of non-slash characters
  • google.*[/] literally google, followed by anything up to the last /
  • .* finally, anything (or nothing) after the last slash

Replace * with + wherever you want to ensure that something is actually present, e.g. a parameter after the last slash (+ requires the preceding element to occur at least once).
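
To see the filtering step on its own, here is a minimal sketch in base R (so it runs without stringr). It assumes that regexpr() with perl = TRUE treats this pattern the same way str_extract does for these inputs. The first URL is shortened from the question's data, and the names google_pattern and filtered are made up for the illustration.

```r
# Hypothetical sample: one Google URL (shortened) and the non-Google URL from the question
urls <- c("https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8",
          "https://www.youtube.com/watch?v=6HOallAdtDI")

# Filter regex from the answer, unchanged
google_pattern <- "https?:[/]{2}[^/]*google.*[/].*"

# regexpr() returns -1 where the pattern does not match
m <- regexpr(google_pattern, urls, perl = TRUE)

# Start from all-NA and fill in only the positions that matched,
# mimicking how str_extract() yields NA for non-matching strings
filtered <- rep(NA_character_, length(urls))
filtered[as.vector(m) > 0] <- regmatches(urls, m)
print(filtered)
```

The Google URL comes through unchanged and the YouTube URL becomes NA, which is what lets the later keyword step skip it.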

Update, heavily inspired by @BrodieG: this returns NA if there's no match, but it still matches any site that has q= in its parameters.

Still with the same method:

> keyw$words <- sapply(str_extract_all(keyw$url,'(?:(?<=q=|\\+)([^$+#&]+)(?!.*q=))'),function(x) if(!length(x)) NA else paste(x,collapse=","))
> keyw$words
[1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
[4] "five,fingers"             "five,short,fingers"       NA         


The lookbehind (?<=...) ensures there's a q= or a + immediately before the word, and the negative lookahead (?!...) ensures that no further q= can be found before the end of the line.

The character class [^$+#&] excludes the + sign, so each match stops at the end of a word.
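
The effect of the lookbehind and the negative lookahead can be checked on a single URL. This is only a sketch: it uses base R's gregexpr() with perl = TRUE instead of stringr, on the assumption that PCRE handles this pattern the same way for these inputs, and last_query_words is a hypothetical helper name, not part of any package.

```r
# Keyword pattern from the answer, unchanged
pattern <- "(?<=q=|[+])([^$+#&]+)(?!.*q=)"

# Hypothetical helper: all matches for a single URL, joined by commas
last_query_words <- function(url) {
  hits <- regmatches(url, gregexpr(pattern, url, perl = TRUE))[[1]]
  if (!length(hits)) NA_character_ else paste(hits, collapse = ",")
}

# URL with two q= parts (shortened from the question): words of the first
# query are rejected by the lookahead because another q= still follows them
two_queries <- "https://www.google.nl/search?q=high+five&ie=utf-8#safe=off&q=high+five+with+a+chair"
print(last_query_words(two_queries))

# URL without any q= or +: nothing matches, so the helper returns NA
print(last_query_words("https://www.youtube.com/watch?v=6HOallAdtDI"))
```

Only the words of the last query come back for the first URL, and the YouTube URL gives NA because it has no q= at all, not because of its host.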

Tensibai answered Sep 21 '22 03:09
