
How to pass an xpath to html_nodes()?

Tags: r, xpath, rvest

I want to use html_nodes() to scrape organizations' names from Google search results. I only need the first result, on the assumption that it is the best guess. Right now I am trying to target the first result by its XPath and pass that to html_nodes(). To find the XPath, I am using Google Chrome (screenshot omitted).

That gives me //*[@id="rso"]/div[1]/div/div[1]/div/div/h3/a as the XPath for the title of the first result. However, when I pass it to html_nodes() I get an empty node set:

page %>% html_nodes(xpath='//*[@id="rso"]/div[1]/div/div[1]/div/div/h3/a')
{xml_nodeset (0)}

I would instead expect the string "The A-Test 2017 Workshop".

How can I get the content of that <a> tag, using either XPath or CSS?

Asked by Dambo, Nov 07 '22 19:11


1 Answer

When scraping websites, SelectorGadget is a great tool. Using it, I could determine that in Google search results all headings can be matched with the CSS class selector .r.

To scrape the results you could therefore use something like this:

library(rvest)

# searching for `rstudio`
page <- read_html("https://www.google.at/search?client=safari&rls=en&q=rstudio&ie=UTF-8&oe=UTF-8&gfe_rd=cr&ei=VpJsWe2oOqqk8wfT5KaQDQ")

page %>% 
  html_nodes(".r") %>%
  html_text()
#>  [1] "RStudio – Open source and enterprise-ready professional software ..."
#>  [2] "Download"                                                            
#>  [3] "Download RStudio Server"                                             
#>  [4] "RStudio Server"                                                      
#>  [5] "Shiny"                                                               
#>  [6] "RStudio – Wikipedia"                                                 
#>  [7] "RStudio - Wikipedia"                                                 
#>  [8] "Datenrettung | R-Studio 8.3 Deutsch | Software zur Datenrettung ..." 
#>  [9] "GitHub - rstudio/rstudio: RStudio is an integrated development ..."  
#> [10] "RStudio · GitHub"                                                    
#> [11] "R-Studio"                                                            
#> [12] "Install RStudio with R Server on HDInsight - Azure | Microsoft Docs"

You can easily find the first one with subsetting:

page %>% 
  html_nodes(".r") %>%
  html_text() %>% 
  .[1]
#> [1] "RStudio – Open source and enterprise-ready professional software ..."
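
If only the first match is needed, html_node() (singular) can replace the subsetting step, and the same element can also be reached with an XPath through the named xpath argument. The sketch below runs against a small made-up HTML string rather than a live search page; the markup, the example.org URLs and the variable names are assumptions for illustration and do not reflect Google's actual page structure.

```r
library(rvest)

# Hypothetical stand-in for a results page; NOT Google's real markup.
html <- '<html><body><div id="rso">
  <h3 class="r"><a href="https://example.org/a">First result</a></h3>
  <h3 class="r"><a href="https://example.org/b">Second result</a></h3>
</div></body></html>'
page <- read_html(html)

# CSS selector: html_node() returns only the first matching element.
first_css <- page %>%
  html_node(".r") %>%
  html_text()

# Same element via XPath; note the argument must be named `xpath`.
first_xpath <- page %>%
  html_node(xpath = '//*[@id="rso"]/h3[@class="r"][1]/a') %>%
  html_text()
```

On this toy input both first_css and first_xpath hold "First result". On a real page the XPath has to match the HTML that read_html() actually downloaded, which is not necessarily what the browser's inspector shows.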

This blog post demonstrates the approach more thoroughly: https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/
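
As for why an XPath copied from the browser can come back as {xml_nodeset (0)}: one possible cause is that the HTML read_html() receives is nested differently from the DOM the browser displays. A rough way to check is to test progressively longer prefixes of the path and see where the match count drops to zero. The sketch below does this on a made-up HTML string; the markup and the steps vector are hypothetical and only illustrate the technique.

```r
library(rvest)

# Hypothetical served HTML, flatter than a browser inspector might show.
page <- read_html('<html><body><div id="rso">
  <div><h3 class="r"><a href="#">First result</a></h3></div>
</div></body></html>')

# Progressively longer prefixes of an XPath copied from a browser (illustrative).
steps <- c(
  '//*[@id="rso"]',
  '//*[@id="rso"]/div[1]',
  '//*[@id="rso"]/div[1]/div',
  '//*[@id="rso"]/div[1]/div/h3/a'
)

# Count matches for each prefix; the first zero marks where the path diverges.
matches <- vapply(
  steps,
  function(xp) length(html_nodes(page, xpath = xp)),
  integer(1)
)
print(matches)
```

Here the first two prefixes match one node each and the last two match none, so the extra /div step is where the copied path stops agreeing with the downloaded document.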

Answered by Thomas K, Dec 11 '22 17:12