Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to scrape a table with rvest and xpath?

using the following documentation i have been trying to scrape a series of tables from marketwatch.com

here is the one represented by the code bellow:

enter image description here

The link and xpath are already included in the code:

url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation <- url %>%
  html() %>%
  html_nodes(xpath='//*[@id="maincontent"]/div[2]/div[1]') %>%
  html_table()
valuation <- valuation[[1]]

I get the following error:

Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated") 

Thanks in advance.

like image 980
Alex Bădoi Avatar asked Feb 29 '16 19:02

Alex Bădoi


People also ask

What is Xpath scraping?

Xpath is a way to write a pattern that can be matched to a document structure for scraping data. It specifies the parts of a document in a tree structure manner where the parent node is written before the child node inside a pattern.

What is the purpose of Rvest package in R?

rvest helps you scrape (or harvest) data from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, inspired by libraries like beautiful soup and RoboBrowser.

How do you scrape data in R studio?

In general, web scraping in R (or in any other language) boils down to the following three steps: Get the HTML for the web page that you want to scrape. Decide what part of the page you want to read and find out what HTML/CSS you need to select it. Select the HTML and analyze it in the way you need.


1 Answers

That website doesn't use an html table, so html_table() can't find anything. It actaully uses div classes column and data lastcolumn.

So you can do something like

url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation_col <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@class="column"]')
    
valuation_data <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@class="data lastcolumn"]')

Or even

url %>%
  read_html() %>%
  html_nodes(xpath='//*[@class="section"]')

To get you most of the way there.

Please also read their terms of use - particularly 3.4.

like image 77
SymbolixAU Avatar answered Oct 07 '22 07:10

SymbolixAU