Using the following documentation, I have been trying to scrape a series of tables from marketwatch.com.
Here is the one represented by the code below.
The link and XPath are already included in the code:
url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation <- url %>%
html() %>%
html_nodes(xpath='//*[@id="maincontent"]/div[2]/div[1]') %>%
html_table()
valuation <- valuation[[1]]
I get the following error:
Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
Thanks in advance.
XPath is a language for writing patterns that are matched against a document's structure to select the data to scrape. A pattern describes a location in the document tree as a path, with each parent node written before its child nodes.
rvest helps you scrape (or harvest) data from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, and is inspired by libraries like Beautiful Soup and RoboBrowser.
In general, web scraping in R (or in any other language) boils down to three steps: (1) get the HTML for the web page that you want to scrape; (2) decide what part of the page you want to read and find out what HTML/CSS you need to select it; (3) select the HTML and analyze it in the way you need.
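As a rough illustration of those three steps, here is a minimal sketch with rvest. The inline HTML, its class names, and the value in it are made up for the example and stand in for a downloaded page; the XPath lists each parent before its child.

```r
library(rvest)

# Step 1: get the HTML. An inline string (hypothetical markup and values)
# stands in for a page fetched with read_html(url).
page <- read_html('
  <div id="maincontent">
    <div class="section">
      <div class="column">P/E Current</div>
      <div class="data lastcolumn">12.5</div>
    </div>
  </div>')

# Step 2: decide what to select. The path walks parent -> child -> grandchild.
label_nodes <- html_nodes(
  page,
  xpath = '//div[@id="maincontent"]/div[@class="section"]/*[@class="column"]'
)

# Step 3: extract the text from the selected nodes.
label_text <- html_text(label_nodes)
label_text
```

For a live page, the same calls should apply once the string is replaced by a URL, provided the selectors match the markup the server actually returns.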
That website doesn't use an HTML table, so html_table() can't find anything. It actually uses div classes column and data lastcolumn.
So you can do something like:
library(rvest)
url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
# Download and parse the page once, then reuse it for both selections
page <- read_html(url)
# Label cells
valuation_col <- page %>%
  html_nodes(xpath='//*[@class="column"]')
# Value cells
valuation_data <- page %>%
  html_nodes(xpath='//*[@class="data lastcolumn"]')
Or even
url %>%
read_html() %>%
html_nodes(xpath='//*[@class="section"]')
That should get you most of the way there.
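To go from the selected nodes to something table-like, one option is to pull the text out of each node set and pair labels with values by position. This is only a sketch: the inline HTML and its numbers are invented for illustration, and it assumes every label node has exactly one matching value node, which may not hold on the real page.

```r
library(rvest)

# Hypothetical markup using the same class names as above; values are placeholders.
page <- read_html('
  <div class="section">
    <div class="column">P/E Current</div>
    <div class="data lastcolumn">12.5</div>
  </div>
  <div class="section">
    <div class="column">Price to Sales Ratio</div>
    <div class="data lastcolumn">1.8</div>
  </div>')

# Extract the text of the label nodes and the value nodes.
labels <- page %>%
  html_nodes(xpath = '//*[@class="column"]') %>%
  html_text(trim = TRUE)
values <- page %>%
  html_nodes(xpath = '//*[@class="data lastcolumn"]') %>%
  html_text(trim = TRUE)

# Pairing by position only makes sense if the two sets line up.
stopifnot(length(labels) == length(values))

valuation <- data.frame(metric = labels, value = values,
                        stringsAsFactors = FALSE)
valuation
```

The values come back as character strings; convert them with as.numeric() only after checking how the page formats numbers.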
Please also read their terms of use - particularly 3.4.