I'm scraping all the text from a website that occurs in a specific class of div. In the following example, I want to extract everything that's in a div of class "a".
site <- "<div class='a'>Hello, world</div>
<div class='b'>Good morning, world</div>
<div class='a'>Good afternoon, world</div>"
My desired output is...
"Hello, world"
"Good afternoon, world"
The code below extracts the text from every div, but I can't figure out how to include only class="a".
library(tidyverse)
library(rvest)
site %>%
read_html() %>%
html_nodes("div") %>%
html_text()
# [1] "Hello, world" "Good morning, world" "Good afternoon, world"
With Python's BeautifulSoup, it would look something like site.find_all("div", class_="a").
The CSS selector for a div with class "a" is div.a:
site %>%
read_html() %>%
html_nodes("div.a") %>%
html_text()
Or you can use XPath:
html_nodes(xpath = "//div[@class='a']")
To match any element with class "a", not only div, use a wildcard in the XPath:
site %>%
read_html() %>%
html_nodes(xpath = '//*[@class="a"]') %>%
html_text()
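One caveat when choosing between the two approaches: the CSS selector div.a matches a div whose class list contains "a", while the XPath test @class='a' compares the whole attribute string. The sketch below illustrates the difference on a made-up variant of the example, where the first div has been given a second class "x" (that input is an assumption for illustration, not part of the original question). The concat/normalize-space expression is a common XPath 1.0 idiom for matching a single class token.

```r
library(rvest)

# Hypothetical input: the first div carries two classes, "a" and "x".
page <- read_html("<div class='a x'>Hello, world</div>
<div class='b'>Good morning, world</div>
<div class='a'>Good afternoon, world</div>")

# CSS class selector: matches any div whose class list contains "a".
css_text <- page %>% html_nodes("div.a") %>% html_text()
# [1] "Hello, world"          "Good afternoon, world"

# XPath exact comparison: matches only when the attribute is exactly "a".
exact_text <- page %>% html_nodes(xpath = "//div[@class='a']") %>% html_text()
# [1] "Good afternoon, world"

# XPath token match: pad the class list with spaces and look for " a ".
token_xpath <- "//div[contains(concat(' ', normalize-space(@class), ' '), ' a ')]"
token_text <- page %>% html_nodes(xpath = token_xpath) %>% html_text()
# [1] "Hello, world"          "Good afternoon, world"
```

If the divs on the real site only ever have the single class "a", all three selectors return the same result, and the CSS form is the shortest to write.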