Scraping the content of all div tags with a specific class

Tags: r, rvest

I'm trying to scrape all the text on a website that occurs inside divs of a specific class. In the following example, I want to extract everything that's in a div of class "a".

site <- "<div class='a'>Hello, world</div>
  <div class='b'>Good morning, world</div>
  <div class='a'>Good afternoon, world</div>"

My desired output is...

"Hello, world"
"Good afternoon, world"

The code below extracts the text from every div, but I can't figure out how to include only class="a".

library(tidyverse)
library(rvest)

site %>% 
  read_html() %>% 
  html_nodes("div") %>% 
  html_text()

# [1] "Hello, world"          "Good morning, world"   "Good afternoon, world"

With Python's BeautifulSoup, it would look something like site.find_all("div", class_="a").

Andrew Brēza asked Jan 22 '18 00:01


2 Answers

The CSS selector for a div with class "a" is div.a:

site %>% 
  read_html() %>% 
  html_nodes("div.a") %>% 
  html_text()

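Or you can use XPath, which here matches only divs whose class attribute is exactly "a":

site %>% read_html() %>% 
  html_nodes(xpath = "//div[@class='a']") %>% html_text()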
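
The two selectors above are not quite equivalent, which a small sketch may make clearer: the CSS selector div.a matches any div whose class list contains a, while the XPath comparison @class='a' matches only when the whole attribute is exactly "a". The markup in site2 is a made-up variant of the question's example with an extra class added, and the contains()/normalize-space() expression is a common XPath idiom for class-token matching rather than anything from the original post.

```r
library(rvest)

# Hypothetical markup: the last div carries two classes, 'a' and 'big'.
site2 <- "<div class='a'>Hello, world</div>
  <div class='b'>Good morning, world</div>
  <div class='a big'>Good afternoon, world</div>"

page <- read_html(site2)

# CSS class selector: matches every div that has 'a' among its classes.
css_text <- page %>% html_nodes("div.a") %>% html_text()

# Exact attribute comparison: skips the div with class='a big'.
exact_text <- page %>% html_nodes(xpath = "//div[@class='a']") %>% html_text()

# Token-aware XPath: pad the class list with spaces, then look for ' a '.
token_xpath <- "//div[contains(concat(' ', normalize-space(@class), ' '), ' a ')]"
token_text <- page %>% html_nodes(xpath = token_xpath) %>% html_text()
```

If the pages being scraped only ever use a single class per div, the simpler XPath is enough; otherwise the CSS form is probably the less error-prone choice.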
neilfws answered Oct 31 '22 23:10


site %>% 
  read_html() %>% 
  html_nodes(xpath = '//*[@class="a"]') %>% 
  html_text()
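
Note that this XPath uses * rather than div, so it selects any element whose class attribute is exactly "a", not only divs. The sketch below illustrates the difference on made-up markup (the mixed string and its span are assumptions added for illustration, not part of the question).

```r
library(rvest)

# Hypothetical markup: a span shares class 'a' with the two divs.
mixed <- "<div class='a'>Hello, world</div>
  <span class='a'>Not a div</span>
  <div class='a'>Good afternoon, world</div>"

page <- read_html(mixed)

# '*' matches any element name, so the span is included.
any_tag <- page %>% html_nodes(xpath = '//*[@class="a"]') %>% html_text()

# Naming the element restricts the match to divs.
div_only <- page %>% html_nodes(xpath = '//div[@class="a"]') %>% html_text()
```

For the question's example, where only divs carry the class, both expressions return the same two strings.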
DJack answered Nov 01 '22 00:11