
Web scraping in R: extract names from links by their `href` attribute

Tags:

r

web-scraping

This is my code:

library(rvest)
library(XML)
library(xml2)
url_imb <- 'https://www.imdb.com/search/title/?count=100&release_date=2016,2016&title_type=feature'
web_page<-read_html(url_imb)

I want to extract all director names from the links whose href contains adv_li_dr_0.

This is what I tried. CSS selector:

directors_0<-html_text(html_nodes(web_page,"p a"))

XPath selector:

directors_0<-html_attr(html_nodes(web_page,xpath='//p[@class=""]//a'),"href")

It is incomplete, of course. Can you help me? How do I extract the elements whose href contains a given string?

Laura asked Sep 12 '19 15:09




2 Answers

Is this what you want?

library(rvest)
library(XML)
library(xml2)
url_imb <- 'https://www.imdb.com/search/title/?count=100&release_date=2016,2016&title_type=feature'
directors <- read_html(url_imb) %>% 
  html_nodes(xpath = "//p[contains(text(),'Director')]/a[contains(@href, '_dr')]") %>% 
  html_text()

Mislav answered Oct 12 '22 23:10


I would use a CSS attribute selector with the contains operator (*=) to require that the href attribute contains the substring adv_li_dr_. Note that I have dropped the trailing 0 on the assumption that you want all directors. If you want only the first director of each film, put the 0 back on the end. This should also be faster and less fragile than XPath.

library(rvest)
library(magrittr)

url_imb <- 'https://www.imdb.com/search/title/?count=100&release_date=2016,2016&title_type=feature'
directors <- read_html(url_imb) %>%
  html_nodes('[href*=adv_li_dr_]') %>%
  html_text()
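
To see the difference between the two substrings without hitting the live site, here is a minimal offline sketch. The HTML fragment is made up for illustration: it only imitates the markup the answers assume (director links with adv_li_dr_<n> in the href, cast links with adv_li_st_<n>), and the names and nm ids are placeholders, so the real page may differ.

```r
library(rvest)

# Hypothetical fragment, not copied from IMDb: director links are assumed
# to carry "adv_li_dr_<n>" in href and cast links "adv_li_st_<n>".
html <- '
<div>
  <p class="">
    Directors:
    <a href="/name/nm0000001/?ref_=adv_li_dr_0">Director A</a>,
    <a href="/name/nm0000002/?ref_=adv_li_dr_1">Director B</a>
    | Stars:
    <a href="/name/nm0000003/?ref_=adv_li_st_0">Star C</a>
  </p>
</div>'

page <- read_html(html)

# Every element whose href contains "adv_li_dr_": all directors
all_directors <- page %>%
  html_nodes("[href*='adv_li_dr_']") %>%
  html_text()

# Only hrefs containing "adv_li_dr_0": the first listed director
first_director <- page %>%
  html_nodes("[href*='adv_li_dr_0']") %>%
  html_text()

print(all_directors)   # "Director A" "Director B"
print(first_director)  # "Director A"
```

The cast link is skipped in both cases because its href contains adv_li_st_ rather than adv_li_dr_.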

Reading:

  1. Attribute selectors.

QHarr answered Oct 12 '22 21:10