
What's my user agent when I parse a website with the rvest package in R?

Since it is easy in R, I am using the rvest package to parse HTML and extract information from a website.

I am wondering what my User-Agent is (if there is any) during the request, since a User-Agent is normally associated with an internet browser. Is there a way to set it somehow?

My code that opens a session and extracts information from the HTML is below:

library(rvest)
se <- html_session("http://www.wp.pl") %>%
  html_nodes("[data-st-area=Glonews-mozaika] li:nth-child(7) a") %>%
  html_attr(name = "href")
Marcin Kosiński asked Jul 14 '15 12:07



2 Answers

I used https://httpbin.org/user-agent to find out:

library(rvest)
se <- html_session( "https://httpbin.org/user-agent" )
se$response$request$options$useragent

Output:

[1] "libcurl/7.37.1 r-curl/0.9.1 httr/1.0.0"

See this bug report for a way to override it.
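
If the link is not available: one way to override the default for a whole R session is httr's global configuration, since the string above shows the request going through httr. This may or may not be what the bug report describes. A minimal sketch, assuming httr is installed; the UA string is a made-up placeholder:

```r
library(httr)

# Hypothetical UA string, for illustration only.
ua <- "my-scraper/0.1"

# user_agent() builds an httr config object holding the string.
ua_config <- user_agent(ua)

# Apply it to every subsequent httr request made in this R session.
set_config(ua_config)

# ... open sessions / make requests here ...

# Go back to httr's defaults when done.
reset_config()
```

Because this changes global state, a per-call setting is usually the safer choice in shared scripts.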

Michael Kohl answered Sep 21 '22 01:09


I found this in a tutorial somewhere; it looks like an easier, faster way to do it:

library(rvest)
library(httr)  # provides user_agent()
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
session <- html_session("https://www.linkedin.com/job/", user_agent(uastring))
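
To check that the override took effect, the httpbin.org endpoint from the other answer can echo the header back. A rough sketch: `echo_user_agent` is a hypothetical helper name, the UA string is a placeholder, and the helper calls httr directly rather than going through rvest. It needs network access, so the call is left commented out.

```r
library(httr)

# Hypothetical UA string, for illustration only.
test_ua <- "my-scraper/0.1"

# The config object stores the string under options$useragent.
test_config <- user_agent(test_ua)
print(test_config$options$useragent)

# Hypothetical helper: send a request with the given config and return
# the User-Agent value that httpbin.org reports back as JSON.
echo_user_agent <- function(config) {
  resp <- GET("https://httpbin.org/user-agent", config)
  stop_for_status(resp)
  content(resp, as = "parsed", type = "application/json")[["user-agent"]]
}

# Uncomment to run against the live service:
# echo_user_agent(test_config)
```

Note that in rvest 1.0.0 and later, html_session() is deprecated in favour of session().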
Theodore Van Rooy answered Sep 19 '22 01:09