Scraping Youtube comments in R

I'm extracting user comments from a range of websites (like reddit.com), and YouTube is another juicy source of information for me. My existing scraper is written in R:

# x is the url
library(RCurl)  # for getURL
library(XML)    # for htmlParse, xpathSApply, xmlValue

html = getURL(x)
doc  = htmlParse(html, asText = TRUE)
txt  = xpathSApply(doc,
   "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
   xmlValue)

This doesn't work on YouTube data; if you look at the page source of a YouTube video (like this one, for example), you'll find that the comments do not appear in the source, because they are loaded dynamically by JavaScript.
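A quick way to confirm this (a minimal sketch; the search string is a placeholder for a comment that is visible in the browser):

# search the raw HTML fetched above for a comment you can see in the browser
grepl("text of a comment visible on the page", html, fixed = TRUE)
# this returns FALSE because the comments are loaded later by JavaScript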

Does anyone have any suggestions on how to extract data in such circumstances?

Many thanks!

asked Aug 10 '14 by IVR


1 Answer

Following this Answer: R: rvest: scraping a dynamic ecommerce page

You can do the following:

devtools::install_github("ropensci/RSelenium") # install the development version from GitHub (needed, see comment below)

library(RSelenium)
library(rvest)
pJS <- phantom(pjs_cmd = "PATH TO phantomjs.exe") # start the phantomjs binary; path given as I am using Windows
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
remDr$navigate("https://www.youtube.com/watch?v=qRC4Vk6kisY")
remDr$getTitle()[[1]] # [1] "YouTube"

# scroll down the page in steps so that the comments get loaded
for(i in 1:5){      
  remDr$executeScript(paste("scroll(0,",i*10000,");"))
  Sys.sleep(3)    
}

# get the page source and parse it via rvest
# (in newer versions of rvest, html() has been superseded by read_html())
page_source <- remDr$getPageSource()
author <- html(page_source[[1]]) %>% html_nodes(".user-name") %>% html_text()
text   <- html(page_source[[1]]) %>% html_nodes(".comment-text-content") %>% html_text()

# combine the data in a data.frame
dat <- data.frame(author = author, text = text)

Result:
> head(dat)
              author                                                                                       text
1 Kikyo bunny simpie Omg I love fluffy puff she's so adorable when she was dancing on a rainbow it's so cute!!!
2   Tatjana Celinska                                                                                     Ciao 0
3      Yvette Austin                                                                    GET OUT OF MY  HEAD!!!!
4           Susan II                                                                             Watch narhwals
5        Greg Ginger               who in the entire fandom never watched this, should be ashamed,\n\nPFFFTT!!!
6        Arnav Sinha                                                                 LOL what the hell is this?

Comment 1: You do need the GitHub version; see rselenium | get youtube page source

Comment 2: This code gives you the initial 44 comments. Some comments have a "show all answers" link that you would have to click, and to see even more comments you have to click the "Show more" button at the bottom of the page. Clicking is explained in this excellent RSelenium tutorial: http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html (a rough sketch of such a click follows below).
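For illustration, a minimal sketch of clicking such a button with RSelenium, assuming the comment section exposes it under a CSS class like .load-more-button (a hypothetical selector, not confirmed by the answer):

# find the "Show more" button and click it, then wait for the new comments
# NOTE: ".load-more-button" is a hypothetical selector used for illustration
btn <- remDr$findElement(using = "css selector", ".load-more-button")
btn$clickElement()
Sys.sleep(3)

# re-read the page source afterwards to pick up the newly loaded comments
page_source <- remDr$getPageSource()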

answered Oct 21 '22 by Rentrop