Scraping Youtube comments in R

I'm extracting user comments from a range of websites (like reddit.com), and YouTube is another juicy source of information for me. My existing scraper is written in R:

# x is the url
library(RCurl)  # for getURL
library(XML)    # for htmlParse, xpathSApply, xmlValue

html = getURL(x)
doc  = htmlParse(html, asText = TRUE)
txt  = xpathSApply(doc,
   "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
   xmlValue)

This doesn't work on YouTube data; if you look at the page source of a YouTube video (like this one, for example), you'll find that the comments do not appear in the source, because they are loaded dynamically by JavaScript.
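A quick way to confirm this (a minimal sketch; the search string is a placeholder for a comment that is visible in the browser):

# search the raw HTML fetched above for a comment you can see in the browser
grepl("text of a comment visible on the page", html, fixed = TRUE)
# this returns FALSE because the comments are loaded later by JavaScript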

Does anyone have any suggestions on how to extract data in such circumstances?

Many thanks!

asked Aug 10 '14 by IVR


1 Answer

Following this Answer: R: rvest: scraping a dynamic ecommerce page

You can do the following:

devtools::install_github("ropensci/RSelenium") # install the development version from GitHub (needed, see comment below)

library(RSelenium)
library(rvest)
pJS <- phantom(pjs_cmd = "PATH TO phantomjs.exe") # start the phantomjs binary; path given as I am using Windows
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
remDr$navigate("https://www.youtube.com/watch?v=qRC4Vk6kisY")
remDr$getTitle()[[1]] # [1] "YouTube"

# scroll down the page in steps so that the comments get loaded
for(i in 1:5){      
  remDr$executeScript(paste("scroll(0,",i*10000,");"))
  Sys.sleep(3)    
}

# get the page source and parse it via rvest
# (in newer versions of rvest, html() has been superseded by read_html())
page_source <- remDr$getPageSource()
author <- html(page_source[[1]]) %>% html_nodes(".user-name") %>% html_text()
text   <- html(page_source[[1]]) %>% html_nodes(".comment-text-content") %>% html_text()

# combine the data in a data.frame
dat <- data.frame(author = author, text = text)

Result:
> head(dat)
              author                                                                                       text
1 Kikyo bunny simpie Omg I love fluffy puff she's so adorable when she was dancing on a rainbow it's so cute!!!
2   Tatjana Celinska                                                                                     Ciao 0
3      Yvette Austin                                                                    GET OUT OF MY  HEAD!!!!
4           Susan II                                                                             Watch narhwals
5        Greg Ginger               who in the entire fandom never watched this, should be ashamed,\n\nPFFFTT!!!
6        Arnav Sinha                                                                 LOL what the hell is this?

Comment 1: You do need the GitHub version; see rselenium | get youtube page source

Comment 2: This code gives you the initial 44 comments. Some comments have a "show all answers" link that you would have to click, and to see even more comments you have to click the "Show more" button at the bottom of the page. Clicking is explained in this excellent RSelenium tutorial: http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html (a rough sketch of such a click follows below).
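For illustration, a minimal sketch of clicking such a button with RSelenium, assuming the comment section exposes it under a CSS class like .load-more-button (a hypothetical selector, not confirmed by the answer):

# find the "Show more" button and click it, then wait for the new comments
# NOTE: ".load-more-button" is a hypothetical selector used for illustration
btn <- remDr$findElement(using = "css selector", ".load-more-button")
btn$clickElement()
Sys.sleep(3)

# re-read the page source afterwards to pick up the newly loaded comments
page_source <- remDr$getPageSource()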

answered Oct 21 '22 by Rentrop