I'm extracting user comments from a range of websites (such as reddit.com), and YouTube is another rich source of information for me. My existing scraper is written in R:
library(RCurl)  # getURL()
library(XML)    # htmlParse(), xpathSApply(), xmlValue()

# x is the url
html = getURL(x)
doc = htmlParse(html, asText=TRUE)
txt = xpathSApply(doc,
  "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
This doesn't work on YouTube. If you look at the source of a YouTube video page, you'll find that the comments do not appear in it.
Does anyone have any suggestions on how to extract data in such circumstances?
Many thanks!
Most data on YouTube is accessible to the general public, but scraping it still has to comply with regulations that deal with personal data and copyright protection.
Several tools and languages support web scraping. R is a common choice because of its rich set of packages, its ease of use, and its dynamic typing.
Following this Answer: R: rvest: scraping a dynamic ecommerce page
You can do the following:
devtools::install_github("ropensci/RSelenium") # install from GitHub
library(RSelenium)
library(rvest)
pJS <- phantom(pjs_cmd = "PATH TO phantomjs.exe") # I am using Windows
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
remDr$navigate("https://www.youtube.com/watch?v=qRC4Vk6kisY")
remDr$getTitle()[[1]] # [1] "YouTube"
# scroll down so that the comments get loaded
for(i in 1:5){
  remDr$executeScript(paste("scroll(0,",i*10000,");"))
  Sys.sleep(3)
}
# Get the page source and parse it once via rvest
page_source <- remDr$getPageSource()
doc <- read_html(page_source[[1]]) # read_html() replaces the deprecated html()
author <- doc %>% html_nodes(".user-name") %>% html_text()
text <- doc %>% html_nodes(".comment-text-content") %>% html_text()
# combine the data in a data.frame
dat <- data.frame(author = author, text = text)
Result:
> head(dat)
author text
1 Kikyo bunny simpie Omg I love fluffy puff she's so adorable when she was dancing on a rainbow it's so cute!!!
2 Tatjana Celinska Ciao 0
3 Yvette Austin GET OUT OF MY HEAD!!!!
4 Susan II Watch narhwals
5 Greg Ginger who in the entire fandom never watched this, should be ashamed,\n\nPFFFTT!!!
6 Arnav Sinha LOL what the hell is this?
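One caveat with building the data frame from two independent selector queries: if a comment lacks one of the two nodes, the vectors end up with different lengths and data.frame() fails. A minimal sketch of one way around this is to select each comment container first and then look up author and text inside it. The container class ".comment-item" and the HTML snippet are made up for illustration; only ".user-name" and ".comment-text-content" come from the code above.

```r
library(rvest)

# Hypothetical stand-in for remDr$getPageSource()[[1]]; the markup is invented
# for this example and is not YouTube's real page structure.
page_html <- '
<html><body>
  <div class="comment-item">
    <span class="user-name">alice</span>
    <div class="comment-text-content">First comment</div>
  </div>
  <div class="comment-item">
    <span class="user-name">bob</span>
  </div>
</body></html>'

doc <- read_html(page_html)

# One node per comment (".comment-item" is an assumed container class).
items <- html_nodes(doc, ".comment-item")

# html_node() on a node set returns exactly one result per input node,
# so missing pieces become NA instead of shifting the rows.
dat <- data.frame(
  author = html_text(html_node(items, ".user-name")),
  text   = html_text(html_node(items, ".comment-text-content")),
  stringsAsFactors = FALSE
)
```

With this approach every row corresponds to one comment container, and a missing author or text shows up as NA in that row.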
Comment 1: You do need the GitHub version; see "rselenium | get youtube page source".
Comment 2: This code gives you the initial 44 comments. Some comments have a "show all answers" link that you would have to click. To see even more comments, you also have to click the "show more" button at the bottom of the page. Clicking is explained in this excellent RSelenium tutorial: http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html
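The clicking mentioned in Comment 2 could look roughly like the sketch below. It is only an outline under assumptions: the helper name click_show_more and the default selector ".load-more-button" are hypothetical, and the real selector has to be looked up in the live page. The calls findElements() and clickElement() are the RSelenium methods for locating and clicking elements.

```r
# Click a "show more" style button until it disappears or max_clicks is reached.
# `remDr` is an open RSelenium remoteDriver (as created in the answer above).
# `selector` is a CSS selector; the default here is a placeholder, not the
# real YouTube selector.
click_show_more <- function(remDr, selector = ".load-more-button",
                            max_clicks = 10, pause = 3) {
  clicks <- 0
  while (clicks < max_clicks) {
    # findElements() returns an empty list when nothing matches,
    # whereas findElement() would raise an error.
    buttons <- remDr$findElements(using = "css selector", value = selector)
    if (length(buttons) == 0) break
    buttons[[1]]$clickElement()
    clicks <- clicks + 1
    Sys.sleep(pause)  # give the newly requested comments time to load
  }
  clicks  # number of clicks actually performed
}
```

After a call such as click_show_more(remDr), the page source can be fetched again with remDr$getPageSource() and parsed as before to pick up the additional comments.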