
Which selector to write in rvest package in R?

I am trying to extract information from the source code of a specific website.

In the source code there are lines:

# [[4]]
# <script type="text/javascript">
#   <![CDATA[
#     <!-- // <![CDATA[
#       var wp_dot_addparams = {
#         "cid": "148938",
#         "ctype": "article",
#         "ctags": "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions",
#         "cauthor": "",
#         "csource": "film.wp.pl",
#         "cpageno": 1,
#         "cpagemax": 1,
#         "cdate": "2015-02-18"
#       };
#       // ]]]]><![CDATA[> -->
#                          ]]>
#   </script> 

From which I'd like to extract:

"ctags": "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions",

Does anyone know which selector I should pass to the html_nodes function from the rvest package in R?

html("http://film.wp.pl/id,148938,title,dziejesiewkulturze-Codzienna-dawka-informacji-kulturalnych-180215-WIDEO,wiadomosc.html") %>%
  html_nodes("script")
Marcin Kosiński asked Mar 30 '15 14:03


1 Answer

  1. Narrow the selector to the script element that contains "var wp_dot_addparams", then extract the JSON object from that element's text.

  2. Parse it into a list using jsonlite's fromJSON() function.

  3. Access the tags directly with json$ctags.

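    library(rvest)     # html_nodes(), html_text() and the %>% pipe
    library(jsonlite)  # fromJSON()
    
    # Newer rvest versions use read_html(); the original call here was html()
    json <- read_html("http://film.wp.pl/id,148938,title,dziejesiewkulturze-Codzienna-dawka-informacji-kulturalnych-180215-WIDEO,wiadomosc.html") %>%
      # keep only the <script> element that defines wp_dot_addparams
      html_nodes("script:contains('var wp_dot_addparams')") %>%
      # work on the script's text rather than on the node object
      html_text() %>%
      # strip everything except the {...} object literal
      gsub(x = ., pattern = ".*var wp_dot_addparams = (\\{.*\\});.*", replacement = "\\1") %>%
      fromJSON()
    
    json$ctags
    
    [1] "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions"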
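
If you want to check the extraction step without hitting the site, here is a minimal offline sketch. The script_text string is a hand-made stand-in for the script's text (shortened from the snippet in the question), and the names script_text, json_text, params and tags are only illustrative. It uses a slightly different pattern than the one above, "\\{[^}]*\\}", which assumes the object is flat (no nested braces), and then splits ctags into individual tags:

```r
library(jsonlite)

# Hypothetical stand-in for the text of the matching <script> element
script_text <- paste(
  "<!-- // <![CDATA[",
  "var wp_dot_addparams = {",
  '  "cid": "148938",',
  '  "ctype": "article",',
  '  "ctags": "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions",',
  '  "cpageno": 1',
  "};",
  "// ]]> -->",
  sep = "\n"
)

# Grab the first {...} block; assumes the object contains no nested braces
json_text <- regmatches(script_text, regexpr("\\{[^}]*\\}", script_text))

# Parse the object literal into a named list
params <- fromJSON(json_text)

# ctags is a single comma-separated string; split it into a character vector
tags <- strsplit(params$ctags, ",", fixed = TRUE)[[1]]
```

After this, params$ctags holds the same comma-separated string as json$ctags above, and tags holds the individual entries.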
    
Robert Kingston answered Oct 07 '22 06:10