I am trying to extract informations from source code of a specific website
In the source code there are lines:
# [[4]]
# <script type="text/javascript">
# <![CDATA[
# <!-- // <![CDATA[
# var wp_dot_addparams = {
# "cid": "148938",
# "ctype": "article",
# "ctags": "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions",
# "cauthor": "",
# "csource": "film.wp.pl",
# "cpageno": 1,
# "cpagemax": 1,
# "cdate": "2015-02-18"
# };
# // ]]]]><![CDATA[> -->
# ]]>
# </script>
From which I'd like to extract:
"ctags": "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions",
Does anyone know how I should specify the selector in html_nodes
function in rvest
package in R?
html("http://film.wp.pl/id,148938,title,dziejesiewkulturze-Codzienna-dawka-informacji-kulturalnych-180215-WIDEO,wiadomosc.html") %>%
html_nodes("script")
Extract the JSON object from the element's text (tidy the selector up while you're at it)
Parse it as a list using jsonlite's fromJSON() function.
You can access it directly using "$ctags"
library(jsonlite)
json <- html("http://film.wp.pl/id,148938,title,dziejesiewkulturze-Codzienna-dawka-informacji-kulturalnych-180215-WIDEO,wiadomosc.html") %>%
html_nodes("script:contains('var wp_dot_addparams')") %>%
gsub(x=., pattern=".*var wp_dot_addparams = (\\{.*\\});.*",replacement="\\1") %>%
fromJSON()
json$ctags
[1] "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With