I use R to parse HTML, and I would like to know the most efficient way to parse the following code:
<script type="text/javascript">
var utag_data = {
environnement : "prod",
device : getDevice(),
displaytype : getDisplay($(window).innerWidth()),
pagename : "adview",
pagetype : "annonce"}</script>
I started to do this:
infos = unlist(xpathApply(page,
'//script[@type="text/javascript"]',
xmlValue))
infos=gsub('\n| ','',infos)
infos=gsub("var utag_data = ","",infos)
fromJSON(infos)
The code above returns something really weird:
$nvironnemen
[1] "prod"
$evic
NULL
$isplaytyp
NULL
$agenam
[1] "adview" etc.
I would like to know how to do this efficiently: how can I parse the data list in the JavaScript directly? Thank you.
I didn't try your code, but I think your gsub() regexes might be overaggressive (which is probably causing the name munging).
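If only the quoted string fields are needed, a narrower pattern that captures `key : "value"` pairs avoids stripping characters from the whole script. The following is a minimal base-R sketch under that assumption; the variable names (`script`, `pattern`, `utag`) are illustrative, and entries whose values are function calls (device, displaytype) are simply skipped rather than evaluated.

```r
# The script body from the question, as a plain string
script <- 'var utag_data = {
environnement : "prod",
device : getDevice(),
displaytype : getDisplay($(window).innerWidth()),
pagename : "adview",
pagetype : "annonce"}'

# Match only entries of the form  key : "value"
pattern <- '([[:alnum:]_]+)[[:space:]]*:[[:space:]]*"([^"]*)"'
m <- regmatches(script, gregexpr(pattern, script, perl = TRUE))[[1]]

# Split each match into its key and its value
keys <- sub(pattern, "\\1", m, perl = TRUE)
vals <- sub(pattern, "\\2", m, perl = TRUE)

# Named list holding only the literal string fields
utag <- setNames(as.list(vals), keys)
str(utag)
```

This only works for flat objects with double-quoted string values; anything nested or computed needs a real JavaScript engine.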
It's possible to run JavaScript code using the V8 package, but it won't be able to execute the DOM-based getDevice() and getDisplay() functions since they don't exist in the V8 engine:
library(V8)
library(rvest)
pg <- read_html('<script type="text/javascript">
var utag_data = {
environnement : "prod",
device : getDevice(),
displaytype : getDisplay($(window).innerWidth()),
pagename : "adview",
pagetype : "annonce"}</script>')
script <- html_text(html_nodes(pg, xpath='//script[@type="text/javascript"]'))
ctx <- v8()
ctx$eval(script)
## Error: ReferenceError: getDevice is not defined
However, you can compensate for that:
# we need to remove the function calls and replace them with empty strings;
# since both begin with 'getD' this is pretty easy:
script <- gsub("getD[[:alpha:]\\(\\)\\$\\.]+,", "'',", script)
ctx$eval(script)
ctx$get("utag_data")
## $environnement
## [1] "prod"
##
## $device
## [1] ""
##
## $displaytype
## [1] ""
##
## $pagename
## [1] "adview"
##
## $pagetype
## [1] "annonce"
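An alternative to editing the script with gsub() is to define stand-ins for the missing browser functions in the V8 context before evaluating it. This is only a sketch: the stub bodies below (returning empty strings and a width of 0) are assumptions for illustration, not what the real page's getDevice(), getDisplay() or jQuery's $ would return.

```r
library(V8)

# The unmodified script body from the page
script <- 'var utag_data = {
environnement : "prod",
device : getDevice(),
displaytype : getDisplay($(window).innerWidth()),
pagename : "adview",
pagetype : "annonce"}'

ctx <- v8()

# Hypothetical stubs so the script can run outside a browser
ctx$eval('
  var window = {};
  function $(x) { return { innerWidth: function () { return 0; } }; }
  function getDevice() { return ""; }
  function getDisplay(width) { return ""; }
')

# The original script now evaluates without a ReferenceError
ctx$eval(script)
utag <- ctx$get("utag_data")
str(utag)
```

The advantage is that the script text is left untouched, so this does not depend on the function names sharing a prefix; the cost is one stub per missing function.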