
How to parse a JavaScript data list with R

I use R to parse HTML, and I would like to know the most efficient way to parse the following code:

<script type="text/javascript">
var utag_data = {
  environnement : "prod",
  device : getDevice(),
  displaytype : getDisplay($(window).innerWidth()),
  pagename : "adview",
  pagetype : "annonce"}</script>

I started to do this:

infos = unlist(xpathApply(page,
                          '//script[@type="text/javascript"]',
                          xmlValue))
infos=gsub('\n|  ','',infos)
infos=gsub("var utag_data = ","",infos)
fromJSON(infos)

The code above returns something really weird:

$nvironnemen
[1] "prod"

$evic
NULL

$isplaytyp
NULL

$agenam
[1] "adview" etc.

I would like to know an efficient way to do this: how can I parse the data list in the JavaScript directly? Thank you.

John Smith asked May 29 '16 21:05

1 Answer

I didn't try your code, but I think your gsub() regexes might be overaggressive (which is probably what is causing the name munging).

It's possible to run JavaScript code using the V8 package, but it won't be able to execute the DOM-based getDevice() and getDisplay() functions, since they don't exist in the V8 engine:

library(V8)
library(rvest)

pg <- read_html('<script type="text/javascript">
var utag_data = {
  environnement : "prod",
  device : getDevice(),
  displaytype : getDisplay($(window).innerWidth()),
  pagename : "adview",
  pagetype : "annonce"}</script>')


script <- html_text(html_nodes(pg, xpath='//script[@type="text/javascript"]'))

ctx <- v8()

ctx$eval(script)
## Error: ReferenceError: getDevice is not defined

However, you can compensate for that:

# Remove the function calls and replace them with empty strings.
# Both calls begin with 'getD', so one pattern covers them:
script <- gsub("getD[[:alpha:]\\(\\)\\$\\.]+,", "'',", script)

ctx$eval(script)
ctx$get("utag_data")

## $environnement
## [1] "prod"
## 
## $device
## [1] ""
## 
## $displaytype
## [1] ""
## 
## $pagename
## [1] "adview"
## 
## $pagetype
## [1] "annonce"
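
An alternative to editing the script text with gsub() is to define the missing names in the V8 context before evaluating the script, so the script runs unchanged. The sketch below is only an illustration of that idea: the stub bodies for window, $, getDevice() and getDisplay() are hypothetical placeholders (the real functions live in the page's own JavaScript and would return real values in a browser), and the script string is copied from the question rather than scraped.

```r
library(V8)

# Script body copied from the question (normally obtained via html_text()).
script <- 'var utag_data = {
  environnement : "prod",
  device : getDevice(),
  displaytype : getDisplay($(window).innerWidth()),
  pagename : "adview",
  pagetype : "annonce"}'

ctx <- v8()

# Hypothetical stand-ins for the browser-side names the script refers to.
# They only exist so that evaluation does not stop with a ReferenceError;
# the values they return are placeholders, not what a browser would produce.
ctx$eval('
  var window = {};
  function $(x) { return { innerWidth: function() { return 0; } }; }
  function getDevice() { return ""; }
  function getDisplay(width) { return ""; }
')

# The original script now evaluates without modification.
ctx$eval(script)
utag_data <- ctx$get("utag_data")
str(utag_data)
```

With stubs like these, device and displaytype come back as empty strings, just as with the gsub() approach, but the regex no longer has to match every function call that might appear in the script.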

hrbrmstr answered Sep 30 '22 05:09