
Scraping location data in rvest

I'm trying to scrape latitude/longitude data from a list of URLs using rvest. Each page has an embedded Google Map showing a specific location, but the URLs themselves don't reveal the coordinates being passed to the Maps API.

When looking at the page source, I see that the part I'm after is here:

<script type="text/javascript" src="http://maps.google.com/maps/api/js?sensor=false">
</script>
<script type="text/javascript">
function initialize() {
var myLatlng = new google.maps.LatLng(43.805170,-70.722084);
var myOptions = {
  zoom: 16,
  center: myLatlng,
  mapTypeId: google.maps.MapTypeId.SATELLITE
}
var map = new google.maps.Map(document.getElementById("map_canvas"), myOptions);

var marker = new google.maps.Marker({
    position: myLatlng, 
    map: map,
    title:"F.E. Wood & Sons - Natural Energy"
});   

Now, if I can just get the line that has the LatLng(....) input, I can use some string parsing operations to derive the latitude and longitude values for all of the URLs.
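
For illustration, here is a minimal sketch of that string-parsing step in base R. It assumes the line has already been retrieved as a character string; `src_line` and `pattern` are hypothetical names used only for this example.

```r
# Hypothetical input: one line of page source containing the LatLng(...) call.
src_line <- "var myLatlng = new google.maps.LatLng(43.805170,-70.722084);"

# Capture the two comma-separated numbers inside LatLng( ... ),
# allowing optional whitespace around each number.
pattern <- "LatLng\\([[:space:]]*(-?[0-9.]+)[[:space:]]*,[[:space:]]*(-?[0-9.]+)[[:space:]]*\\)"
m <- regmatches(src_line, regexec(pattern, src_line))[[1]]

# m[1] is the full match; m[2] and m[3] are the captured groups.
lat_lng <- as.numeric(m[2:3])
names(lat_lng) <- c("lat", "lng")
lat_lng
```

If the pattern does not match, `m` is empty and `lat_lng` comes back as `NA` values, so a check on `length(m)` would be worth adding before using the result.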

I've written the following code to get my data:

require(rvest)
require(magrittr)
fetchLatLong<-function(url){
  url<-as.character(url)
  solNum<-html(url)%>%
    html_nodes("#map_canvas")%>%
    html_attr("script")
}

(The "map_canvas" selector was found using SelectorGadget; you can view the entire source here.)

I'm having the worst time getting this to read what I'm after. I've tried many nodes and combinations of nodes, to no avail. I've played around with PhantomJS, but the problem is that it's not JS-rendered HTML content I'm after: I'm looking for the input to the API call, which is written directly into the page source (or, at least, appears to be to my amateur eye).

Does anyone have any advice?

jtexnl asked Sep 28 '22 16:09



1 Answer

This seems to work:

library(rvest)
library(magrittr)
library(stringr)

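# Newer rvest releases replace html() with read_html()
pg <- read_html("http://biomassmagazine.com/plants/view/2285")

# Take the second inline <script> under div.pad20, then capture the two
# numbers passed to LatLng(); columns 2 and 3 of the match hold lat and lng
pg %>% 
  html_nodes("div.pad20 > script") %>% 
  extract2(2) %>% 
  html_text %>% 
  str_match_all("LatLng\\(([[:digit:]\\.\\-]+),([[:digit:]\\.\\-]+)") %>% 
  extract2(1) %>% 
  extract(2:3) -> lat_lng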

lat_lng

## [1] "43.805170"  "-70.722084"
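
Since the question involves a list of URLs, the same idea can be wrapped in a function. The following is only a sketch under stated assumptions: `extract_lat_lng` and `urls` are hypothetical names, and instead of relying on the script being the second one under `div.pad20`, it searches the text of every `<script>` node, which may or may not suit other pages. The demo runs on a small inline document so it does not need network access.

```r
library(rvest)

# Hypothetical helper: extract the first LatLng(lat,lng) pair found in the
# inline <script> text of an already-parsed page.
extract_lat_lng <- function(pg) {
  # Concatenate the text of every <script> node so the result does not
  # depend on where the script sits in the page.
  js <- paste(html_text(html_nodes(pg, "script")), collapse = "\n")
  m <- regmatches(js, regexec("LatLng\\(([-0-9.]+),([-0-9.]+)", js))[[1]]
  # No match: return NA values rather than failing.
  if (length(m) < 3) return(c(lat = NA_real_, lng = NA_real_))
  c(lat = as.numeric(m[2]), lng = as.numeric(m[3]))
}

# Offline check on an inline document that mimics the page source.
demo <- read_html('<html><body><div id="map_canvas"></div><script>var myLatlng = new google.maps.LatLng(43.805170,-70.722084);</script></body></html>')
extract_lat_lng(demo)

# For real pages (requires network access; `urls` is a hypothetical vector):
# coords <- t(sapply(urls, function(u) extract_lat_lng(read_html(u))))
```

Returning `NA` for pages without a match keeps one row per URL, which makes it easier to spot which pages need a different selector.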
hrbrmstr answered Oct 03 '22 02:10
