
RSelenium and findElements with inspect element use

Tags: r, web-scraping

I would like some help getting each verse of this Bible chapter from the following website as a row of strings in a dataframe.

I am struggling to find the correct element and don't know how to use findElements() in conjunction with the browser's inspect element tool. Any indication of how to do this more generally for other parts of the page, e.g. cross references or footnotes, would be great. (Note that the cross references can be shown by adjusting the 'page options', reached by clicking on the cog near the top of the page.)

Below is the code I have attempted.

chapter.url <- "https://www.biblegateway.com/passage/?search=Genesis+50&version=ESV"
library(RSelenium)
RSelenium:::startServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate(chapter.url)
webElem <- remDr$findElements('id','passage-text')
h.l.m asked Sep 10 '14 09:09


1 Answer

Normally I would target the relevant HTML. Inspecting the page with Firefox's Firebug or a similar tool reveals the element that wraps the passage text.

The relevant HTML snippet is <div class="version-ESV result-text-style-normal text-html ">. So we could find the element with class version-ESV:

chapter.url <- "https://www.biblegateway.com/passage/?search=Genesis+50&version=ESV"
library(RSelenium)
RSelenium:::startServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate(chapter.url)
webElem <- remDr$findElement('class', 'version-ESV')
webElem$highlightElement() # check visually we have the right element

The highlightElement method gives us visual confirmation that we have the required block of HTML. Finally we can get this snippet of HTML using the getElementAttribute method:

appData <- webElem$getElementAttribute("outerHTML")[[1]]

This HTML can then be parsed for the verses using the XML package.
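As a rough sketch of that parsing step, the snippet below parses a small hand-written HTML string and lists the id of every span, which is a quick way to see what there is to target. The string sampleHTML and its contents are invented for illustration and merely stand in for appData; the real outerHTML is much larger and its exact markup may differ.

```r
library(XML)

# Hypothetical stand-in for appData (not copied from the site).
sampleHTML <- paste0(
  '<div class="version-ESV result-text-style-normal text-html">',
  '<p><span id="en-ESV-1" class="text">First verse.</span> ',
  '<span id="en-ESV-2" class="text">Second verse.</span></p>',
  '</div>'
)

# asText = TRUE tells htmlParse the input is markup, not a file name or URL.
sampleDoc <- htmlParse(sampleHTML, asText = TRUE, encoding = "UTF-8")

# Collect the id attribute of every span in the parsed document.
spanIds <- xpathSApply(sampleDoc, "//span", xmlGetAttr, "id")
print(spanIds)
```

Running the same xpathSApply call against the parsed appData should show which ids, if any, follow a pattern worth matching.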

UPDATE:

Each verse is contained in a span whose id starts with "en-ESV-". We can target these spans with the XPath '//span[contains(@id,"en-ESV-")]'. However, within these spans we only want the child nodes that are text nodes. Once we find these text nodes, we paste them together, separating them with spaces:

library(XML) # provides htmlParse, xpathSApply, xmlChildren and xmlValue

appXPATH <- '//span[contains(@id,"en-ESV-")]'
# Keep only the direct text-node children of a span and join them with spaces
appFunc <- function(x){
  appChildren <- xmlChildren(x)
  out <- appChildren[names(appChildren) == "text"]
  paste(sapply(out, xmlValue), collapse = ' ')
}
doc <- htmlParse(appData, encoding = 'UTF8') # specify encoding
results <- xpathSApply(doc, appXPATH, appFunc)
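To see why restricting to text-node children matters, here is a toy example on invented markup. It assumes, purely for illustration, that verse numbers and cross-reference markers are wrapped in sup elements inside the verse span; the names toyHTML, toyNode, allText and verseOnly are hypothetical.

```r
library(XML)

# Hypothetical markup: verse text interleaved with a verse number and a
# cross-reference marker, both wrapped in <sup> elements.
toyHTML <- paste0(
  '<p><span id="en-ESV-1"><sup>1</sup>Then Joseph ',
  '<sup>(A)</sup>fell on his face.</span></p>'
)
toyDoc <- htmlParse(toyHTML, asText = TRUE, encoding = "UTF-8")
toyNode <- getNodeSet(toyDoc, '//span[contains(@id,"en-ESV-")]')[[1]]

# xmlValue on the whole span concatenates the text of every descendant,
# so the verse number and the marker end up in the string.
allText <- xmlValue(toyNode)

# Keeping only direct children named "text" skips the <sup> elements.
kids <- xmlChildren(toyNode)
verseOnly <- paste(sapply(kids[names(kids) == "text"], xmlValue), collapse = " ")

print(allText)
print(verseOnly)
```

Joining the remaining pieces with a space can leave a doubled space where a marker was dropped; gsub("\\s+", " ", verseOnly) collapses it if that matters.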

This gives the following results:

> head(results)
[1] "Then Joseph  fell on his father's face and wept over him and kissed him."                                                                                                                                                   
[2] "And Joseph commanded his servants the physicians to  embalm his father. So the physicians embalmed Israel."                                                                                                                 
[3] "Forty days were required for it, for that is how many are required for embalming. And the Egyptians  wept for him seventy days."                                                                                            
[4] "And when the days of weeping for him were past, Joseph spoke to the household of Pharaoh, saying,  “If now I have found favor in your eyes, please speak in the ears of Pharaoh, saying,"                                   
[5] "‘My father made me swear, saying, “I am about to die: in my tomb  that I hewed out for myself in the land of Canaan, there shall you bury me.” Now therefore, let me please go up and bury my father. Then I will return.’”"
[6] "And Pharaoh answered, “Go up, and bury your father, as he made you swear.”"                                                                                    
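Since the question asked for one verse per row of a dataframe, a minimal sketch of that last step follows. The vector verses is a stand-in for results, holding the first two strings from the output above, and the column names verse and text are my own choice. It assumes each element corresponds to exactly one verse, in order, which may not hold if the page splits a verse across several spans.

```r
# Stand-in for `results`: the first two strings from the output above.
verses <- c(
  "Then Joseph  fell on his father's face and wept over him and kissed him.",
  "And Joseph commanded his servants the physicians to  embalm his father. So the physicians embalmed Israel."
)

# Collapse runs of whitespace (left where markers were dropped) and trim.
cleaned <- trimws(gsub("\\s+", " ", verses))

verseDF <- data.frame(
  verse = seq_along(cleaned),  # assumes one element per verse, in order
  text = cleaned,
  stringsAsFactors = FALSE     # keep the verse text as character strings
)
print(verseDF)
```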
jdharrison answered Sep 21 '22 09:09