Getting Text After a Word--R Webscraping

Tags:

r

web-scraping

A few weeks ago, someone here helped me immensely get a list of all the links in the Notable Names database. I was able to run this code and get the following output:

library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"    
## Gets A-Z links
all_surname_urls <- read_html(url_base) %>%
      html_nodes(".newslink") %>%
      html_attrs() %>%
      map(pluck(1, 1))

all_ppl_urls <- map(
      all_surname_urls, 
      function(x) read_html(x) %>%
        html_nodes("a") %>%
        html_attrs() %>%
        map(pluck(1, 1))
    ) %>% 
      unlist()

all_ppl_urls <- setdiff(
      all_ppl_urls[!duplicated(all_ppl_urls)], 
      c(all_surname_urls, "http://www.nndb.com/")
    )

all_ppl_urls[1] %>%
      read_html() %>%
      html_nodes("p") %>%
      html_text()

# [1] "AKA Lee William Aaker"
# [2] "Born: 25-Sep-1943Birthplace: Los Angeles, CA"
# [3] "Gender: MaleRace or Ethnicity: WhiteOccupation: Actor"
# [4] "Nationality: United StatesExecutive summary: The Adventures of Rin Tin Tin"
# ...

My original intention was to get a single dataframe with each person's name, gender, race, occupation, and nationality.

A lot of the questions I saw here and on other sites were helpful if your data came in an HTML table, and that's not the case with the Notable Names database. I know a loop needs to be involved for all 40K sites, but after a weekend of searching for answers I can't seem to find out how. Can someone assist?

Edited to add: I tried following some of the rules here, but this request was a bit more complex.

I tried to run list <- all_ppl_urls %>% map(read_html) but that was taking a LONG time, so I decided to just get the first ten links for the sake of showing my example:

example <- head(all_ppl_urls, 10)

list  <- example %>% map(read_html)

test <- list %>% map_df(~{
   text_1 <- html_nodes(.x, 'p , b') %>% html_text()
})

and I got this error: Error: In addition: Warning message: closing unused connection 3 (http://www.nndb.com/people/965/000279128/)

asked Apr 08 '19 by tangerine7199




2 Answers

Here you have a way to get the data by looking at each of your HTML files. This is just an approach that gets some good results, but note that those gsub functions will need to be edited to get better results. That is because this website is not homogeneous in how the data are displayed, and that is something you have to deal with. For instance, here are just two screenshots where you can see those differences in web presentation:

(two screenshots of NNDB profile pages with differently formatted field layouts)

Anyway, you can manage this by adapting this code:

library(purrr)
library(rvest)

[...] #here is your data

all_ppl_urls[100] %>%
    read_html() %>%
    html_nodes("p") %>%
    html_text()
# [3] "Gender: MaleReligion: Eastern OrthodoxRace or Ethnicity: Middle EasternSexual orientation: StraightOccupation: PoliticianParty Affiliation: Republican"  

#-----------------------------------------------------------------------------------------------
# NEW WAY
toString(read_html(all_ppl_urls[100])) #get example of how html looks...
#><b>AKA</b> Edmund Spencer Abraham</p>\n<p><b>Born:</b> <a href=\"/lists/681/000106363/\" class=\"proflink\">12-Jun</a>-<a href=\"/lists/951/000105636/\" class=\"proflink\">1952</a><br><b>Birthplace:</b> <a href=\"/geo/604/000080364/\" class=\"proflink\">East Lansing, MI</a><br></p>\n<p><b>Gender:</b> Male<br><b>

#1. remove NA urls (avoid problems later on)
urls <- all_ppl_urls[!is.na(all_ppl_urls)]
length(all_ppl_urls)
length(urls)

#function that builds a character vector with one person's data
GetLife <- function (htmlurl) {
    htmltext <- toString(read_html(htmlurl))
    name <- gsub('^.*AKA</b>\\s*|\\s*</p>\n.*$', '', htmltext)
    gender <- gsub('^.*Gender:</b>\\s*|\\s*<br>.*$', '', htmltext)
    race <- gsub('^.*Race or Ethnicity:</b>\\s*|\\s*<br>.*$', '', htmltext)
    occupation <- gsub('^.*Occupation:</b>\\s*|\\s*<br>.*$|\\s*</a>.*$|\\s*</p>.*$', '', htmltext)
    #as occupation seems to have too many hyperlinks we are making another step
    occupation <- gsub("<[^>]+>", "",occupation)
    nationality <- gsub('^.*Nationality:</b>\\s*|\\s*<br>.*$', '', htmltext)
    res <- c(ifelse(nchar(name)>100, NA, name), #function that cleans weird results >100 chars
             ifelse(nchar(gender)>100, NA, gender),
             ifelse(nchar(race)>100, NA, race),
             ifelse(nchar(occupation)>100, NA, occupation),
             ifelse(nchar(nationality)>100, NA, nationality),
             htmlurl)
    return(res)
}

emptydf <- data.frame(matrix(ncol=6, nrow=0)) #create empty data frame
colnames(emptydf) <- c("name","gender","race","occupation","nationality","url") #set names in empty data frame
urls <- urls[2020:2030] #sample some of the urls
for (i in seq_along(urls)){
    emptydf[i,] <- GetLife(urls[i])
}
emptydf

Here is an example of those 11 urls (2020:2030) analyzed:

                         name gender     race occupation   nationality                                       url
1                        <NA>   Male    White   Business United States http://www.nndb.com/people/214/000128827/
2  Mark Alexander Ballas, Jr.   Male    White     Dancer United States http://www.nndb.com/people/162/000346121/
3       Thomas Cass Ballenger   Male    White Politician United States http://www.nndb.com/people/354/000032258/
4  Severiano Ballesteros Sota   Male Hispanic       Golf         Spain http://www.nndb.com/people/778/000116430/
5  Richard Achilles Ballinger   Male    White Government United States http://www.nndb.com/people/511/000168007/
6      Steven Anthony Ballmer   Male    White   Business United States http://www.nndb.com/people/644/000022578/
7        Edward Michael Balls   Male    White Politician       England http://www.nndb.com/people/846/000141423/
8                        <NA>   Male    White      Judge United States http://www.nndb.com/people/533/000168029/
9                        <NA>   Male    Asian   Engineer       England http://www.nndb.com/people/100/000123728/
10         Michael A. Balmuth   Male    White   Business United States http://www.nndb.com/people/635/000175110/
11        Aristotle N. Balogh   Male    White   Business United States http://www.nndb.com/people/311/000172792/
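
As a side note, growing emptydf row by row inside the loop works, but R handles this more idiomatically without an explicit loop. Here is a minimal loop-free sketch of the same assembly (reusing GetLife and the same column names; this sketch is an addition, not part of the original answer):

rows <- lapply(urls, GetLife)   #GetLife returns a length-6 character vector per url
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
colnames(df) <- c("name","gender","race","occupation","nationality","url")
df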
answered Oct 20 '22 by César Arquero


Update

Included an error routine for profiles which could not be parsed properly. If there is any error, you will get an all-NA row (even if some of the info could be parsed properly); this is because we read all fields at once and rely on all fields being readable.

Maybe you want to develop that code further to return partial information? You could do this by reading the fields one after another (instead of all at once) and, if a field errors, returning NA for that field only instead of for the entire row. The downside, however, is that the code then needs to parse the document several times per profile instead of once. A sketch of that variant appears after the function below.


Here's a function which relies on Xpath to select the relevant fields:

library(rvest)
library(glue)
library(tibble)
library(dplyr)
library(purrr)

scrape_profile <- function(url) {
   fields <- c("Gender:", "Race or Ethnicity:", "Occupation:", "Nationality:")
   filter <- glue("contains(text(), '{fields}')") %>%
                  paste0(collapse = " or ")
   xp_string <- glue("//b[{filter}]/following::text()[normalize-space()!=''][1]") 
   tryCatch({
      doc <- read_html(url)
      name <- doc %>%
                html_node(xpath = "(//b/text())[1]") %>% 
                as.character()
      doc %>%
         html_nodes(xpath = xp_string) %>%
         as.character() %>%
         gsub("^\\s|\\s$", "", .) %>%
         as.list() %>%
         setNames(c("Gender", "Race", "Occupation", "Nationality")) %>%
         as_tibble() %>%
         mutate(Name = name) %>%
         select(Name, everything())
   }, error = function(err) {
      message(glue("Profile <{url}> could not be parsed properly."))
      tibble(Name = ifelse(exists("name"), name, NA), Gender = NA,
             Race = NA, Occupation = NA,
             Nationality = NA)
   })
}
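
Here is a minimal sketch of the field-by-field variant mentioned in the update; the name scrape_profile_partial and its exact structure are my assumptions, not part of the original answer. It parses the document once, then reads each labelled field separately so an error in one field yields NA for that field only:

#hypothetical field-by-field variant (sketch); assumes the libraries loaded above
scrape_profile_partial <- function(url) {
   fields <- c(Gender = "Gender:", Race = "Race or Ethnicity:",
               Occupation = "Occupation:", Nationality = "Nationality:")
   doc <- tryCatch(read_html(url), error = function(err) NULL)
   if (is.null(doc)) {
      return(tibble(Name = NA_character_, Gender = NA_character_,
                    Race = NA_character_, Occupation = NA_character_,
                    Nationality = NA_character_))
   }
   get_field <- function(label) {
      xp <- glue("//b[contains(text(), '{label}')]/following::text()[normalize-space()!=''][1]")
      tryCatch(
         doc %>% html_node(xpath = xp) %>% html_text() %>% trimws(),
         error = function(err) NA_character_
      )
   }
   name <- tryCatch(
      doc %>% html_node(xpath = "(//b/text())[1]") %>% html_text(),
      error = function(err) NA_character_
   )
   as_tibble(c(list(Name = name), map(fields, get_field)))
}

It is applied the same way, e.g. map_dfr(all_ppl_urls[1:5], scrape_profile_partial).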

All you have to do now is to apply scrape_profile to all of your profile urls:

map_dfr(all_ppl_urls[1:5], scrape_profile)
# # A tibble: 5 x 5
#   Name                Gender Race  Occupation Nationality  
#   <chr>               <chr>  <chr> <chr>      <chr>        
# 1 Lee Aaker           Male   White Actor      United States
# 2 Aaliyah             Female Black Singer     United States
# 3 Alvar Aalto         Male   White Architect  Finland      
# 4 Willie Aames        Male   White Actor      United States
# 5 Kjetil André Aamodt Male   White Skier      Norway 
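
To run this over the full list of roughly 40K profiles, a hedged sketch: purrr::possibly() turns any profile that still errors into a one-row NA tibble, and Sys.sleep() keeps the request rate polite (the 0.5 second pause is an arbitrary choice, not something the site prescribes):

#scale-up sketch: NA row on failure, small pause between requests
safe_scrape <- possibly(
   scrape_profile,
   otherwise = tibble(Name = NA_character_, Gender = NA_character_,
                      Race = NA_character_, Occupation = NA_character_,
                      Nationality = NA_character_)
)
all_profiles <- map_dfr(all_ppl_urls, function(url) {
   Sys.sleep(0.5)   #arbitrary politeness delay
   safe_scrape(url)
})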

Explanation

  1. Identify Structure of Website: When looking at the source code of the profile pages, you see that all relevant information except the name follows a label in bold (i.e. <b> tags); sometimes there is also a link tag (<a>).
  2. Construct Selector: With this information we can now construct either a CSS or an XPath selector. However, since we want to select text nodes, XPath seems to be the only(?) option: //b[contains(text(), "Gender:")]/following::text()[normalize-space()!=''][1] selects
    • the first non-empty text node ::text()[normalize-space()!=''][1] which is
    • a sibling (/following) of
    • a <b> tag (//b) which
    • contains the text Gender: ([contains(text(), "Gender:")])
  3. Multiple Select: Since all fields are labelled in the same way, we can construct one XPath which matches more than one element, avoiding explicit loops. We do this by pasting the several contains(.) statements together, separated by or (the assembled selector is shown after this list).
  4. Further Formatting: Finally, we trim the whitespace and return the result in a tibble.
  5. Name Field: The last step is to extract the name, which is basically the first bold (<b>) text.
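
For concreteness, this is the selector that glue() and paste0(collapse = " or ") assemble into xp_string inside scrape_profile, wrapped here for readability:

#value of xp_string (a single line in the code):
# //b[contains(text(), 'Gender:') or contains(text(), 'Race or Ethnicity:') or
#     contains(text(), 'Occupation:') or contains(text(), 'Nationality:')]
#    /following::text()[normalize-space()!=''][1]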
answered Oct 20 '22 by thothal