A few weeks ago, someone here helped me immensely with getting a list of all the links in the Notable Names Database (NNDB). I was able to run this code and get the following output:
library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"
## Gets A-Z links
all_surname_urls <- read_html(url_base) %>%
  html_nodes(".newslink") %>%
  html_attrs() %>%
  map(pluck(1, 1))
all_ppl_urls <- map(
  all_surname_urls,
  function(x) read_html(x) %>%
    html_nodes("a") %>%
    html_attrs() %>%
    map(pluck(1, 1))
) %>%
  unlist()
all_ppl_urls <- setdiff(
  all_ppl_urls[!duplicated(all_ppl_urls)],
  c(all_surname_urls, "http://www.nndb.com/")
)
all_ppl_urls[1] %>%
  read_html() %>%
  html_nodes("p") %>%
  html_text()
# [1] "AKA Lee William Aaker"
# [2] "Born: 25-Sep-1943Birthplace: Los Angeles, CA"
# [3] "Gender: MaleRace or Ethnicity: WhiteOccupation: Actor"
# [4] "Nationality: United StatesExecutive summary: The Adventures of Rin Tin Tin"
# ...
My original intention was to get the name of each person, their gender, race, occupation and nationality into a single data frame.
A lot of the questions I saw here and on other sites were helpful if your data came in an HTML table, and that's not the case with the Notable Names Database. I know a loop needs to be involved for all 40K pages, but after a weekend of searching for answers I can't seem to find out how. Can someone assist?
Edited to add: I tried following some of the rules here, but this request was a bit more complex.
## I tried to run list <- all_ppl_urls%>% map(read_html) but that was taking a LONG time so I decided to just get the first ten links for the sake of showing my example:
example <- head(all_ppl_urls, 10)
list <- example %>% map(read_html)
test <-list %>% map_df(~{
text_1 <- html_nodes(.x, 'p , b') %>% html_text
and I got this error: Error: In addition: Warning message: closing unused connection 3 (http://www.nndb.com/people/965/000279128/)
Here is a way to get the data by looking at the HTML of each page. It is just one approach that gets reasonably good results, but note that the gsub patterns should be edited to get better ones. The reason is that the profile pages are not homogeneous in how the data are displayed, and this is something you have to deal with. For instance, some profiles list fields such as Religion, Sexual orientation or Party Affiliation, as in the output for all_ppl_urls[100] further down, while others, such as the first profile in the question, do not.
Anyway, you can manage this by adapting this code:
library(purrr)
library(rvest)
# [...] here is your data (all_ppl_urls from the question)
all_ppl_urls[100] %>%
read_html() %>%
html_nodes("p") %>%
html_text()
# [3] "Gender: MaleReligion: Eastern OrthodoxRace or Ethnicity: Middle EasternSexual orientation: StraightOccupation: PoliticianParty Affiliation: Republican"
#-----------------------------------------------------------------------------------------------
# NEW WAY
toString(read_html(all_ppl_urls[100])) #get example of how html looks...
#><b>AKA</b> Edmund Spencer Abraham</p>\n<p><b>Born:</b> <a href=\"/lists/681/000106363/\" class=\"proflink\">12-Jun</a>-<a href=\"/lists/951/000105636/\" class=\"proflink\">1952</a><br><b>Birthplace:</b> <a href=\"/geo/604/000080364/\" class=\"proflink\">East Lansing, MI</a><br></p>\n<p><b>Gender:</b> Male<br><b>
#1. remove NA urls (avoid problems later on)
urls <- all_ppl_urls[!is.na(all_ppl_urls)]
length(all_ppl_urls)
length(urls)
# Function that returns a character vector with the fields of one profile
GetLife <- function(htmlurl) {
  htmltext <- toString(read_html(htmlurl))
  name <- gsub('^.*AKA</b>\\s*|\\s*</p>\n.*$', '', htmltext)
  gender <- gsub('^.*Gender:</b>\\s*|\\s*<br>.*$', '', htmltext)
  race <- gsub('^.*Race or Ethnicity:</b>\\s*|\\s*<br>.*$', '', htmltext)
  occupation <- gsub('^.*Occupation:</b>\\s*|\\s*<br>.*$|\\s*</a>.*$|\\s*</p>.*$', '', htmltext)
  # Occupation seems to have too many hyperlinks, so strip any remaining tags
  occupation <- gsub("<[^>]+>", "", occupation)
  nationality <- gsub('^.*Nationality:</b>\\s*|\\s*<br>.*$', '', htmltext)
  # Clean weird results longer than 100 characters (e.g. when a pattern did not match)
  res <- c(ifelse(nchar(name) > 100, NA, name),
           ifelse(nchar(gender) > 100, NA, gender),
           ifelse(nchar(race) > 100, NA, race),
           ifelse(nchar(occupation) > 100, NA, occupation),
           ifelse(nchar(nationality) > 100, NA, nationality),
           htmlurl)
  return(res)
}
emptydf <- data.frame(matrix(ncol = 6, nrow = 0)) # create empty data frame
colnames(emptydf) <- c("name", "gender", "race", "occupation", "nationality", "url") # set names in empty data frame
urls <- urls[2020:2030] # sample some of the URLs
for (i in seq_along(urls)) {
  emptydf[i, ] <- GetLife(urls[i])
}
emptydf
Here is the result for those 11 URLs:
   name                       gender race     occupation nationality   url
1  <NA>                       Male   White    Business   United States http://www.nndb.com/people/214/000128827/
2  Mark Alexander Ballas, Jr. Male   White    Dancer     United States http://www.nndb.com/people/162/000346121/
3  Thomas Cass Ballenger      Male   White    Politician United States http://www.nndb.com/people/354/000032258/
4  Severiano Ballesteros Sota Male   Hispanic Golf       Spain         http://www.nndb.com/people/778/000116430/
5  Richard Achilles Ballinger Male   White    Government United States http://www.nndb.com/people/511/000168007/
6  Steven Anthony Ballmer     Male   White    Business   United States http://www.nndb.com/people/644/000022578/
7  Edward Michael Balls       Male   White    Politician England       http://www.nndb.com/people/846/000141423/
8  <NA>                       Male   White    Judge      United States http://www.nndb.com/people/533/000168029/
9  <NA>                       Male   Asian    Engineer   England       http://www.nndb.com/people/100/000123728/
10 Michael A. Balmuth         Male   White    Business   United States http://www.nndb.com/people/635/000175110/
11 Aristotle N. Balogh        Male   White    Business   United States http://www.nndb.com/people/311/000172792/
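To see what the gsub patterns in GetLife do, and why they may need editing, they can be tried offline on a small string. This is only a sketch: the fragment below is made up (modeled on the markup shown above, kept on a single line), and "Jane Example" is not a real profile.

```r
# Made-up single-line fragment modeled on the NNDB markup (an assumption, not real data)
htmltext <- paste0(
  "<p><b>AKA</b> Jane Example</p>",
  "<p><b>Gender:</b> Female<br><b>Race or Ethnicity:</b> White<br></p>"
)

# The first alternative removes everything up to and including the label,
# the second removes everything from the next <br> onwards.
gender <- gsub("^.*Gender:</b>\\s*|\\s*<br>.*$", "", htmltext)

# When the label is missing, only the second alternative matches,
# so a chunk of raw HTML is left behind instead of a clean value.
nationality <- gsub("^.*Nationality:</b>\\s*|\\s*<br>.*$", "", htmltext)

print(gender)
print(nationality)
```

On a real page the leftover HTML is usually long, which is what the nchar(...) > 100 check in GetLife is meant to catch. A stricter check, such as testing for a remaining "<", might be worth trying.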
Update
I included an error routine for profiles which cannot be parsed properly. If there is any error you will get an NA row (even if some info could be parsed properly; this is because we read all fields at once and rely on all of them being readable).
Maybe you want to develop the code further so that it returns partial information? You could do this by reading the fields one after another (instead of all at once) and, if there is an error, returning NA for that field rather than for the entire row. This has the downside, however, that the code needs to go through the document not just once per profile but several times.
Here's a function which relies on XPath to select the relevant fields:
library(rvest)
library(glue)
library(tibble)
library(dplyr)
library(purrr)
scrape_profile <- function(url) {
  fields <- c("Gender:", "Race or Ethnicity:", "Occupation:", "Nationality:")
  # One contains() test per field label, joined with "or"
  filter <- glue("contains(text(), '{fields}')") %>%
    paste0(collapse = " or ")
  xp_string <- glue("//b[{filter}]/following::text()[normalize-space()!=''][1]")
  tryCatch({
    doc <- read_html(url)
    # The name is the first bold text on the page
    name <- doc %>%
      html_node(xpath = "(//b/text())[1]") %>%
      as.character()
    doc %>%
      html_nodes(xpath = xp_string) %>%
      as.character() %>%
      gsub("^\\s|\\s$", "", .) %>%
      as.list() %>%
      setNames(c("Gender", "Race", "Occupation", "Nationality")) %>%
      as_tibble() %>%
      mutate(Name = name) %>%
      select(Name, everything())
  }, error = function(err) {
    message(glue("Profile <{url}> could not be parsed properly."))
    tibble(Name = ifelse(exists("name"), name, NA), Gender = NA,
           Race = NA, Occupation = NA,
           Nationality = NA)
  })
}
All you have to do now is to apply scrape_profile to all of your profile URLs:
map_dfr(all_ppl_urls[1:5], scrape_profile)
# # A tibble: 5 x 5
#   Name                Gender Race  Occupation Nationality
#   <chr>               <chr>  <chr> <chr>      <chr>
# 1 Lee Aaker           Male   White Actor      United States
# 2 Aaliyah             Female Black Singer     United States
# 3 Alvar Aalto         Male   White Architect  Finland
# 4 Willie Aames        Male   White Actor      United States
# 5 Kjetil André Aamodt Male   White Skier      Norway
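The update above suggests reading the fields one after another so that a missing field gives NA for that field only. A minimal sketch of that idea follows. The helpers get_field and scrape_profile_partial are hypothetical names, not part of the answer's code, and the HTML string is a made-up stand-in for a profile that lacks two of the fields.

```r
library(rvest)
library(tibble)

# Hypothetical helper: first non-empty text node after the <b> label, or NA
get_field <- function(doc, label) {
  xp <- sprintf(
    "//b[contains(text(), '%s')]/following::text()[normalize-space()!=''][1]",
    label
  )
  nodes <- html_nodes(doc, xpath = xp)
  if (length(nodes) == 0) NA_character_ else trimws(html_text(nodes[[1]]))
}

# Hypothetical variant of scrape_profile that takes an already parsed page
# and queries it once per field
scrape_profile_partial <- function(doc) {
  tibble(
    Gender      = get_field(doc, "Gender:"),
    Race        = get_field(doc, "Race or Ethnicity:"),
    Occupation  = get_field(doc, "Occupation:"),
    Nationality = get_field(doc, "Nationality:")
  )
}

# Made-up fragment (an assumption): Race and Nationality are absent
doc <- read_html(
  "<p><b>Gender:</b> Male<br><b>Occupation:</b> <a href='/x/'>Actor</a></p>"
)
res <- scrape_profile_partial(doc)
print(res)
```

In this sketch the page is parsed once and only the XPath query runs once per field, so the extra cost should be small compared with the download. For real profiles you would pass read_html(url) instead of the inline string, probably still wrapped in tryCatch for pages that fail to load.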
Explanation
The field labels on each profile page are in bold (<b> tags); sometimes there is also a link tag (<a>).
To select them you can use either a css or an XPath selector. However, since we want to select text nodes, XPath seems to be the only(?) option: //b[contains(text(), "Gender:")]/following::text()[normalize-space()!=''][1] selects the first non-empty text node (::text()[normalize-space()!=''][1]) which is following (/following) a <b> tag (//b) which contains the text Gender: ([contains(text(), "Gender:")]).
We build an XPath which matches more than one element, avoiding explicit loops. This we do by pasting several contains(.) statements together, separated by or.
The result is then converted to a tibble.
The name is taken from the first bold (<b>) text.
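To make the combined selector concrete, here is a small offline sketch that builds an "or" filter for two labels and runs it against a made-up fragment. The HTML string is an assumption modeled on the NNDB markup, and base paste0 is used in place of glue only to keep the example dependency-light.

```r
library(rvest)

# Two of the labels, for brevity
fields <- c("Gender:", "Nationality:")

# One contains() test per label, joined with "or"
filter <- paste0("contains(text(), '", fields, "')", collapse = " or ")
xp <- paste0("//b[", filter, "]/following::text()[normalize-space()!=''][1]")
print(xp)

# Made-up fragment (an assumption): one plain value, one unrelated field,
# and one value wrapped in a link
doc <- read_html(paste0(
  "<p><b>Gender:</b> Male<br><b>Born:</b> 1943<br>",
  "<b>Nationality:</b> <a href='/geo/'>United States</a></p>"
))

# One text node per matching <b>, returned in document order
values <- trimws(html_text(html_nodes(doc, xpath = xp)))
print(values)
```

The Born: label is skipped because it matches none of the contains() tests, and the linked nationality is still found because the whitespace-only text node before the <a> is filtered out by normalize-space().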