Extracting title from link in R

I'm practicing web scraping with the rvest package in R. This page has been a great guide so far (http://zevross.com/blog/2015/05/19/scrape-website-data-with-the-new-r-package-rvest/). Using the Selector Gadget tool, I can identify the class or div element reference for the items I want (as far as I understand).

So I just went to Wikipedia and am trying to extract the list of the U.S. Presidents. The link to that page is https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States. Selector Gadget told me that the element class/div/???? (not sure what to call it) is "big a".

Here's my code so far:

library(rvest)

site = read_html("https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States")
fnames = html_nodes(site, "big a")

And a partial output is:

{xml_nodeset (44)}
 [1] <a href="/wiki/George_Washington" title="George Washington">George Washington</a>
 [2] <a href="/wiki/John_Adams" title="John Adams">John Adams</a>
 [3] <a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a>
 [4] <a href="/wiki/James_Madison" title="James Madison">James Madison</a>
 [5] <a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a>
 [6] <a href="/wiki/John_Quincy_Adams" title="John Quincy Adams">John Quincy Adams</a>
 [7] <a href="/wiki/Andrew_Jackson" title="Andrew Jackson">Andrew Jackson</a>
 [8] <a href="/wiki/Martin_Van_Buren" title="Martin Van Buren">Martin Van Buren</a>

Great! So I have extracted the names with links! I just want the names though so I'm not sure how to proceed here. Is there a way to easily grab the names between the link html code? Or should I grab another element instead with the html_nodes function? I feel like I'm close!

Thank you for any help.

user137698 asked Jun 07 '16 19:06


1 Answer

There are two sources for the names: the title attribute of each link and the text between the tags. They may be formatted slightly differently, or one may include middle initials or whatever. Use the one you like best.

html_attr(fnames, "title")

OR

html_text(fnames)
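
If you want to try both without hitting Wikipedia, here's a minimal sketch. It assumes a small inline HTML snippet standing in for the real page; the snippet and the variable names (page, links, from_title, from_text) are made up for illustration, and the live page's markup may differ.

```r
library(rvest)

# Inline stand-in for the Wikipedia page (an assumption, so this runs offline).
page <- read_html('
  <big><a href="/wiki/George_Washington" title="George Washington">George Washington</a></big>
  <big><a href="/wiki/John_Adams" title="John Adams">John Adams</a></big>
')

# Same selector as in the question: <a> elements nested inside <big>.
links <- html_nodes(page, "big a")

# Option 1: read the title attribute of each link.
from_title <- html_attr(links, "title")

# Option 2: read the text between the opening and closing <a> tags.
from_text <- html_text(links)

print(from_title)  # "George Washington" "John Adams"
print(from_text)   # "George Washington" "John Adams"
```

Both calls return a plain character vector, so you can compare them with identical(from_title, from_text) and keep whichever one is formatted the way you want.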

cory answered Nov 13 '22 10:11