Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Importing wikipedia tables in R

Tags:

dataframe

r

I regularly extract tables from Wikipedia. Excel's web import does not work properly for wikipedia, as it treats the whole page as a table. In google spreadsheet, I can enter this:

=ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan","table",3)

and this function will download the 3rd table, which lists all the counties of the UP of Michigan, from that page.

Is there something similar in R? or can be created via a user defined function?

like image 698
karlos Avatar asked Sep 13 '11 20:09

karlos


People also ask

Can you download Wikipedia tables?

Convert Wiki Tables to CSV Enter the URL of the Wiki page containing the table(s) and hit "Send". Copy the result to your clipboard or download the table(s) as CSV file(s).


2 Answers

Building on Andrie's answer, and addressing SSL. If you can take one additional library dependency:

library(httr)
library(XML)

url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"

r <- GET(url)

doc <- readHTMLTable(
  doc=content(r, "text"))

doc[6]
like image 185
schnee Avatar answered Oct 20 '22 08:10

schnee


The function readHTMLTable in package XML is ideal for this.

Try the following:

library(XML)
doc <- readHTMLTable(
         doc="http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")

doc[[6]]

            V1         V2                 V3                              V4
1       County Population Land Area (sq mi) Population Density (per sq mi)
2        Alger      9,862                918                            10.7
3       Baraga      8,735                904                             9.7
4     Chippewa     38,413               1561                            24.7
5        Delta     38,520               1170                            32.9
6    Dickinson     27,427                766                            35.8
7      Gogebic     17,370               1102                            15.8
8     Houghton     36,016               1012                            35.6
9         Iron     13,138               1166                            11.3
10    Keweenaw      2,301                541                             4.3
11        Luce      7,024                903                             7.8
12    Mackinac     11,943               1022                            11.7
13   Marquette     64,634               1821                            35.5
14   Menominee     25,109               1043                            24.3
15   Ontonagon      7,818               1312                             6.0
16 Schoolcraft      8,903               1178                             7.6
17       TOTAL    317,258             16,420                            19.3

readHTMLTable returns a list of data.frames for each element of the HTML page. You can use names to get information about each element:

> names(doc)
 [1] "NULL"                                                                               
 [2] "toc"                                                                                
 [3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
 [4] "NULL"                                                                               
 [5] "Cities and Villages of the Upper Peninsula"                                         
 [6] "Upper Peninsula Land Area and Population Density by County"                         
 [7] "19th Century Population by Census Year of the Upper Peninsula by County"            
 [8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"   
 [9] "NULL"                                                                               
[10] "NULL"                                                                               
[11] "NULL"                                                                               
[12] "NULL"                                                                               
[13] "NULL"                                                                               
[14] "NULL"                                                                               
[15] "NULL"                                                                               
[16] "NULL" 
like image 13
Andrie Avatar answered Oct 20 '22 09:10

Andrie