
Scrape Data from Wikipedia

I am trying to find or build a web scraper that is able to go through and find every state/national park in the US along with their GPS coordinates and land area. I have looked into some frameworks like Scrapy and then I see there are some sites that are specifically for Wikipedia such as http://wiki.dbpedia.org/About. Is there any specific advantage to either one of these or would either one work better to load the information into an online database?

asked Sep 13 '12 by clifgray

3 Answers

Let's suppose you want to parse pages like this Wikipedia page. The following code should work.

//Requires the HtmlAgilityPack library.
//Namespaces used: HtmlAgilityPack, System.Linq, System.Web.

var doc = new HtmlDocument();
doc = .. //Load the document here. See doc.Load(..), doc.LoadHtml(..), etc.

//We get all the rows from the sortable table (except the header)
var rows = doc.DocumentNode.SelectNodes("//table[contains(@class, 'sortable')]//tr").Skip(1);
foreach (var row in rows) {
    //First cell: the park name, taken from the link text
    var name = HttpUtility.HtmlDecode(row.SelectSingleNode("./*[1]/a[@href and @title]").InnerText);
    //The coordinates live in a span with class 'geo-dec'
    var loc = HttpUtility.HtmlDecode(row.SelectSingleNode(".//span[@class='geo-dec']").InnerText);
    //Fifth cell: the land area; skip the hidden sort-key child
    var areaNodes = row.SelectSingleNode("./*[5]").ChildNodes.Skip(1);
    string area = "";
    foreach (var a in areaNodes) {
        area += HttpUtility.HtmlDecode(a.InnerText);
    }

    Console.WriteLine("{0,-30} {1,-20} {2,-10}", name, loc, area);
}

I tested it, and it produces the following output:

Acadia                         44.35°N 68.21°W      47,389.67 acres (191.8 km2)
American Samoa                 14.25°S 170.68°W     9,000.00 acres (36.4 km2)
Arches                         38.68°N 109.57°W     76,518.98 acres (309.7 km2)
Badlands                       43.75°N 102.50°W     242,755.94 acres (982.4 km2)
Big Bend                       29.25°N 103.25°W     801,163.21 acres (3,242.2 km2)
Biscayne                       25.65°N 80.08°W      172,924.07 acres (699.8 km2)
Black Canyon of the Gunnison   38.57°N 107.72°W     32,950.03 acres (133.3 km2)
Bryce Canyon                   37.57°N 112.18°W     35,835.08 acres (145.0 km2)
Canyonlands                    38.2°N 109.93°W      337,597.83 acres (1,366.2 km2)
Capitol Reef                   38.20°N 111.17°W     241,904.26 acres (979.0 km2)
Carlsbad Caverns               32.17°N 104.44°W     46,766.45 acres (189.3 km2)
Channel Islands                34.01°N 119.42°W     249,561.00 acres (1,009.9 km2)
Congaree                       33.78°N 80.78°W      26,545.86 acres (107.4 km2)
Crater Lake                    42.94°N 122.1°W      183,224.05 acres (741.5 km2)
Cuyahoga Valley                41.24°N 81.55°W      32,860.73 acres (133.0 km2)
Death Valley                   36.24°N 116.82°W     3,372,401.96 acres (13,647.6 km2)
Denali                         63.33°N 150.50°W     4,740,911.72 acres (19,185.8 km2)
Dry Tortugas                   24.63°N 82.87°W      64,701.22 acres (261.8 km2)
Everglades                     25.32°N 80.93°W      1,508,537.90 acres (6,104.8 km2)
Gates of the Arctic            67.78°N 153.30°W     7,523,897.74 acres (30,448.1 km2)
Glacier                        48.80°N 114.00°W     1,013,572.41 acres (4,101.8 km2)
(...)

I think that's a start. If some page fails, check whether its layout differs, etc.

Of course, you will also have to find a way of obtaining all the links you want to parse.
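Since the question mentions Scrapy, here is a minimal sketch in Python of that link-collection step, using only the standard library. The HTML fragment and the `/wiki/` prefix check are illustrative; adjust them to the actual markup of the list page you start from.

```python
from html.parser import HTMLParser

class WikiLinkCollector(HTMLParser):
    """Collects article hrefs from anchors inside table rows."""
    def __init__(self):
        super().__init__()
        self.in_row = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.in_row = True
        elif tag == "a" and self.in_row:
            attrs = dict(attrs)
            # Keep only internal article links that carry a title.
            if attrs.get("href", "").startswith("/wiki/") and "title" in attrs:
                self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_row = False

# Illustrative fragment of a list page:
sample = """
<table class="sortable">
  <tr><th>Name</th></tr>
  <tr><td><a href="/wiki/Acadia_National_Park" title="Acadia">Acadia</a></td></tr>
  <tr><td><a href="/wiki/Arches_National_Park" title="Arches">Arches</a></td></tr>
</table>
"""
collector = WikiLinkCollector()
collector.feed(sample)
print(collector.links)
# ['/wiki/Acadia_National_Park', '/wiki/Arches_National_Park']
```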

One important thing: Do you know if it is permitted to scrape Wikipedia? I have no idea, but you should check before doing it... ;)
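One alternative to scraping the rendered HTML at all is the MediaWiki API, which is the sanctioned way to fetch article content programmatically. A sketch of building such a request (the page title is just an example; the request itself is not sent here):

```python
from urllib.parse import urlencode

# Build a MediaWiki API request for the raw wikitext of an article,
# using the standard action=parse interface.
API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "parse",
    "page": "Acadia National Park",
    "prop": "wikitext",
    "format": "json",
}
url = API + "?" + urlencode(params)
print(url)
```

Fetching wikitext this way also sidesteps HTML layout changes, though you then have to parse infobox templates instead of table cells.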

answered Sep 27 '22 by Oscar Mederos


Though the question is a little old, another alternative available right now is to avoid any scraping and get the raw data directly from protectedplanet.net - it contains data from the World Database of Protected Areas and the UN's List of Protected Areas. (Disclosure: I worked for UNEP-WCMC, the organisation that produced and maintains the database and the website.)

It's free for non-commercial use, but you'll need to register to download. For example, this page lets you download 22,600 protected areas in the USA as KMZ, CSV and SHP (containing lat, lng, boundaries, IUCN category and a bunch of other metadata).
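As a sketch of loading such a CSV once downloaded (the column names below are hypothetical; check the actual header of the file you get), using only Python's standard library:

```python
import csv
import io

# Hypothetical extract of the downloaded CSV; the real column
# names may differ, so adjust the keys to the actual header.
sample_csv = """name,lat,lng,area_km2
Acadia National Park,44.35,-68.21,191.8
Arches National Park,38.68,-109.57,309.7
"""

KM2_PER_ACRE = 0.00404686  # standard conversion factor

for row in csv.DictReader(io.StringIO(sample_csv)):
    acres = float(row["area_km2"]) / KM2_PER_ACRE
    print(f'{row["name"]}: ({row["lat"]}, {row["lng"]}) '
          f'{float(row["area_km2"]):.1f} km2 = {acres:,.0f} acres')
```

From here, inserting each row into an online database is a straightforward bulk load.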

answered Sep 27 '22 by onlyjsmith


I would not consider scraping the best approach.

My idea would be to query the API from openstreetmap.org (or any other geo-based API that you can query) for the data you want. National parks are likely to be found pretty easily. You can get the names from a source like Wikipedia and then ask any of the geo APIs to give you the information you want.
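As a sketch of that approach, an Overpass API query (the usual read interface to OpenStreetMap data) might look like the following. The tag `boundary=national_park` is common OSM tagging, but verify it against the current OSM wiki; the HTTP request itself is shown only as a comment so nothing here touches the network.

```python
# Overpass QL query asking for all relations tagged as national
# parks inside the United States, returning tags and centre points.
query = """
[out:json][timeout:60];
area["ISO3166-1"="US"][admin_level=2]->.usa;
relation["boundary"="national_park"](area.usa);
out tags center;
"""

# To run it, POST the query to an Overpass endpoint, e.g.:
#   requests.post("https://overpass-api.de/api/interpreter",
#                 data={"data": query})
print(query.strip())
```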

BTW, what's wrong with Wikipedia's List of National Parks?

answered Sep 27 '22 by DrColossos