
Parsing Wikipedia countries, regions, cities

Is it possible to get a list of all Wikipedia countries, regions and cities, with the relations between them? I couldn't find any API appropriate for this task. What would be the easiest way to parse all the information I need? PS: I know that there are other data sources I can get this information from, but I am interested in Wikipedia...

Vladimir Shevchenko asked Jul 11 '14 11:07



2 Answers

[2020 update] This is now best done with the Wikidata Query Service, where you can run very specific queries with a bit of SPARQL. Example: find all countries and their labels. See the Wikidata Query Help.
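As a rough illustration of the Query Service route, here is a minimal Python sketch that asks for every item that is an instance of (P31) country (Q6256), together with its English label. The endpoint URL and the SPARQL are standard for the Wikidata Query Service; the function names and the User-Agent string are made up for this example, and the query is only one possible way to phrase "all countries".

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Items that are an instance of (P31) country (Q6256), with English labels.
COUNTRIES_QUERY = """
SELECT ?country ?countryLabel WHERE {
  ?country wdt:P31 wd:Q6256 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_request(query, user_agent="example-geo-script/0.1"):
    """Build a GET request for the endpoint (the User-Agent value is a placeholder)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={"User-Agent": user_agent},
    )


def parse_bindings(payload):
    """Turn a SPARQL JSON result into a list of (QID, label) tuples."""
    rows = []
    for binding in payload["results"]["bindings"]:
        # Entity URIs look like http://www.wikidata.org/entity/Q16
        qid = binding["country"]["value"].rsplit("/", 1)[-1]
        rows.append((qid, binding["countryLabel"]["value"]))
    return rows


def fetch_countries():
    """Run the query over the network and return (QID, label) tuples."""
    with urllib.request.urlopen(build_request(COUNTRIES_QUERY), timeout=60) as resp:
        return parse_bindings(json.load(resp))
```

Calling `fetch_countries()` performs the actual network request; replace the placeholder User-Agent with something that identifies your script before running it against the live service.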


It might be a bit tedious to get the whole graph, but you can get most of the data from the experimental, unofficial Wikidata Query API.

I suggest the following workflow:

  • Go to an instance of the kind of entity you want to work with, say Estonia (Q191), and look at its instance of (P31) values. You will find: country, sovereign state, member of the UN, member of the EU, etc.

  • Use the Wikidata Query API claim command to output every entity that has the chosen P31 value. Let's try with country (Q6256):

    http://wdq.wmflabs.org/api?q=claim[31:6256]

It outputs an array of numeric ids: those are your countries! (Notice that the result is still incomplete, as only 141 items are found: either countries are missing from Wikidata or, as suggested by Nemo in the comments, some countries are instances of subclasses (P279) of country (Q6256).)

  • You may want more than ids though, so you can ask the official Wikidata API for the entities' data:

    https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16&format=json&props=labels|claims&languages=en|fr

    (Here, the data for Canada (Q16), in JSON, with only claims and labels, in English and French. Look at the documentation to adapt the parameters to your needs.)

You can query multiple entities at a time, with a limit of 50, as follows:

    https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16|Q17|Q20|Q27|Q28|Q29|Q30|Q31|Q32|Q33|Q34|Q35|Q36|Q37|Q38|Q39|Q40|Q41|Q43|Q45|Q77|Q79|Q96|Q114&format=json&props=labels|claims&languages=en|fr

  • From each country's data, you can look for entities registered as administrative subdivisions (P150) and repeat on those new entities.

  • Alternatively, you can get the whole tree of administrative subdivisions with the tree command. For instance, for France (Q142) that would be http://wdq.wmflabs.org/api?q=tree[142][150] Tadaaa, 36994 items! But that's much harder to refine, given the different kinds of subdivisions you can encounter from one country to another. Also, avoid running this kind of query from a browser: it might crash.

  • You now just have to find cities by country by refining this last query with the claim command and the appropriate subclass (P279) of municipality (Q15284) (all available here). For France, that's commune (Q484170), so your request looks like

    http://wdq.wmflabs.org/api?q=tree[142][150] AND claim[31:484170]

    then repeat for all the countries: have fun!
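The "fetch entity data in batches of 50, then follow P150" part of this workflow could be sketched in Python roughly as follows. It only relies on the wbgetentities parameters shown in the URLs above; the helper names and the User-Agent string are invented for this example, and the code assumes the numeric ids returned by the claim query are already available as a list of integers.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"
MAX_IDS_PER_REQUEST = 50  # limit mentioned above for wbgetentities


def to_qids(numeric_ids):
    """Turn numeric ids such as 16 into entity ids such as 'Q16'."""
    return [f"Q{n}" for n in numeric_ids]


def chunked(items, size=MAX_IDS_PER_REQUEST):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_url(qids, languages=("en", "fr")):
    """Build a wbgetentities URL asking for labels and claims only."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(qids),
        "format": "json",
        "props": "labels|claims",
        "languages": "|".join(languages),
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def subdivision_ids(entity):
    """Return the entity ids listed under P150 in one entity's claims."""
    ids = []
    for claim in entity.get("claims", {}).get("P150", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":  # skip 'novalue' / 'somevalue'
            ids.append(snak["datavalue"]["value"]["id"])
    return ids


def fetch_entities(numeric_ids, user_agent="example-geo-script/0.1"):
    """Fetch entity data over the network, 50 ids per request."""
    entities = {}
    for batch in chunked(to_qids(numeric_ids)):
        req = urllib.request.Request(build_url(batch),
                                     headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req, timeout=60) as resp:
            entities.update(json.load(resp).get("entities", {}))
    return entities
```

For example, `fetch_entities([16, 142])` would request Canada and France in one call, and `subdivision_ids(entities["Q142"])` would list the ids to visit next. Walking the full tree this way means many requests, so some pause between batches is probably wise.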

maxlath answered Oct 09 '22 20:10



You should go with Wikidata and/or DBpedia.

Personally, I'd start with Wikidata, as it runs directly on MediaWiki with the same API, so you can use similar code. I would use pywikibot to get started. That way you can still request pages from Wikipedia where that makes sense (e.g. list pages or categories).

Here's a nice overview of ways to access Wikidata

kqw answered Oct 09 '22 18:10
