
Is there any dump for Wikipedia pageid to Wikidata id mapping?

This page: http://wikidata.dbpedia.org/downloads/20160111/ has a dump called wikidatawiki-20160111-page-ids.ttl.bz2, which maps each Wikidata id to what they call a wikipage id. The wikipage id seems to be different from the Wikipedia pageid, though.

e.g. for Germany:

  • Wikipedia pageid = 11867
  • Wikidata id = Q183 and wikipage id = 322.

So basically this dump maps Q183 to 322, while I need to map Q183 to 11867.

For reference: in https://en.wikipedia.org/w/index.php?title=Germany&curid=11867 the curid in the URL is the Wikipedia page id.

Is there an equivalent dump file out there that has both the Wikidata ids and the Wikipedia pageids? (I don't want to use an API and loop over my Wikipedia page ids one by one, as with this call: https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&format=xml&pageids=11867)

Edit: I'm not sure what exactly the wikipage id is, but maybe there is a wikipageId to Wikipedia pageid mapping file that could be used on top of the dump I mentioned in the question.

user3700389 asked Feb 07 '23 13:02



2 Answers

I created a Python package and command line tool called wikimapper to deal with this issue. It can be installed via pip install wikimapper. It uses the Wikipedia SQL dumps to create an index that can then be used to map many ids very quickly (much faster than the Wikidata SPARQL endpoint). You can either use one of my precomputed indices and query the sqlite3 database directly, or use the package to map Wikipedia page titles/Wikipedia URLs to Wikidata IDs and vice versa. Using page names or URLs instead of internal Wikipedia IDs should be more comfortable.
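As a rough illustration of querying such a sqlite3 index directly by page id, here is a minimal sketch. The table name mapping and the columns wikipedia_id, wikipedia_title and wikidata_id are assumptions for this example, so check the real file (for instance with .schema in the sqlite3 shell) before relying on them. The demo index is built in memory and only holds the two pages mentioned on this page.

```python
import sqlite3


def build_demo_index() -> sqlite3.Connection:
    """Build an in-memory stand-in for a precomputed index.

    The schema below is an assumption made for this sketch, not a
    guaranteed description of any real index file.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mapping ("
        "wikipedia_id INTEGER PRIMARY KEY, "
        "wikipedia_title TEXT, "
        "wikidata_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO mapping VALUES (?, ?, ?)",
        [(11867, "Germany", "Q183"), (736, "Albert_Einstein", "Q937")],
    )
    return conn


def page_ids_to_wikidata(conn: sqlite3.Connection, page_ids) -> dict:
    """Map Wikipedia page ids to Wikidata ids with a single query.

    Page ids that are not in the index are simply absent from the result.
    """
    page_ids = list(page_ids)
    if not page_ids:
        return {}
    # One "?" placeholder per id keeps the query parameterized.
    placeholders = ",".join("?" * len(page_ids))
    rows = conn.execute(
        "SELECT wikipedia_id, wikidata_id FROM mapping "
        f"WHERE wikipedia_id IN ({placeholders})",
        page_ids,
    )
    return dict(rows)


if __name__ == "__main__":
    index = build_demo_index()
    print(page_ids_to_wikidata(index, [11867, 736]))
```

With a real index file, sqlite3.connect would take the path to that file instead of ":memory:". For very long id lists it is probably safer to query in chunks of a few hundred ids, since SQLite limits the number of placeholders per statement.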

jcklie answered Mar 10 '23 17:03



If you are willing to consider an API call solution instead of using the dump plus format adjustment, you could use the pageprops property of the query action.

For instance, to find the Wikidata item for Albert Einstein given the Wikipedia page title, you'd request:

 https://en.wikipedia.org/w/api.php?action=query&format=json&prop=pageprops&titles=Albert%20Einstein

Which gives:

 {
   "batchcomplete": "",
   "query": {
     "pages": {
       "736": {
         "pageid": 736,
         "ns": 0,
         "title": "Albert Einstein",
         "pageprops": {
           "defaultsort": "Einstein, Albert",
           "page_image": "Einstein_1921_by_F_Schmutzer_-_restoration.jpg",
           "wikibase-badge-Q17437798": "1",
           "wikibase_item": "Q937"
         }
       }
     }
   }
 }

This way, the Wikidata item id can be read from the wikibase_item field.
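Since the question starts from page ids rather than titles, the same query can take pageids instead of titles, and several ids can be joined with | in one request (up to 50 per request for regular clients), which avoids looping one by one. The following is a minimal sketch of that idea; the helper names build_query_url and extract_wikidata_ids are made up for this example, and the actual HTTP fetch is left out so the snippet stays offline.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(page_ids) -> str:
    """Build one pageprops query URL for a batch of Wikipedia page ids."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "pageprops",
        # Only ask for the property we need.
        "ppprop": "wikibase_item",
        # Multiple values are separated by "|" in the MediaWiki API.
        "pageids": "|".join(str(p) for p in page_ids),
    }
    return f"{API_URL}?{urlencode(params)}"


def extract_wikidata_ids(response: dict) -> dict:
    """Pull {pageid: wikibase_item} out of a decoded JSON response.

    Pages without a wikibase_item (missing pages, for example) are skipped.
    """
    pages = response.get("query", {}).get("pages", {})
    return {
        page["pageid"]: page["pageprops"]["wikibase_item"]
        for page in pages.values()
        if "wikibase_item" in page.get("pageprops", {})
    }


if __name__ == "__main__":
    print(build_query_url([11867, 736]))
    # Trimmed copy of the response shown above.
    sample = {
        "query": {
            "pages": {
                "736": {
                    "pageid": 736,
                    "title": "Albert Einstein",
                    "pageprops": {"wikibase_item": "Q937"},
                }
            }
        }
    }
    print(extract_wikidata_ids(sample))
```

The URL can then be fetched with any HTTP client (urllib.request.urlopen, for example) and the decoded JSON passed to extract_wikidata_ids.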

(This is as originally answered by Dmitry Brant in the Mediawiki-api mailing list)

Potentially this is a better solution because:

  1. You only search for the items you need instead of having to search through the whole dump
  2. You can get the answer in JSON or XML directly

atineoSE answered Mar 10 '23 17:03
