Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Query Wikipedia pages with properties

I need to use Wikipedia API Query or any other api such as Opensearch to query for a simple list of pages with some properties.

Input: a list of page (article) titles or ids.
Output: a list of pages that contain the following properties each:
page id
title
snippet/description (like in opensearch api)
page url
image url (like in opensearch api)

A result similar to this:
http://en.wikipedia.org/w/api.php?action=opensearch&search=miles%20davis&limit=20&format=xml
Only with page ids and not for a search, but rather an exact list of pages by either titles or pageids.

This should be a fairly simple thing but I have been stuck with that for quite some time trying all kinds of URL combinations from the MW api manual, without success.

like image 973
yuvalr80 Avatar asked Nov 03 '22 18:11

yuvalr80


1 Answers

I dont't think there is another way than the Open Search API to fetch Open Search data, but depending on which Wikipedia you are interested in, there might be other extensions installed to help you. Taking English Wikipedia as an example, we can make use of the MobileFrontend and PageImages extensions, that happens to be installed there.

  • Title and url are available from the native MediaWiki API. To get the url, you can use prop=info, and specify with inprop=url that it is the url you are interested in.
  • Prominent images of a page is returned by prop=pageimages, thanks to PageImages.
  • MobileFrontend adds a property called extracts, that you can use with the directive exintro to get the first paragraph. Note however that MediWiki markup is complex, and result might not always be perfect. If we put it all together in one single query, it would be something like this:

http://en.wikipedia.org/w/api.php?action=query&pageids=21482&prop=pageimages|info|extracts&inprop=url&exintro

giving this:

<api>
  <query>
    <pages>
      <page pageid="21482" ns="0" title="Nairobi" pageimage="Nairobi_Montage.jpg" contentmodel="wikitext" pagelanguage="en" touched="2014-02-06T06:10:01Z" lastrevid="594161616" counter="" length="89157" fullurl="http://en.wikipedia.org/wiki/Nairobi" editurl="http://en.wikipedia.org/w/index.php?title=Nairobi&amp;action=edit">
        <thumbnail source="http://upload.wikimedia.org/wikipedia/commons/thumb/6/66/Nairobi_Montage.jpg/45px-Nairobi_Montage.jpg" width="45" height="50" />
        <extract xml:space="preserve">
             &lt;p&gt;&lt;b&gt;Nairobi&lt;/b&gt; /naɪˈroʊbi/ is the [...]
        </extract>
      </page>
    </pages>
  </query>
</api>
like image 194
leo Avatar answered Nov 15 '22 05:11

leo