
python querying wikipedia performance

I need to query Wikipedia for just one very particular purpose: to get the text for a given URL. To be a little more precise:

I have about 14,000 Wikipedia URLs from the English corpus and I need to get the text, or at least the introduction, of each of these URLs. My further processing will be in Python, so this would be the language of choice.

I am searching for the method with the best performance and came up with 4 different approaches:

  1. get the xml dump and parse directly via python
    -> further question here would be: how to query the xml file, knowing the url?
  2. get the xml, set up the database and query sql with python
    -> further question here would be: how to query the sql, knowing the url?
  3. use the wikipedia api and query it directly via python
  4. Just crawl these Wikipedia pages (which is maybe kind of sneaky and also annoying, because it's HTML and not plain text)

Which method should I use, i.e. which method has the best performance and is somewhat standard?

Milla Well asked Feb 17 '23


1 Answer

Some thoughts:

I have about 14.000 wikipedia urls of the english corpus and I need to get the text, or at least the introduction of each of these urls.

1 - get the xml dump and parse directly via python

There are currently 4,140,640 articles in the English Wikipedia. You're interested in 14,000 articles, or about one third of one percent of the total. That's too sparse for downloading and parsing the full dump to be the best approach.

2 - get the xml, set up the database and query sql with python

Do you expect the set of articles you're interested in to grow or change? If you need to respond rapidly to changes in your set of articles, a local database may be useful. But you'll have to keep it up to date. It's simpler to get the live data using the API, if that's fast enough.

4 - Just crawl these Wikipedia pages (which is maybe kind of sneaky and also annoying, because it's HTML and not plain text)

If you can get what you need out of the API, that will be better than scraping the Wikipedia site.

3 - use the wikipedia api and query it directly via python

Based on the low percentage of articles that you're interested in, 0.338%, this is probably the best approach.

Be sure to check out the MediaWiki API documentation and API Reference. There's also the python-wikitools module.

I need to get the text, or at least the introduction

If you really only need the intro, that will save a lot of traffic and really makes using the API the best choice, by far.

There are a variety of ways to retrieve the introduction, here's one good way:

http://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro&format=xml&titles=Python_(programming_language)
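Since the processing will be in Python anyway, that request can be assembled with the standard library alone. A minimal sketch, using only the parameters from the URL above (`intro_query_url` is a hypothetical helper name):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "http://en.wikipedia.org/w/api.php"

def intro_query_url(title):
    """Build an extracts query URL for a single article title."""
    params = urlencode({
        "action": "query",
        "prop": "extracts",
        "exintro": "",   # flag parameter: present, with no value
        "format": "xml",
        "titles": title,
    })
    return API_URL + "?" + params

url = intro_query_url("Python_(programming_language)")
# xml_text = urlopen(url).read()  # uncomment to actually fetch the intro
```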

If you have many requests to process at a time, you can batch them in groups of up to 20 articles:

http://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro&exlimit=20&format=xml&titles=Python_(programming_language)|History_of_Python|Guido_van_Rossum

This way you can retrieve your 14,000 article introductions in 700 round trips.
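The chunking behind that arithmetic is a few lines of Python. A sketch, where `titles` is a placeholder list standing in for the article titles behind your 14,000 URLs:

```python
from urllib.parse import urlencode

API_URL = "http://en.wikipedia.org/w/api.php"

def batched(titles, size=20):
    """Split titles into chunks of at most `size` (the exlimit cap)."""
    for i in range(0, len(titles), size):
        yield titles[i:i + size]

def batch_query_url(batch):
    """One extracts request covering a whole batch of titles."""
    params = urlencode({
        "action": "query",
        "prop": "extracts",
        "exintro": "",
        "exlimit": len(batch),
        "format": "xml",
        "titles": "|".join(batch),  # urlencode percent-escapes the pipes
    })
    return API_URL + "?" + params

titles = ["Article_%d" % n for n in range(14000)]  # placeholder titles
batches = list(batched(titles))  # 700 batches of 20 titles each
```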

Note: The API reference exlimit documentation states:

No more than 20 (20 for bots) allowed

Also note: The API documentation section on Etiquette and usage limits says:

If you make your requests in series rather than in parallel (i.e. wait for one request to finish before sending a new request, such that you're never making more than one request at the same time), then you should definitely be fine. Also try to combine things into one request where you can (e.g. use multiple titles in a titles parameter instead of making a new request for each title).

Wikipedia is constantly updated. If you ever need to refresh your data, tracking revision IDs and timestamps will enable you to identify which of your local articles are stale. You can retrieve revision information (along with the intro, here with multiple articles) using (for example):

http://en.wikipedia.org/w/api.php?action=query&prop=revisions|extracts&exintro&exlimit=20&rvprop=ids|timestamp&format=xml&titles=Python_(programming_language)|History_of_Python|Guido_van_Rossum
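Pulling the revision IDs and timestamps out of that response is straightforward with the standard library. A sketch, run here against a small hand-written sample shaped like a `prop=revisions` response (the live API returns more attributes, plus the extract text):

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample; attribute values are illustrative only.
SAMPLE = """<api><query><pages>
  <page pageid="23862" title="Python (programming language)">
    <revisions><rev revid="537874256" timestamp="2013-02-13T10:09:59Z"/></revisions>
  </page>
  <page pageid="23231" title="Guido van Rossum">
    <revisions><rev revid="536938673" timestamp="2013-02-07T22:51:14Z"/></revisions>
  </page>
</pages></query></api>"""

def revision_info(xml_text):
    """Map each page title to its latest (revid, timestamp) pair."""
    info = {}
    for page in ET.fromstring(xml_text).iter("page"):
        rev = page.find("./revisions/rev")
        info[page.get("title")] = (rev.get("revid"), rev.get("timestamp"))
    return info

info = revision_info(SAMPLE)
```

Comparing the stored `(revid, timestamp)` pair against a fresh query tells you which local articles are stale.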

jimhark answered Feb 20 '23