I need to query wikipedia for just one very particular purpose, that is, to get the text for a given url. To be a little more precise:
I have about 14.000 wikipedia urls of the english corpus and I need to get the text, or at least the introduction of each of these urls. My further processing will be in python, so this would be the language of choice.
I am searching for the method with the best performance and came up with 4 different approaches:
1 - get the xml dump and parse directly via python
2 - get the xml, set up the database and query sql with python
3 - use the wikipedia api and query it directly via python
4 - Just crawl these wikipedia pages (which is maybe kind of sneaky and as well annoying because it's html and not plain text)
Which method should I use, i.e. which method has the best performance and is somehow standard?
Some thoughts:
I have about 14.000 wikipedia urls of the english corpus and I need to get the text, or at least the introduction of each of these urls.
1 - get the xml dump and parse directly via python
There are currently 4,140,640 articles in the English Wikipedia. You're interested in 14,000 articles, or about one third of a percent of the total. That sounds too sparse for dumping all the articles to be the best approach.
2 - get the xml, set up the database and query sql with python
Do you expect the set of articles you're interested in to grow or change? If you need to respond rapidly to changes in your set of articles, a local database may be useful. But you'll have to keep it up to date. It's simpler to get the live data using the API, if that's fast enough.
4 - Just crawl these wikipedia pages (which is maybe kind of sneaky and as well annoying because it's html and not plain text)
If you can get what you need out of the API, that will be better than scraping the Wikipedia site.
3 - use the wikipedia api and query it directly via python
Based on the low percentage of articles that you're interested in, 0.338%, this is probably the best approach.
Be sure to check out the MediaWiki API documentation and the API reference. There's also the python-wikitools module.
I need to get the text, or at least the introduction
If you really only need the intro, that will save a lot of traffic and really makes using the API the best choice, by far.
There are a variety of ways to retrieve the introduction; here's one good way:
http://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro&format=xml&titles=Python_(programming_language)
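In Python, a single request like this could look roughly as follows. This is just a sketch, assuming the requests library is available; it uses format=json and the explaintext parameter of the extracts module only because plain-text JSON is convenient for further processing in Python:

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_intro(title):
    """Return the introduction (text before the first heading) of one article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,        # only the part before the first section heading
        "explaintext": 1,    # plain text instead of HTML
        "format": "json",
        "titles": title,
    }
    response = requests.get(API_URL, params=params)
    response.raise_for_status()
    pages = response.json()["query"]["pages"]
    # the result is keyed by page id; with a single title there is exactly one entry
    return next(iter(pages.values())).get("extract", "")

print(fetch_intro("Python (programming language)")[:300])
```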
If you have many requests to process at a time, you can batch them in groups of up to 20 articles:
http://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro&exlimit=20&format=xml&titles=Python_(programming_language)|History_of_Python|Guido_van_Rossum
This way you can retrieve your 14,000 article introductions in 700 round trips.
Note: The API reference documentation for exlimit states: "No more than 20 (20 for bots) allowed."
Also note: The API documentation section on Etiquette and usage limits says:
If you make your requests in series rather than in parallel (i.e. wait for the one request to finish before sending a new request, such that you're never making more than one request at the same time), then you should definitely be fine. Also try to combine things into one request where you can (e.g. use multiple titles in a titles parameter instead of making a new request for each title).
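Putting those two recommendations together, a batched, serial retrieval might look something like this in Python. Again a sketch assuming the requests library; titles are sent 20 per request and requests are made one at a time:

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_intros(titles, batch_size=20):
    """Fetch introductions for many articles, up to 20 titles per request, serially."""
    intros = {}
    for i in range(0, len(titles), batch_size):
        batch = titles[i:i + batch_size]
        params = {
            "action": "query",
            "prop": "extracts",
            "exintro": 1,
            "explaintext": 1,
            "exlimit": len(batch),       # at most 20, per the limit quoted above
            "format": "json",
            "titles": "|".join(batch),   # multiple titles combined into one request
        }
        response = requests.get(API_URL, params=params)
        response.raise_for_status()
        for page in response.json()["query"]["pages"].values():
            intros[page["title"]] = page.get("extract", "")
    return intros

intros = fetch_intros(["Python (programming language)",
                       "History of Python",
                       "Guido van Rossum"])
```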
Wikipedia is constantly updated. If you ever need to refresh your data, tracking revision IDs and timestamps will enable you to identify which of your local articles are stale. You can retrieve revision information (along with the intro, here with multiple articles) using (for example):
http://en.wikipedia.org/w/api.php?action=query&prop=revisions|extracts&exintro&exlimit=20&rvprop=ids|timestamp&format=xml&titles=Python_(programming_language)|History_of_Python|Guido_van_Rossum
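A sketch of how that revision information could be stored alongside the intro, so that stale local copies can be detected later by comparing revision ids (the helper name is just for illustration, and requests is again assumed):

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_intros_with_revisions(titles):
    """Fetch intro, latest revision id, and timestamp for up to 20 titles."""
    params = {
        "action": "query",
        "prop": "revisions|extracts",
        "rvprop": "ids|timestamp",
        "exintro": 1,
        "exlimit": len(titles),
        "format": "json",
        "titles": "|".join(titles),
    }
    response = requests.get(API_URL, params=params)
    response.raise_for_status()
    articles = {}
    for page in response.json()["query"]["pages"].values():
        if "revisions" not in page:   # skip titles that do not exist
            continue
        rev = page["revisions"][0]    # only the latest revision is returned per page
        articles[page["title"]] = {
            "extract": page.get("extract", ""),
            "revid": rev["revid"],
            "timestamp": rev["timestamp"],
        }
    return articles
```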