 

How to obtain a list of titles of all Wikipedia articles

I'd like to obtain a list of the titles of all Wikipedia articles. I know there are two possible ways to get content from a Wikimedia-powered wiki: one is the API, and the other is a database dump.

I'd prefer not to download the wiki dump: first, it's huge, and second, I'm not really experienced with querying databases. The problem with the API, on the other hand, is that I couldn't figure out a way to retrieve only the article titles, and even if I could, it would take more than 4 million requests, which would probably get me blocked from making any further requests anyway.

So my questions are:

  1. Is there a way to obtain only the titles of Wikipedia articles via the API?
  2. Is there a way to combine multiple requests/queries into one? Or do I actually have to download a Wikipedia dump?
asked Jun 29 '14 by Flavio



2 Answers

The allpages API module allows you to do just that. Its limit (when you set aplimit=max) is 500, so to query all 4.5M articles, you would need about 9000 requests.
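For reference, a minimal sketch of such a title-harvesting loop, written against the standard API with the requests library (the endpoint and parameter names come from the allpages module; the output file name is just an example), could look like this:

# Sketch: page through list=allpages and write every title to a file.
# Assumes the `requests` package is installed; 'titles.txt' is arbitrary.
import requests

API = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'list': 'allpages',
    'apnamespace': 0,     # main (article) namespace only
    'aplimit': 'max',     # 500 titles per request for normal users
    'format': 'json',
}

with open('titles.txt', 'w', encoding='utf-8') as out:
    while True:
        data = requests.get(API, params=params).json()
        for page in data['query']['allpages']:
            out.write(page['title'] + '\n')
        if 'continue' not in data:
            break                                   # no more batches
        params['apcontinue'] = data['continue']['apcontinue']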

But a dump is a better choice, because there are many different dumps, including all-titles-in-ns0 which, as its name suggests, contains exactly what you want (59 MB of gzipped text).
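If you do go with the dump, you don't need any database skills for this particular file: it is just a gzipped list of titles, one per line. A rough sketch of fetching and reading it (the URL follows the usual dumps.wikimedia.org naming scheme and should be double-checked against the current dump listing):

# Sketch: download the all-titles-in-ns0 dump and stream/count the titles.
# The exact file name/URL is an assumption; verify it on dumps.wikimedia.org.
import gzip
import urllib.request

URL = 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz'
urllib.request.urlretrieve(URL, 'all-titles-in-ns0.gz')

count = 0
with gzip.open('all-titles-in-ns0.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        title = line.rstrip('\n')   # one article title per line
        count += 1

print(count, 'titles')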

answered by svick


Right now, as per the current statistics, the number of articles is around 5.8M. To get the list of pages I used the AllPages API, restricting myself to namespace 0. However, the number of pages I get is around 14.5M, which is roughly three times what I was expecting. The following is the sample code I am using:

# get the list of all wikipedia pages (articles) -- English
# (Python 2, using the simplemediawiki package)
from simplemediawiki import MediaWiki

listOfPagesFile = open("wikiListOfArticles_nonredirects.txt", "w")

wiki = MediaWiki('https://en.wikipedia.org/w/api.php')

requestObj = {}
requestObj['action'] = 'query'
requestObj['list'] = 'allpages'
requestObj['aplimit'] = 'max'
requestObj['apnamespace'] = '0'

# first batch
pagelist = wiki.call(requestObj)
pagesInQuery = pagelist['query']['allpages']

for eachPage in pagesInQuery:
    pageId = eachPage['pageid']
    title = eachPage['title'].encode('utf-8')
    writestr = str(pageId) + "; " + title + "\n"
    listOfPagesFile.write(writestr)

numQueries = 1

# keep requesting batches while the API returns a continuation token
while 'continue' in pagelist:
    requestObj['apcontinue'] = pagelist['continue']['apcontinue']
    pagelist = wiki.call(requestObj)

    pagesInQuery = pagelist['query']['allpages']

    for eachPage in pagesInQuery:
        pageId = eachPage['pageid']
        title = eachPage['title'].encode('utf-8')
        writestr = str(pageId) + "; " + title + "\n"
        listOfPagesFile.write(writestr)

    numQueries += 1

    if numQueries % 100 == 0:
        print "Done with queries -- ", numQueries

listOfPagesFile.close()

The number of queries fired is around 28,900, which results in approximately 14.5M page names.

I also tried the all-titles dump mentioned in the answer above. In that case as well I get around 14.5M pages.

I thought this overestimate relative to the actual number of pages was caused by redirects, so I added the 'nonredirects' option to the request object:

requestObj['apfilterredir'] = 'nonredirects' 

After doing that I get only 112,340 pages, which is far too small compared to 5.8M.

With the above code I was expecting roughly 5.8M pages, but that doesn't seem to be the case.
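As a sanity check on that expected number, the official article count can be read straight from the API's siteinfo statistics and compared with what the allpages loop returns (a minimal sketch; the requests package is assumed installed, and the module and field names are those of the standard MediaWiki API):

# Sketch: fetch the site statistics to see how many articles the API reports.
import requests

API = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'meta': 'siteinfo',
    'siprop': 'statistics',
    'format': 'json',
}
stats = requests.get(API, params=params).json()['query']['statistics']
print(stats['articles'], 'articles out of', stats['pages'], 'total pages')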

Is there any other option that I should be trying to get the actual (~5.8M) set of page names?

answered by jayesh