 

How to obtain a list of titles of all Wikipedia articles

I'd like to obtain a list of the titles of all Wikipedia articles. I know there are two possible ways to get content from a Wikimedia-powered wiki: one is the API, and the other is a database dump.

I'd prefer not to download the wiki dump: first, it's huge, and second, I'm not really experienced with querying databases. The problem with the API, on the other hand, is that I couldn't figure out a way to retrieve only the article titles, and even if I could, it would take more than 4 million requests, which would probably get me blocked from making any further requests anyway.

So my questions are:

  1. Is there a way to obtain only the titles of Wikipedia articles via the API?
  2. Is there a way to combine multiple requests/queries into one? Or do I actually have to download a Wikipedia dump?
asked Jun 29 '14 by Flavio



2 Answers

The allpages API module allows you to do just that. Its limit (when you set aplimit=max) is 500, so to query all 4.5M articles, you would need about 9000 requests.
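For reference, a minimal sketch of such a title-harvesting loop, written against the standard API with the requests library (the endpoint and parameter names come from the allpages module; the output file name is just an example), could look like this:

# Sketch: page through list=allpages and write every title to a file.
# Assumes the `requests` package is installed; 'titles.txt' is arbitrary.
import requests

API = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'list': 'allpages',
    'apnamespace': 0,     # main (article) namespace only
    'aplimit': 'max',     # 500 titles per request for normal users
    'format': 'json',
}

with open('titles.txt', 'w', encoding='utf-8') as out:
    while True:
        data = requests.get(API, params=params).json()
        for page in data['query']['allpages']:
            out.write(page['title'] + '\n')
        if 'continue' not in data:
            break                                   # no more batches
        params['apcontinue'] = data['continue']['apcontinue']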

But a dump is a better choice, because there are many different dumps, including all-titles-in-ns0 which, as its name suggests, contains exactly what you want (59 MB of gzipped text).
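If you do go with the dump, you don't need any database skills for this particular file: it is just a gzipped list of titles, one per line. A rough sketch of fetching and reading it (the URL follows the usual dumps.wikimedia.org naming scheme and should be double-checked against the current dump listing):

# Sketch: download the all-titles-in-ns0 dump and stream/count the titles.
# The exact file name/URL is an assumption; verify it on dumps.wikimedia.org.
import gzip
import urllib.request

URL = 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz'
urllib.request.urlretrieve(URL, 'all-titles-in-ns0.gz')

count = 0
with gzip.open('all-titles-in-ns0.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        title = line.rstrip('\n')   # one article title per line
        count += 1

print(count, 'titles')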

answered by svick


Right now, as per the current statistics, the number of articles is around 5.8M. To get the list of pages I used the AllPages API, restricting myself to namespace 0. However, the number of pages I get is around 14.5M, which is roughly three times what I was expecting. The following is the sample code I am using:

# get the list of all wikipedia pages (articles) -- English
# (Python 2, using the simplemediawiki package)
from simplemediawiki import MediaWiki

listOfPagesFile = open("wikiListOfArticles_nonredirects.txt", "w")

wiki = MediaWiki('https://en.wikipedia.org/w/api.php')

requestObj = {}
requestObj['action'] = 'query'
requestObj['list'] = 'allpages'
requestObj['aplimit'] = 'max'
requestObj['apnamespace'] = '0'

# first batch
pagelist = wiki.call(requestObj)
pagesInQuery = pagelist['query']['allpages']

for eachPage in pagesInQuery:
    pageId = eachPage['pageid']
    title = eachPage['title'].encode('utf-8')
    writestr = str(pageId) + "; " + title + "\n"
    listOfPagesFile.write(writestr)

numQueries = 1

# keep requesting batches while the API returns a continuation token
while 'continue' in pagelist:
    requestObj['apcontinue'] = pagelist['continue']['apcontinue']
    pagelist = wiki.call(requestObj)

    pagesInQuery = pagelist['query']['allpages']

    for eachPage in pagesInQuery:
        pageId = eachPage['pageid']
        title = eachPage['title'].encode('utf-8')
        writestr = str(pageId) + "; " + title + "\n"
        listOfPagesFile.write(writestr)

    numQueries += 1

    if numQueries % 100 == 0:
        print "Done with queries -- ", numQueries

listOfPagesFile.close()

The number of queries fired is around 28,900, which results in approximately 14.5M page names.

I also tried the all-titles dump mentioned in the answer above. In that case as well I get around 14.5M pages.

I thought this overestimate relative to the actual number of pages was caused by redirects, so I added the 'nonredirects' option to the request object:

requestObj['apfilterredir'] = 'nonredirects' 

After doing that I get only 112,340 pages, which is far too small compared to 5.8M.

With the above code I was expecting roughly 5.8M pages, but that doesn't seem to be the case.
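As a sanity check on that expected number, the official article count can be read straight from the API's siteinfo statistics and compared with what the allpages loop returns (a minimal sketch; the requests package is assumed installed, and the module and field names are those of the standard MediaWiki API):

# Sketch: fetch the site statistics to see how many articles the API reports.
import requests

API = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'meta': 'siteinfo',
    'siprop': 'statistics',
    'format': 'json',
}
stats = requests.get(API, params=params).json()['query']['statistics']
print(stats['articles'], 'articles out of', stats['pages'], 'total pages')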

Is there any other option that I should be trying to get the actual (~5.8M) set of page names?

answered by jayesh