 

Making a tree of Wikipedia links

I am trying to use the Wikipedia API to get all links on all pages. Currently I'm using

https://en.wikipedia.org/w/api.php?format=json&action=query&generator=alllinks&prop=links&pllimit=max&plnamespace=0

but this does not seem to start at the first article and end at the last. How can I get this to generate all pages and all their links?
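In code, this is roughly what I'm running (a sketch that assumes the requests library; it merges the API's returned continue object into the next request to fetch successive batches):

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    params = {
        "format": "json",
        "action": "query",
        "generator": "alllinks",
        "prop": "links",
        "pllimit": "max",
        "plnamespace": 0,
    }

    while True:
        data = requests.get(API, params=params).json()
        # Each batch is a dict of pages keyed by page ID; pages with no
        # outgoing links in this slice simply have no "links" entry.
        for page in data.get("query", {}).get("pages", {}).values():
            for link in page.get("links", []):
                print(page.get("title"), "->", link["title"])
        if "continue" not in data:
            break
        # Merge the continuation parameters into the next request.
        params.update(data["continue"])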

asked Sep 18 '25 by dangee1705

2 Answers

The English Wikipedia has approximately 1.05 billion internal links. Considering the list=alllinks module has a limit of 500 links per request, it's not realistic to get all links from the API.

Instead, you can download Wikipedia's database dumps and use those. Specifically, you want the pagelinks dump, which contains the links themselves, and very likely also the page dump, which maps page IDs to page titles.
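If you go the dump route, those files are just very large MySQL INSERT statements, so they can be scanned line by line without importing them into a database. Below is a rough sketch that builds a page-ID-to-title map from the page dump; the filename and the page_id, page_namespace, page_title column order are assumptions based on older dumps, so check the CREATE TABLE statement at the top of your copy (the pagelinks dump can be scanned the same way to get the links themselves).

    import gzip
    import re

    # Matches the start of each row tuple: (page_id,page_namespace,'page_title'...
    # Adjust if your dump's column order differs.
    ROW = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'")

    id_to_title = {}
    with gzip.open("enwiki-latest-page.sql.gz", "rt",
                   encoding="utf-8", errors="replace") as f:
        for line in f:
            if not line.startswith("INSERT INTO"):
                continue
            for page_id, ns, title in ROW.findall(line):
                if ns == "0":  # keep the article (main) namespace only
                    id_to_title[int(page_id)] = title.replace("_", " ")

    print(len(id_to_title), "articles mapped")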

answered Sep 21 '25 by svick


I know this is an old question, but in case anyone else is searching and finds this, I highly recommend looking at Wikicrush to extract the link graph for all of Wikipedia. It produces a relatively compact representation that can be used to very quickly traverse links.

answered Sep 21 '25 by jkraybill