I am trying to use the Wikipedia API to get all links on all pages. Currently I'm using
https://en.wikipedia.org/w/api.php?format=json&action=query&generator=alllinks&prop=links&pllimit=max&plnamespace=0
but this does not seem to start at the first article and end at the last. How can I get this to generate all pages and all their links?
The English Wikipedia has approximately 1.05 billion internal links. Since the list=alllinks module returns at most 500 links per request, fetching them all would take over two million API calls, so getting the full link graph from the API is not realistic.
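That said, part of the problem in your query is the generator: generator=alllinks enumerates link *targets*, not pages, so it won't walk articles from first to last. For completeness, here is a rough sketch of what correct pagination looks like using generator=allpages with prop=links, following the API's continue tokens (standard MediaWiki API parameters; still far too slow for a full crawl):

<!-- language: python -->

    import requests

    # Sketch: enumerate every article and its links, following
    # "continue" tokens. At ~500 links per request this would take
    # millions of calls for all of English Wikipedia.
    S = requests.Session()
    URL = "https://en.wikipedia.org/w/api.php"
    params = {
        "format": "json",
        "action": "query",
        "generator": "allpages",  # walk pages in order, not link targets
        "gapnamespace": 0,
        "gaplimit": "max",
        "prop": "links",
        "pllimit": "max",
        "plnamespace": 0,
    }
    while True:
        data = S.get(URL, params=params).json()
        for page in data.get("query", {}).get("pages", {}).values():
            for link in page.get("links", []):
                print(page["title"], "->", link["title"])
        if "continue" not in data:
            break
        params.update(data["continue"])  # resume where the last batch ended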
Instead, you can download Wikipedia's database dumps and use those. Specifically, you want the pagelinks dump, which contains the links themselves, and most likely also the page dump, which maps page IDs to page titles.
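As a starting point, here is a minimal sketch of building a page-ID-to-title map by streaming the page dump (the filename enwiki-latest-page.sql.gz and the assumption that each row begins with (page_id, page_namespace, page_title, ...) match the current dump layout, but verify against the CREATE TABLE statement in your copy of the dump):

<!-- language: python -->

    import gzip
    import re

    # Matches the first three columns of each row in the INSERT
    # statements: (page_id, page_namespace, 'page_title', ...)
    ROW = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'")

    id_to_title = {}
    with gzip.open("enwiki-latest-page.sql.gz", "rt",
                   encoding="utf-8", errors="replace") as f:
        for line in f:
            if not line.startswith("INSERT INTO"):
                continue
            for page_id, ns, title in ROW.findall(line):
                if ns == "0":  # main (article) namespace only
                    id_to_title[int(page_id)] = title.replace("\\'", "'")
    print(len(id_to_title), "article titles loaded")

The pagelinks dump can be processed the same way; join its rows against this map to turn the raw link table into a usable title-to-title graph.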
I know this is an old question, but in case anyone else is searching and finds this, I highly recommend looking at Wikicrush to extract the link graph for all of Wikipedia. It produces a relatively compact representation that can be used to very quickly traverse links.