
How to list Wikipedia page titles with links using JSON?

This is my current code. It lists out page titles perfectly, but the links all return 'undefined'.

function func(json) {
  var e = document.getElementById('wiki');
  var i;
  for (i=0; i < json.query.allpages.length; i++) {
    e.innerHTML += i + ": " + '<a href="' + "http://en.wikipedia.org/wiki/" +  json.query.link+ '">' +  json.query.allpages[i].title + '</a>' + "<br />";
  }
}

function getFromWikipedia() {
  var txt = document.getElementById('txt');
  var e = document.getElementById('wiki');
  var o = document.createElement("script");
      o.setAttribute("src", "http://en.wikipedia.org/w/api.php?action=query&list=allpages&format=json&apfrom="+txt.value+"&generator=alllinks&callback=func");
  e.appendChild(o);
}

Appending "&prop=links" and/or "&generator=alllinks" to the URL doesn't seem to affect the result.

I would like to know what I should include in this portion:

'<a href="' + json.query.link+ '">'

in order to list the page titles with their respective links. I have tried "json.query.allpages[i].pageID" and "json.query.alllinks", but neither has worked.

Edit: I gave up on finding the URL and used the pageid method instead.

Solved it with this:

e.innerHTML += i + ": " + '<a href="'+ "http://en.wikipedia.org/wiki/?curid="+  json.query.allpages[i].pageid + '">' +  json.query.allpages[i].title + '</a>' + "<br />";
asked by isopach

2 Answers

You can create the link directly using the pageid:

function func(json) {
  var e = document.getElementById('wiki');
  var i;
  for (i=0; i < json.query.allpages.length; i++) {
    e.innerHTML += i + ": " + '<a href="' + "http://en.wikipedia.org/?curid=" +  json.query.allpages[i].pageid+ '">' +  json.query.allpages[i].title + '</a>' + "<br />";
  }
}
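
For example (with a made-up title and page ID, just to show the shape of the output), each iteration would append something like:

0: <a href="http://en.wikipedia.org/?curid=12345">Some article</a><br />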
answered by schudel


The fact that you have both list= and generator= in the same query suggests to me that you don't fully understand how generators work in the MediaWiki API.

Basically, a generator is a way to use a list as the source of pages to retrieve properties for. It does not make any sense to use a generator as the input to another list query. That is, you'd normally use generator= with prop=, not with list=. The only reason MediaWiki (seemingly) allows that at all is because:

  1. You can make a query with a page list (or a generator) but no prop= parameter (an illustrative query is shown below). If you do, you'll just get a minimal default set of properties (title, namespace and page ID) for the pages.

  2. You can also combine a properties query and a list query into a single request (also illustrated below). You'll just get the results for both queries, merged into the same JSON/XML/etc. output, but they'll be otherwise completely separate. (You can also make multiple simultaneous list queries that way.)

Thus, when you combine a generator= with a list= query, you'll get both the usual output for the list and a minimal set of properties for the pages matched by the generator. The two outputs will not be connected in any real way, except for being part of the same API response.
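
To make the two cases above concrete, here are two illustrative queries (the parameter values are examples of mine, not anything from the question):

  • https://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=5&format=json (a generator with no prop= parameter; each returned page only carries the default title, namespace and page ID)

  • https://en.wikipedia.org/w/api.php?action=query&prop=info&titles=Main%20Page&list=allpages&aplimit=5&format=json (a properties query and a list query in one request; the info result for Main Page and the allpages list come back in the same response but are otherwise unrelated)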


Anyway, you wanted to know how to obtain the titles and URLs of all Wikipedia pages with links. Well, as schudel notes in their answer, to get the URLs for some pages you need prop=info with inprop=url; to run this query on all linked pages, you can use generator=alllinks. Thus, you end up with:

  • https://en.wikipedia.org/w/api.php?action=query&prop=info&inprop=url&generator=alllinks

Note that this gives information about all pages that have links from them. To run the query on all pages with links to them, you need to add the parameter galunique=true:

  • https://en.wikipedia.org/w/api.php?action=query&prop=info&inprop=url&generator=alllinks&galunique=true

(Yes, this is documented, although not as clearly as it perhaps could be.)

Obviously, the link targets will include a lot of missing pages. The fact that the link sources seemingly also include a missing page with an empty title is presumably due to a faulty record in Wikipedia's link database. This could be fixed by rebuilding the (redundant) links table, but, given Wikipedia's size, this would take quite a bit of time (during which, presumably, the site would have to be locked into read-only mode to avoid further inconsistencies).
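
To give a rough idea of the shape of the data (this is an abbreviated, made-up response, not actual output), the pages come back keyed by page ID, with missing link targets getting a negative key and a missing flag, and existing pages getting fullurl from inprop=url:

{
  "continue": { "gacontinue": "...", "continue": "..." },
  "query": {
    "pages": {
      "-1": { "ns": 0, "title": "Some red link", "missing": "" },
      "12345": {
        "pageid": 12345,
        "ns": 0,
        "title": "Some article",
        "fullurl": "https://en.wikipedia.org/wiki/Some_article"
      }
    }
  }
}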


To process this data in JavaScript, you could do something like this:

var apiURL = 'https://en.wikipedia.org/w/api.php?format=json&action=query&prop=info&inprop=url&generator=alllinks&callback=myCallback';

function myCallback(json) {
  var e = document.getElementById('wiki');
  for (var id in json.query.pages) {
    var page = json.query.pages[id];
    if (typeof(page.missing) !== 'undefined') continue;
    e.innerHTML += 
      id + ': <a href="' + escapeHTML(page.fullurl) + '">' + escapeHTML(page.title) + '</a><br />';
  }
  // handle query continuations:
  if (json.continue) {
    var continueURL = apiURL;
    for (var attr in json.continue) {
      continueURL += '&' + attr + '=' + encodeURIComponent(json.continue[attr]);
    }
    doAjaxRequest(continueURL);
  }
}

// the empty continue= parameter opts in to the newer continuation format used above
doAjaxRequest(apiURL + '&continue=');

Note that I've also included a basic mechanism for handling query continuations, since you'll surely need to handle those when using alllinks. Implementing the helper functions escapeHTML() and doAjaxRequest() is left as an exercise. Also note that I haven't actually tested this code; I think it's OK, but there might be bugs that I've missed. It will also produce a ridiculously long list, and probably slow your browser to a crawl, simply because Wikipedia has a lot of pages. For a real application, you'd probably want to introduce some kind of an on-demand loading scheme (e.g. only loading more results when the user scrolls down to the end of the current list).
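
If it helps, here is one way those two helpers might be filled in (a rough sketch of my own, assuming the cross-origin request is done via JSONP script injection, matching the callback=myCallback parameter in the URL above):

// Escape HTML special characters so page titles can't inject markup.
function escapeHTML(str) {
  return String(str)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Minimal JSONP "request": append a script tag so that the API response,
// wrapped in myCallback(...), runs when the script loads.
function doAjaxRequest(url) {
  var script = document.createElement('script');
  script.src = url;
  script.onload = function () { document.body.removeChild(script); };
  document.body.appendChild(script);
}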

answered by Ilmari Karonen