
How can I make the Wikipedia API normalize and redirect without knowing the exact case of all characters?

If I try to get the language links for a page on Wikipedia via their API like this:

http://en.wikipedia.org/w/api.php?action=query&prop=langlinks&format=json&lllimit=10&llurl=&titles=wreck-it%20Ralph&redirects=

I get a list of results.

But if I lowercase the R in Ralph like this:

http://en.wikipedia.org/w/api.php?action=query&prop=langlinks&format=json&lllimit=10&llurl=&titles=wreck-it%20ralph&redirects=

I get no results.

Looking at the returned information, it looks like Wikipedia normalizes "wreck-it Ralph" in the first example to "Wreck-it Ralph" which redirects to "Wreck-It Ralph".

In the second example, "wreck-it ralph" is normalized to "Wreck-it ralph" which doesn't redirect anywhere, apparently.

Searching for "wreck-it ralph" on http://wikipedia.org works, of course:

http://www.wikipedia.org/search-redirect.php?family=wikipedia&search=wreck-it+ralph&language=en

Can I make the langlinks query work the same way, helping me when I don't know the exact case of all the characters of the search term?

Update: From the answer by Sorawee, I managed to find out how to do a case-insensitive search: https://en.wikipedia.org/w/api.php?action=query&generator=search&format=json&gsrsearch=wreck-it%20ralph&gsrlimit=1&prop=info
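The search-generator approach can be combined with the original langlinks request. The following is a minimal sketch, assuming that swapping prop=info for prop=langlinks on the generator query returns the language links of the top search hit; the function name build_langlinks_search_url is made up for illustration, and the code only builds the URL without sending a request.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_langlinks_search_url(term, limit=10):
    """Build a query URL that searches for `term` and requests
    langlinks for the single best search hit (hypothetical helper)."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",   # feed search results into the query
        "gsrsearch": term,       # the case-insensitive search term
        "gsrlimit": 1,           # keep only the top hit
        "prop": "langlinks",     # assumption: replaces prop=info from the update
        "lllimit": limit,
    }
    return API_URL + "?" + urlencode(params)


print(build_langlinks_search_url("wreck-it ralph"))
```

Because the search step picks the nearest match, the top hit is not guaranteed to be the page you meant, so it is worth checking the returned title before trusting the links.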

Peter Jaric asked Jan 18 '14 23:01



1 Answer

In MediaWiki (as configured on English Wikipedia), the first letter of every title is capitalized automatically. Therefore, "wreck-it Ralph" and "Wreck-it Ralph" are the same page. Similarly, "wreck-it ralph" and "Wreck-it ralph" are the same page. Note that this capitalization applies only to the very first letter; the rest of the title is case-sensitive.
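As a rough illustration of that rule, here is a simplified model of the first-letter capitalization. It is only a sketch: normalize_title is a made-up name, and real MediaWiki normalization also deals with namespaces, whitespace, and other details that are ignored here.

```python
def normalize_title(title):
    """Simplified model of title normalization: underscores become
    spaces and only the first character is upper-cased."""
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]


for raw in ("wreck-it Ralph", "wreck-it ralph"):
    print(repr(raw), "->", repr(normalize_title(raw)))
```

The two inputs still differ after normalization (the "R" versus "r" in "Ralph"), which is why they end up as two different titles.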

MediaWiki also has pages called "redirect pages." A redirect page sends you from its own title to another, possibly completely different, page. For example, https://en.wikipedia.org/wiki/Template:cn will redirect you to https://en.wikipedia.org/wiki/Template:Citation_needed. These pages are created by users, not by the software.

The situation you asked about looks like this:

"wreck-it Ralph" =normalized=> "Wreck-it Ralph" =redirected=> "Wreck-It Ralph" (found)

"wreck-it ralph" =normalized=> "Wreck-it ralph" (does not exist)

So you can't query the page "wreck-it ralph," because neither it nor a redirect with that title exists.

However, if you query "wreck-it Ralph," you may or may not get the langlinks of "Wreck-It Ralph," depending on the "&redirects=" parameter. Without this parameter, the API returns no langlinks, because the redirect page "Wreck-it Ralph" itself has none. With "&redirects=," the API follows the redirect (if one exists) and returns the langlinks of the target page, which are the ones you want. You can compare:

  • http://en.wikipedia.org/w/api.php?action=query&prop=langlinks&format=json&lllimit=10&llurl=&titles=wreck-it%20Ralph&redirects=
  • http://en.wikipedia.org/w/api.php?action=query&prop=langlinks&format=json&lllimit=10&llurl=&titles=wreck-it%20Ralph
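To see which steps the API took for a given title, you can read the "normalized" and "redirects" lists in the "query" part of the JSON response. The sketch below assumes that response layout; the helper names and the sample dictionaries (including the page id) are hand-written for illustration and are not captured API output.

```python
def resolve_title(query, title):
    """Follow the 'normalized' and then 'redirects' entries
    reported by the API for the requested title."""
    for step in ("normalized", "redirects"):
        for entry in query.get(step, []):
            if entry["from"] == title:
                title = entry["to"]
                break
    return title


def page_exists(query, title):
    """True if the resolved title maps to a page not flagged as missing."""
    final = resolve_title(query, title)
    return any(
        page["title"] == final and "missing" not in page
        for page in query["pages"].values()
    )


# Hand-written samples shaped like the "query" object (page id is a placeholder).
with_redirects = {
    "normalized": [{"from": "wreck-it Ralph", "to": "Wreck-it Ralph"}],
    "redirects": [{"from": "Wreck-it Ralph", "to": "Wreck-It Ralph"}],
    "pages": {"1": {"pageid": 1, "title": "Wreck-It Ralph"}},
}
lowercase = {
    "normalized": [{"from": "wreck-it ralph", "to": "Wreck-it ralph"}],
    "pages": {"-1": {"title": "Wreck-it ralph", "missing": ""}},
}

print(resolve_title(with_redirects, "wreck-it Ralph"))
print(page_exists(lowercase, "wreck-it ralph"))
```

A title that resolves to a page marked "missing" is a signal to fall back to a search instead of an exact-title lookup.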

As for why http://www.wikipedia.org/search-redirect.php?family=wikipedia&search=wreck-it+ralph&language=en works: search-redirect.php is not part of the API. It runs a search and returns the nearest match, while the API query discussed here returns only exact title matches.

Sorawee Porncharoenwase answered Sep 28 '22 09:09
