
Full urls of images of a given page on Wikipedia (only those I see on the page)

I want to extract the full URLs of all images on the "Google" page on Wikipedia.

I have tried with:

http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json

but this also returns images that are not related to Google, such as:

http://upload.wikimedia.org/wikipedia/en/a/a4/Flag_of_the_United_States.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/commons/f/fe/Crystal_Clear_app_browser.png

How can I extract only the images that I actually see on the Google page?

sparkle asked Sep 05 '25 05:09


1 Answer

  1. Retrieve the page's wikitext source: https://en.wikipedia.org/w/index.php?title=Google&action=raw
  2. Scan it for substrings like [[File:Google web search.png|thumb|left|On February 14, 2012, Google updated its homepage with a minor twist. There are no red lines above the options in the black bar, and there is a tab space before the "+You". The sign-in button has also changed, it is no longer in the black bar, instead under it as a button.]]
  3. Ask the API for all pictures on the page: http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
  4. Discard every URL except those whose picture names were found in step 2.
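Steps 1 and 3 are plain HTTP requests. Here is a minimal Python sketch of them, using only the standard library. The function names and the User-Agent string are my own placeholders, not part of any Wikipedia tooling; replace the User-Agent with something that identifies your script. Note that gimlimit=10 returns at most ten files, so a page with more images needs a higher limit.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder value (an assumption): put your own tool name and contact here.
USER_AGENT = "image-url-sketch/0.1 (placeholder contact)"


def raw_source_url(title):
    """Step 1: URL of the page's raw wikitext."""
    query = urlencode({"title": title, "action": "raw"})
    return "https://en.wikipedia.org/w/index.php?" + query


def image_api_url(title, limit=10):
    """Step 3: API URL listing the files used on the page, with their URLs."""
    query = urlencode({
        "action": "query",
        "titles": title,
        "generator": "images",
        "gimlimit": limit,
        "prop": "imageinfo",
        "iiprop": "url|dimensions|mime",  # urlencode escapes "|" as %7C
        "format": "json",
    })
    return "https://en.wikipedia.org/w/api.php?" + query


def fetch(url):
    """Download a URL and return its body as text."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

With these, fetch(raw_source_url("Google")) should give the wikitext for step 2, and json.loads(fetch(image_api_url("Google"))) the API response for step 4.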

Steps 2 and 4 need more explanation.

@2. The regexp /\b(File|Image):[^]|\n\r]+/ should be enough. In Ruby regexps, \b denotes a word boundary, which may not be supported in the language of your choice. This regexp matches all the cases I can think of: direct links such as [[File:something.jpg]], gallery tags such as <gallery>\nFile:one.jpg\nFile:two.jpg\n</gallery>, and templates such as {{Infobox|pic = File:something.jpg}}. However, it won't match file names that contain ]. I'm not sure whether such names are legal, but if they are, they must be very uncommon, so it should not be a big deal.
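For illustration, here is roughly how that regexp could look in Python, where \b is supported. Two details are my own adjustments rather than part of the regexp above: the ] is escaped for readability, and } is added to the excluded characters so that a match inside a template stops before the closing }}. The function name and the sample text are made up.

```python
import re

# Broad pattern: any "File:" or "Image:" reference, up to the next
# "]", "|", "}" or line break. Excluding "}" is an adjustment of mine.
FILE_RE = re.compile(r"\b(?:File|Image):[^\]|}\n\r]+")


def find_file_names(wikitext):
    """Return every File:/Image: reference found in the wikitext."""
    return [match.group(0).strip() for match in FILE_RE.finditer(wikitext)]


# Made-up sample covering a direct link, a gallery tag and a template.
sample = (
    "[[File:Google web search.png|thumb|left|caption]]\n"
    "<gallery>\nFile:one.jpg\nFile:two.jpg\n</gallery>\n"
    "{{Infobox|pic = File:something.jpg}}\n"
)

print(find_file_names(sample))
# ['File:Google web search.png', 'File:one.jpg', 'File:two.jpg', 'File:something.jpg']
```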

If you want to match only constructs like [[File:something.jpg|thumb|description]], the following regexp will work better: /\[\[(File|Image):[^]|]+/
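A Python sketch of the stricter variant, again with an escaped ] and with a capture group (my addition) so that the leading [[ is not part of the returned name. The function name and sample text are hypothetical.

```python
import re

# Strict pattern: only references that directly follow "[[".
DIRECT_LINK_RE = re.compile(r"\[\[((?:File|Image):[^\]|]+)")


def find_direct_file_names(wikitext):
    """Return only the files attached directly with [[File:...]]."""
    return [match.group(1).strip() for match in DIRECT_LINK_RE.finditer(wikitext)]


# The template reference is skipped; only the direct link is returned.
sample_text = "[[File:something.jpg|thumb|description]] {{Infobox|pic = File:icon.svg}}"
print(find_direct_file_names(sample_text))
# ['File:something.jpg']
```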

@4. I'd remove all characters matching /[^A-Za-z0-9]/ from the names before comparing them. It's easier than escaping them and, in most cases, enough.
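A possible Python sketch of this filtering step. It assumes the format=json response is shaped as query → pages → {page id: {"title": ..., "imageinfo": [{"url": ...}]}}, so check that against what the API actually returns to you. Dropping the File:/Image: prefix before comparing is my own addition, so that Image:X in the wikitext still matches File:X from the API. The sample response and its example.org URLs are made up.

```python
import re


def normalize(name):
    """Drop the File:/Image: prefix, then every non-alphanumeric character."""
    name = re.sub(r"^(?:File|Image):", "", name)
    return re.sub(r"[^A-Za-z0-9]", "", name)


def filter_image_urls(api_response, wanted_names):
    """Keep only URLs of files whose normalized name was found in the wikitext."""
    wanted = {normalize(name) for name in wanted_names}
    urls = []
    pages = api_response.get("query", {}).get("pages", {})
    for page in pages.values():
        if normalize(page.get("title", "")) in wanted:
            for info in page.get("imageinfo", []):
                urls.append(info["url"])
    return urls


# Made-up response in the assumed shape; the URLs are placeholders.
sample_response = {
    "query": {
        "pages": {
            "-1": {
                "title": "File:Google web search.png",
                "imageinfo": [{"url": "https://example.org/Google_web_search.png"}],
            },
            "-2": {
                "title": "File:Commons-logo.svg",
                "imageinfo": [{"url": "https://example.org/Commons-logo.svg"}],
            },
        }
    }
}

print(filter_image_urls(sample_response, ["Image:Google web search.png"]))
# ['https://example.org/Google_web_search.png']
```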

Icons are most often attached through templates, whereas pictures related to the article's subject are most often attached directly ([[File:…]]). There are exceptions, though: in some articles, pictures are attached with the {{Gallery}} template, and there is also the <gallery> tag, which introduces its own syntax for galleries. You will have to tune my solution to your needs, and even then it won't be perfect, but it should be good enough.

skalee answered Sep 07 '25 22:09