I want to extract the full URLs of all images on the "Google" page on Wikipedia.
I have tried this query:
http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
but this also returns images that are not related to Google, such as:
http://upload.wikimedia.org/wikipedia/en/a/a4/Flag_of_the_United_States.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/commons/f/fe/Crystal_Clear_app_browser.png
How can I extract only the images that I actually see on the Google page?
[[File:Google web search.png|thumb|left|On February 14, 2012, Google updated its homepage with a minor twist. There are no red lines above the options in the black bar, and there is a tab space before the "+You". The sign-in button has also changed, it is no longer in the black bar, instead under it as a button.]]
Steps 2 and 4 need more explanation.
@2. The regexp /\b(File|Image):[^]|\n\r]+/ should be enough. In Ruby's regexps, \b denotes a word boundary, which might be unsupported in the language of your choice. The regexp I proposed will match all the cases that come to my mind: [[File:something.jpg]], gallery tags: <gallery>\nFile:one.jpg\nFile:two.jpg\n</gallery>, and templates: {{Infobox|pic = File:something.jpg}}. However, it won't match filenames that contain ]. I'm not sure whether they're legal, but if they are, they must be very uncommon, so it should not be a big deal.
If you want to match only constructs like [[File:something.jpg|thumb|description]], the following regexp will work better: /\[\[(File|Image):[^]|]+/
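To see how the two patterns differ, here is a minimal Ruby sketch run against a made-up wikitext sample. The sample text and variable names are assumptions for illustration. Two small changes are mine, not part of the patterns above: the group is non-capturing, (?:...), so that String#scan returns whole matches instead of just "File" or "Image", and trailing } characters are stripped, because the character class does not exclude them and a template such as {{Infobox|pic = File:something.jpg}} would otherwise yield "File:something.jpg}}".

```ruby
# Made-up sample wikitext; the file names are for illustration only.
wikitext = <<~WIKI
  [[File:Google web search.png|thumb|left|A caption]]
  <gallery>
  File:one.jpg
  Image:two.jpg|Second picture
  </gallery>
  {{Infobox|pic = File:something.jpg}}
WIKI

# Broad pattern: any File:/Image: reference (links, galleries, templates).
broad = /\b(?:File|Image):[^\]|\n\r]+/
# Strict pattern: only direct [[File:...]] / [[Image:...]] links.
direct = /\[\[(?:File|Image):[^\]|]+/

# Strip trailing "}" left over from template syntax (my addition).
all_files = wikitext.scan(broad).map { |name| name.sub(/\}+\z/, "").strip }

# Drop the leading "[[" that the strict pattern includes in the match.
direct_files = wikitext.scan(direct).map { |name| name.delete_prefix("[[").strip }

puts "broad:  #{all_files.inspect}"
puts "direct: #{direct_files.inspect}"
```

On this sample the broad pattern finds four names and the strict one finds only File:Google web search.png.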
@4. I'd remove all characters matching /[^A-Za-z0-9]/ from the names. It's easier than escaping them and, in most cases, enough.
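In Ruby this is a one-line gsub. The helper name simplify_name below is hypothetical, and since step 4 itself is not shown here, what the simplified names are then used for is left open.

```ruby
# Hypothetical helper: keep only ASCII letters and digits in a file name,
# so the result needs no escaping.
def simplify_name(name)
  name.gsub(/[^A-Za-z0-9]/, "")
end

puts simplify_name("File:Google web search.png")
```

Note that this is lossy: "a-b.png" and "ab.png" both become "abpng", and non-ASCII letters are dropped as well, which is why it is enough only "in most cases".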
Icons are most often attached through templates, whereas pictures related to the article's subject are most often attached directly ([[File:…]]). There are exceptions, though: in some articles, pictures are attached with the {{Gallery}} template. There is also the <gallery> tag, which introduces special syntax for galleries. You'll have to tune my solution to your needs, and even then it won't be perfect, but it should be good enough.
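One possible way to tune it, sketched in Ruby under stated assumptions: collect files linked directly plus files listed inside <gallery> blocks, and ignore anything that only appears inside a template. The sample wikitext, the Flagicon usage, and the variable names are made up for illustration, and the {{Gallery}} template is not handled here.

```ruby
# Made-up sample: one direct link, one icon inside a template, one gallery.
wikitext = <<~WIKI
  [[File:Logo.png|thumb|Caption]]
  {{Flagicon|File:Flag.svg}}
  <gallery>
  File:One.jpg|First
  Image:Two.jpg
  </gallery>
WIKI

# Files linked directly with [[File:...]] or [[Image:...]].
direct = wikitext.scan(/\[\[((?:File|Image):[^\]|]+)/).flatten

# Files listed one per line inside <gallery>...</gallery> blocks.
gallery_bodies = wikitext.scan(/<gallery[^>]*>(.*?)<\/gallery>/m).flatten
in_galleries = gallery_bodies.flat_map do |body|
  body.scan(/^\s*((?:File|Image):[^|\n\r]+)/).flatten
end

# Template-only references (here File:Flag.svg) are deliberately left out.
content_files = (direct + in_galleries).map(&:strip).uniq

puts content_files
```

This is a heuristic, not a guarantee: an infobox picture that belongs to the article's subject is also attached through a template and would be skipped.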