 

Using regex to extract URLs from plain text with Perl

Tags:

regex

url

perl

How can I use Perl regexps to extract all URLs of a specific domain (with possibly variable subdomains) with a specific extension from plain text? I have tried:

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';
while ($stuff =~ m/(http\:\/\/.*?homepage.com\/.*?\.gif)/gmsi) {
    print $1 . "\n";
}

It fails horribly and gives me:

http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif
http://shomepage.com/woot.gif

I thought that wouldn't happen because I am using .*?, which ought to be non-greedy and give me the smallest match. Can anyone tell me what I am doing wrong? (I don't want some uber-complex, canned regexp to validate URLs; I want to know what I am doing wrong so I can learn from it.)
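For reference, what goes wrong here: the regex engine always prefers the *leftmost* possible starting position, and only then makes each `.*?` as short as it can. Since `.` (with `/s`) matches whitespace too, the first `.*?` happily stretches from `http://fail-o-tron.com` all the way to `homepage.com`. One minimal fix (a sketch, not the only way) is to forbid whitespace inside the match with `\S`, and to require a real subdomain boundary so `shomepage.com` no longer qualifies:

```perl
use strict;
use warnings;

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';

# \S can't cross whitespace, so a match can never span two URLs.
# (?:[\w-]+\.)* allows optional subdomains but demands a literal dot
# before "homepage.com", so "shomepage.com" is excluded.
# m{...} delimiters avoid having to escape the slashes.
while ($stuff =~ m{(http://(?:[\w-]+\.)*homepage\.com/\S*?\.gif)}gi) {
    print "$1\n";   # prints only http://homepage.com/woot.gif
}
```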

asked Jun 27 '09 by test1234

1 Answer

URI::Find is specifically designed to solve this problem. It will find all URIs and then you can filter them. It has a few heuristics to handle things like trailing punctuation.
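A sketch of that approach (assuming URI::Find is installed from CPAN; the domain/extension filter below is just an illustration for this question's data):

```perl
use strict;
use warnings;
use URI::Find;

my $text = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';

my @found;
my $finder = URI::Find->new(sub {
    my ($uri, $orig_text) = @_;
    # Keep only .gif URLs on homepage.com or a true subdomain of it.
    push @found, "$uri"
        if $uri->host =~ /(?:^|\.)homepage\.com$/i
        && $uri->path =~ /\.gif$/i;
    return $orig_text;   # leave the source text unmodified
});
$finder->find(\$text);

print "$_\n" for @found;
```

The callback receives each URI as a URI object, so filtering by host and path is string-mangling-free, and URI::Find's heuristics (trailing punctuation, parentheses) come along for free.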

UPDATE: Recently updated to handle Unicode.

answered Nov 15 '22 by Schwern