Using regex to extract URLs from plain text with Perl

Question

How can I use Perl regexps to extract all URLs of a specific domain (with possibly variable subdomains) with a specific extension from plain text? I have tried:

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';
while ($stuff =~ m/(http\:\/\/.*?homepage.com\/.*?\.gif)/gmsi)
{
    print $1."\n";
}

It fails horribly and gives me:

http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif
http://shomepage.com/woot.gif

I thought that wouldn't happen because I am using .*?, which ought to be non-greedy and give me the smallest match. Can anyone tell me what I am doing wrong? (I don't want some uber-complex, canned regexp to validate URLs; I want to know what I am doing wrong so I can learn from it.)

Schwern · Accepted Answer

URI::Find is specifically designed to solve this problem. It will find all URIs and then you can filter them. It has a few heuristics to handle things like trailing punctuation.
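As a rough sketch of that approach (not necessarily how the answerer would write it), the snippet below lets URI::Find locate every URI and filters them in the callback. The `@gifs` array and the two filter patterns are illustrative assumptions, and the sketch assumes `shomepage.com` should be rejected because it is a different domain rather than a subdomain.

```perl
use strict;
use warnings;
use URI::Find;

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg '
          . 'http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';

my @gifs;    # hypothetical container for the URLs we keep

my $finder = URI::Find->new(sub {
    my ($uri, $orig_text) = @_;    # URI object and the text as it appeared

    # Not every URI scheme has a host (mailto: does not), so check first.
    my $host = $uri->can('host') ? $uri->host : undef;

    # Assumed filter: homepage.com or any subdomain of it, path ending in .gif.
    if (defined $host
        && $host =~ /(?:^|\.)homepage\.com\z/i
        && $uri->path =~ /\.gif\z/i) {
        push @gifs, "$uri";
    }

    return $orig_text;    # the return value replaces the URI in the text
});

$finder->find(\$stuff);    # takes a reference to the text to scan

print "$_\n" for @gifs;
```

Anchoring the host pattern with `(?:^|\.)` is what keeps `shomepage.com` out while still allowing something like `img.homepage.com`.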

UPDATE: URI::Find has since been updated to handle Unicode.
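On the "what am I doing wrong" part of the question: a non-greedy quantifier only keeps each `.*?` as short as possible from wherever the match started, and the regex engine always takes the leftmost starting position that can succeed. The first `http://` in the text can succeed, because with `/s` the `.*?` is free to run across spaces until it reaches `homepage.com`, so the match begins at `fail-o-tron`. The same `.*?` also absorbs the `s` in `shomepage.com`. The sketch below is one possible hand-written fix, not a general URL validator: it swaps `.*?` for classes that cannot cross whitespace or a slash, and escapes the dots. It assumes subdomains are separated by dots and that `shomepage.com` should not match.

```perl
use strict;
use warnings;

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg '
          . 'http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';

my @matches;
while ($stuff =~ m{
        (
            http://
            (?: [^\s/]+ \. )?     # optional subdomain part, ending in a dot
            homepage\.com /       # the domain itself, dots escaped
            \S*? \.gif            # rest of the path; \S cannot cross a space
        )
    }gix) {
    push @matches, $1;
}

print "$_\n" for @matches;    # prints http://homepage.com/woot.gif
```

Using `m{...}` with the `/x` flag avoids having to escape every slash and leaves room for comments inside the pattern.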
