I'm trying to write a regular expression in Java that extracts URLs from text documents, but so far I've been unsuccessful. The two cases I want to capture are:
URLs that start with http://
URLs that start with www. (that is, with the protocol missing from the front)
In both cases the match should include any query string parameters.
Thanks! I really wish I knew regular expressions better.
Cheers,
If you want to make sure you are really matching a URL and not just some word starting with 'www.', you can use the expression DVK mentioned earlier. I modified it slightly and wrote a small code snippet as a starting point for you:
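import java.util.*;
import java.util.regex.*;

class FindUrls
{
    // Returns every substring of input that matches the URL pattern below.
    public static List<String> extractUrls(String input) {
        List<String> result = new ArrayList<String>();
        Pattern pattern = Pattern.compile(
            // scheme (http, https, ftp, ftps), a relative prefix, or a literal "www."
            "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www\\.)" +
            // optional user:password@, then host name and top-level domain
            "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" +
            "|mil|biz|info|mobi|name|aero|jobs|museum" +
            "|travel|[a-z]{2}))(:[\\d]{1,5})?" +
            // optional path
            "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
            // optional query string: ?key=value followed by &key=value pairs
            "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
            "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
            // optional fragment
            "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b",
            // the character classes are lowercase, so match case-insensitively
            Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }
}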
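If strict validation matters less than simply catching the two cases from the question, a much looser pattern may be enough. The following is only a sketch under that assumption: the class name SimpleUrlFinder and the rule "prefix followed by any run of non-whitespace characters" are illustrative choices, not part of the expression above. The pattern does not check that the match is a well-formed URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the answer above.
class SimpleUrlFinder {
    // Assumption: a URL is "http://", "https://" or "www." followed by
    // everything up to the next whitespace character (query string included).
    private static final Pattern URL = Pattern.compile(
            "\\b(?:https?://|www\\.)\\S+", Pattern.CASE_INSENSITIVE);

    static List<String> extractUrls(String input) {
        List<String> result = new ArrayList<String>();
        Matcher matcher = URL.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }
}
```

For the input "Visit http://example.com/path?x=1&y=2 or www.example.org today" this returns the two entries http://example.com/path?x=1&y=2 and www.example.org. Because the match runs to the next whitespace, punctuation that directly follows a URL in running text (a trailing comma or period, say) ends up in the match and would need trimming separately.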
All RegEx-based code is over-engineered, especially the code from the most-voted answer, and here is why: it will find only valid URLs! For example, it will ignore anything that starts with "http://" but contains non-ASCII characters.
What's more, I have seen processing times of 1-2 seconds (single-threaded, dedicated machine) with the Java RegEx package on very small, simple sentences with nothing special in them; possibly a bug in Java 6 RegEx...
The simplest and fastest solution would be to use StringTokenizer to split the text into tokens, remove the tokens starting with "http://" etc., and concatenate the remaining tokens into text again.
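A minimal sketch of that tokenizer approach might look like the following. The class name TokenUrlFinder, the method names, and the exact prefix list ("http://", "https://", "www.") are assumptions made for illustration. Tokens are split on whitespace only, which is StringTokenizer's default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical helper class illustrating the tokenizer idea.
class TokenUrlFinder {
    // Assumed prefixes; extend the list as needed.
    private static final String[] PREFIXES = {"http://", "https://", "www."};

    static boolean looksLikeUrl(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        for (String prefix : PREFIXES) {
            if (lower.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Collects the tokens that start with one of the prefixes.
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<String>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (looksLikeUrl(token)) {
                urls.add(token);
            }
        }
        return urls;
    }

    // Rebuilds the text without the URL-like tokens, joined by single spaces.
    static String removeUrls(String text) {
        StringBuilder sb = new StringBuilder();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (looksLikeUrl(token)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }
}
```

Since no validation is done, a token such as http://пример.рф/путь is picked up just like an ASCII one. The trade-off is that original whitespace is collapsed when the text is rebuilt, and punctuation attached to a token stays attached.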
If you really want to use RegEx with Java, try Automaton.