
Detect and extract URLs from a string?

Tags:

java

regex

url

This is an easy question, but I just don't get it. I want to detect URLs in a string and replace them with shortened ones.

I found this expression on Stack Overflow, but the result is just http:

    Pattern p = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(str);
    boolean result = m.find();
    while (result) {
        for (int i = 1; i <= m.groupCount(); i++) {
            String url = m.group(i);
            str = str.replace(url, shorten(url));
        }
        result = m.find();
    }
    return html;

Is there any better idea?

Shisoft asked Apr 19 '11 08:04




1 Answer

Let me preface this by saying that I'm not a huge advocate of regex for complex cases. Trying to write the perfect expression for something like this is very difficult. That said, I do happen to have one for detecting URLs, and it's backed by a 350-line unit test class that passes. Someone started with a simple regex, and over the years we've grown the expression and test cases to handle the issues we've found. It's definitely not trivial:

    // Pattern for recognizing a URL, based off RFC 3986
    private static final Pattern urlPattern = Pattern.compile(
            "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                    + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                    + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

Here's an example of using it:

    Matcher matcher = urlPattern.matcher("foo bar http://example.com baz");
    while (matcher.find()) {
        int matchStart = matcher.start(1);
        int matchEnd = matcher.end();
        // now you have the offsets of a URL match
    }
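To connect those offsets back to the original goal of replacing each URL with a shortened one, here is a minimal sketch. It assumes the shortener is passed in as a function, since the question's `shorten` is not shown; the class name `UrlShortenSketch`, the method `replaceUrls`, and the angle-bracket "shortener" in `main` are hypothetical and only there for illustration. The sketch rebuilds the string from the match offsets rather than calling `str.replace`, so a replacement cannot accidentally touch text outside the match.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlShortenSketch {
    // Pattern copied unchanged from the answer above.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                    + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                    + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

    // Hypothetical helper: replaces every matched URL with shorten.apply(url).
    static String replaceUrls(String text, UnaryOperator<String> shorten) {
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the previously copied region
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            int start = m.start(1); // group 1 begins at the scheme or "www."
            int end = m.end();      // end of the whole match, i.e. end of the URL
            out.append(text, last, start);                         // text before the URL
            out.append(shorten.apply(text.substring(start, end))); // replaced URL
            last = end;
        }
        out.append(text, last, text.length()); // trailing text after the last URL
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in "shortener" that just wraps the URL in angle brackets.
        System.out.println(
                replaceUrls("foo bar http://example.com baz", url -> "<" + url + ">"));
    }
}
```

This also shows why the loop in the question returned only `http`: `m.group(i)` for `i >= 1` returns the capture groups, and the only capture group in that pattern is the scheme. The whole match is `m.group()` (group 0), or equivalently the substring between the start and end offsets as used here.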
WhiteFang34 answered Oct 15 '22 11:10