
Javascript: extract URLs from string (inc. querystring) and return array

I know this has been asked a thousand times before (apologies), but after searching SO, Google, etc. I have yet to find a conclusive answer.

Basically, I need a JS function which, when passed a string, identifies and extracts all URLs based on a regex and returns an array of everything found, e.g.:

function findUrls(searchText){
    var regex=???
    result= searchText.match(regex);
    if(result){return result;}else{return false;}
}

The function should be able to detect and return any potential URLs. I am aware of the inherent difficulties/issues with this (closing parentheses etc.), so I have a feeling the process needs to be:

Split the string (searchText) into distinct sections, each starting and ending with either nothing, a space or a carriage return on either side of it, resulting in distinct content chunks (e.g. do a split).

For each content chunk that results from the split, see whether it fits the logic for a URL of any construction, namely, whether it contains a period immediately followed by text (the one constant rule for qualifying a potential URL).

The regex should see whether the period is immediately followed by other text, of the type allowable for a tld, directory structure & query string, and preceded by text of the allowable type for a URL.
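
The steps above can be sketched roughly as follows. This is only an illustration of the split-then-test idea, not a complete URL matcher: the name findUrlCandidates, the loose "period between two URL-ish characters" test and the punctuation trimming are all assumptions made for the example.

```javascript
// Hypothetical helper illustrating the split-then-test process.
function findUrlCandidates(searchText) {
    // Step 1: split into chunks on spaces, tabs and line breaks.
    var chunks = searchText.split(/\s+/);
    // Step 2: a chunk qualifies if a period sits between two URL-ish characters.
    // (Assumed rule; deliberately loose, so false positives are expected.)
    var looksLikeUrl = /[a-z0-9\-]\.[a-z0-9\-]/i;
    var found = [];
    for (var i = 0; i < chunks.length; i++) {
        // Strip wrapping brackets/quotes and trailing sentence punctuation.
        var chunk = chunks[i].replace(/^[("'<\[]+|[)"'>\],.;:!?]+$/g, "");
        if (looksLikeUrl.test(chunk)) {
            found.push(chunk);
        }
    }
    return found; // always an array, empty when nothing qualifies
}
```

Because the chunk is kept whole, any query string stays attached. The trimming step would also strip a genuine trailing "?" or ")" from a URL, which is one of the parenthesis issues mentioned above.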

I am aware false positives may result; however, any returned values will then be checked with a call to the URL itself, so this can be ignored. The other functions I have found often don't return the URL's query string, if present.

From a block of text, the function should thus be able to return any type of URL, even if it means identifying will.i.am as a valid one!

e.g. http://www.google.com, google.com, www.google.com, http://google.com, ftp.google.com, https:// etc., and any derivation thereof with a query string, should be returned.

Many thanks, and apologies again if this exists elsewhere on SO, but my searches haven't returned it.

SW4 asked Jun 26 '12 13:06


4 Answers

I just use URI.js -- makes it easy.

var source = "Hello www.example.com,\n"
    + "http://google.com is a search engine, like http://www.bing.com\n"
    + "http://exämple.org/foo.html?baz=la#bumm is an IDN URL,\n"
    + "http://123.123.123.123/foo.html is IPv4 and "
    + "http://fe80:0000:0000:0000:0204:61ff:fe9d:f156/foobar.html is IPv6.\n"
    + "links can also be in parens (http://example.org) "
    + "or quotes »http://example.org«.";

var result = URI.withinString(source, function(url) {
    return "<a>" + url + "</a>";
});

/* result is:
Hello <a>www.example.com</a>,
<a>http://google.com</a> is a search engine, like <a>http://www.bing.com</a>
<a>http://exämple.org/foo.html?baz=la#bumm</a> is an IDN URL,
<a>http://123.123.123.123/foo.html</a> is IPv4 and <a>http://fe80:0000:0000:0000:0204:61ff:fe9d:f156/foobar.html</a> is IPv6.
links can also be in parens (<a>http://example.org</a>) or quotes »<a>http://example.org</a>«.
*/
  • https://github.com/medialize/URI.js
  • http://medialize.github.io/URI.js/
chovy answered Nov 09 '22 10:11


The following regular expression extracts URLs from a string (including the query string) and returns an array:

var url = "asdasdla hakjsdh aaskjdh https://www.google.com/search?q=add+a+element+to+dom+tree&oq=add+a+element+to+dom+tree&aqs=chrome..69i57.7462j1j1&sourceid=chrome&ie=UTF-8 askndajk nakjsdn aksjdnakjsdnkjsn";

var matches = url.match(/\bhttps?::\/\/\S+/gi) || url.match(/\bhttps?:\/\/\S+/gi);

Output:

["https://www.google.com/search?q=format+to+6+digir&…s=chrome..69i57.5983j1j1&sourceid=chrome&ie=UTF-8"]

Note: This handles both http:// with a single colon and http::// with a double colon, and likewise for https. The double-colon pattern is tried first, and the single-colon pattern is used only if the first one finds nothing.
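
As a possible variation, the two patterns can be folded into one by making the second colon optional. With the `||` form, text that contains both spellings returns only the double-colon matches; a single pattern returns both kinds. The function name findHttpUrls is made up for this sketch, and it still only finds URLs that start with http or https.

```javascript
// Sketch: one pattern for "http://", "https://" and the "::" variants.
function findHttpUrls(text) {
    // "::?" makes the second colon optional; \S+ runs to the next
    // whitespace, so the query string stays part of the match.
    var matches = text.match(/\bhttps?::?\/\/\S+/gi);
    // String#match returns null when nothing is found; normalise to [].
    return matches || [];
}
```

Since \S+ stops only at whitespace, punctuation that directly follows a URL (a trailing comma or closing parenthesis) ends up in the match and would need trimming afterwards.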

Manoj Selvin answered Nov 09 '22 09:11


You could use the regex from URI.js:

// gruber revised expression - http://rodneyrehm.de/t/url-regex.html
var uri_pattern = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/ig;

String#match and/or String#replace may help…
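
For instance, a minimal wrapper around String#match might look like this. The pattern is copied unchanged from above; the wrapper name findUrls is borrowed from the question, and returning an empty array instead of false is a choice made for this sketch.

```javascript
// gruber revised expression, copied unchanged from the answer above.
var uri_pattern = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/ig;

// Wrapper in the shape the question asks for.
function findUrls(searchText) {
    // With the g flag, String#match returns every full match, or null.
    return searchText.match(uri_pattern) || [];
}

var urls = findUrls("Visit www.example.com, or (http://example.org/a?b=c).");
// urls: ["www.example.com", "http://example.org/a?b=c"]
```

Note that a bare host such as google.com is only picked up by this pattern when it is followed by a slash, so it is stricter than the "anything with a period" rule in the question.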

rodneyrehm answered Nov 09 '22 09:11


Try this:

var expression = /[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi;

You can test the regexp at http://gskinner.com/RegExr/
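
A short usage sketch, assuming the expression is applied with String#match (the function name extractUrls is made up here). It shows that scheme-less hosts and query strings are both picked up:

```javascript
// Expression copied unchanged from above.
var expression = /[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi;

// Hypothetical wrapper that always returns an array.
function extractUrls(text) {
    return text.match(expression) || [];
}

var hits = extractUrls("see google.com/search?q=1 and will.i.am today");
// hits: ["google.com/search?q=1", "will.i.am"]
```

The `{2,4}` part limits the final label to two to four letters, so hosts ending in a longer TLD will not match as written.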

Naigel answered Nov 09 '22 10:11