Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Mysql query to extract domains from urls

Tags:

mysql

I have this query to extract domain from URLs...

SELECT SUBSTRING(LEFT(url, LOCATE('/', url, 8) - 1), 8) AS domain...

It works only when the URL is something like www.google.com/something. It doesn't work with URLs like www.google.it (without trailing slash) or www.google.it/abc/xzy/ (permalink).

Do you know how to fix this query to make it more robust to different types of URLs?

like image 304
peppolone Avatar asked Sep 02 '25 14:09

peppolone


2 Answers

I had to combine some of the previous answers , plus a little more hackery for my data set . This is what works for me , it returns the domain and any sub-domains:

SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(target_url, '/', 3), '://', -1), '/', 1), '?', 1) AS domain

Explanation ( cause non-trivial SQL rarely makes sense ):

SUBSTRING_INDEX(target_url, '/', 3) - strips any path if the url has a protocol
SUBSTRING_INDEX(THAT, '://', -1) - strips any protocol from THAT
SUBSTRING_INDEX(THAT, '/', 1) - strips any path from THAT ( if there was no protocol )
SUBSTRING_INDEX(THAT, '?', 1) - strips the query string from THAT ( if there was no path or trailing / )

Test Cases:

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(target_url, '/', 3), '://', -1), '/', 1), '?', 1) AS domain
FROM ( 
    SELECT       'http://test.com' as target_url 
    UNION SELECT 'https://test.com' 
    UNION SELECT 'http://test.com/one' 
    UNION SELECT 'http://test.com/?huh' 
    UNION SELECT 'http://test.com?http://ouch.foo' 
    UNION SELECT 'test.com' 
    UNION SELECT 'test.com/one'
    UNION SELECT 'test.com/one/two'
    UNION SELECT 'test.com/one/two/three'
    UNION SELECT 'test.com/one/two/three?u=http://maaaaannn'
    UNION SELECT 'http://one.test.com'
    UNION SELECT 'one.test.com/one'
    UNION SELECT 'two.one.test.com/one' ) AS Test; 

Results:

'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'one.test.com'
'one.test.com'
'two.one.test.com'
like image 74
LostNomad311 Avatar answered Sep 05 '25 03:09

LostNomad311


remove www., anysubdomain and everything after /:

SUBSTRING_INDEX((SUBSTRING_INDEX((SUBSTRING_INDEX(url, 'http://', -1)), '/', 1)), '.', -2) as domain
like image 21
eville84 Avatar answered Sep 05 '25 03:09

eville84