
Extract full domain from url in Google BigQuery using regex

Could you help me build a regular expression for Google BigQuery's REGEXP_EXTRACT that extracts the full domain from a given input URL?

Parsing conditions:

  • Capturing should start:
    • If there is a // in the URL: after the first // occurrence
    • If there is no //: at the beginning of the string
  • Capturing should end: before the first ?, / or &, or at the end of the string if no ?, / or & is found

Some examples:

htp://www.google.com --> www.google.com
htp://www.google.com/item/ --> www.google.com
htp://www.google.com?source=google --> www.google.com
htp://www.google.com&source=google --> www.google.com
www.google.com --> www.google.com
www.google.com/item/ --> www.google.com
www.google.com?source=google --> www.google.com
www.google.com&source=google --> www.google.com
http://google.com&source=google --> google.com
https://www.example-code.com/vb/string.asp --> www.example-code.com

I created this REGEX:

REGEXP_EXTRACT('google.it?medium=cpc?cobranded=google&keyword=foo', r'//([^/|^?|^&]+)')

But it only works for URLs that contain //. I can't come up with a regex that also works when there is no // in the URL.

Jonk asked Dec 29 '25 21:12


1 Answer

BigQuery (legacy SQL) provides the following three URL functions:

HOST() -- Given a URL, returns the hostname as a string.

DOMAIN() -- Given a URL, returns the domain as a string.

TLD() -- Given a URL, returns the top level domain plus any country domain in the URL.
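
If a regex is still preferred over the built-in functions, one possible pattern makes the part up to // optional and then captures everything up to the first /, ? or &. The sketch below only checks that pattern against the examples from the question, using Python's re module as a stand-in; the names HOST_PATTERN and extract_host are made up for illustration. As an assumption that goes slightly beyond the stated rules, a // that appears only after the first /, ? or & (for example inside a query string) is ignored.

```python
import re

# Hypothetical pattern: an optional prefix ending in "//", then one capture
# group that stops at the first "/", "?" or "&" (or at the end of the string).
HOST_PATTERN = re.compile(r'^(?:[^/?&]*//)?([^/?&]+)')


def extract_host(url):
    """Return the captured host part of url, or None if nothing matches."""
    match = HOST_PATTERN.search(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Inputs taken from the question.
    for url in (
        "htp://www.google.com/item/",
        "www.google.com&source=google",
        "google.it?medium=cpc?cobranded=google&keyword=foo",
        "https://www.example-code.com/vb/string.asp",
    ):
        print(url, "-->", extract_host(url))
```

The pattern uses only a non-capturing group and negated character classes, so the same string should also be usable as REGEXP_EXTRACT(url, r'^(?:[^/?&]*//)?([^/?&]+)'), but that has not been run against BigQuery here.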

tenideas answered Jan 01 '26 11:01


