Extract full domain from url in Google BigQuery using regex

Question

I need help building a regular expression for Google BigQuery's REGEXP_EXTRACT that extracts the full domain from a given input URL.

Parsing conditions:

Start capturing should be:
- If there is a // in the url: after the first // occurrence
- If there is not a //: from the beginning of the string
End capturing should be: at the first ?, / or & (not included in the result), or at the end of the string if no ?, / or & is found

Some examples:

htp://www.google.com --> www.google.com
htp://www.google.com/item/ --> www.google.com
htp://www.google.com?source=google --> www.google.com
htp://www.google.com&source=google --> www.google.com
www.google.com --> www.google.com
www.google.com/item/ --> www.google.com
www.google.com?source=google --> www.google.com
www.google.com&source=google --> www.google.com
http://google.com&source=google --> google.com
https://www.example-code.com/vb/string.asp --> www.example-code.com

I created this REGEX:

REGEXP_EXTRACT('google.it?medium=cpc?cobranded=google&keyword=foo', r'//([^/|^?|^&]+)')

But it only works for URLs that contain //. I can't come up with a regex that also works when there is no // in the URL.

tenideas · Accepted Answer

BigQuery (legacy SQL) provides the following three functions:

HOST() -- Given a URL, returns the hostname as a string.

DOMAIN() -- Given a URL, returns the domain as a string.

TLD() -- Given a URL, returns the top level domain plus any country domain in the URL.
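If a regex is still preferred, the parsing conditions from the question can be sketched as an optional prefix that ends at the first //, followed by a capture that stops at the first /, ? or &. The snippet below is only an illustration in Python, not BigQuery itself: the pattern and the helper name extract_full_domain are my own assumptions, and while the same pattern string would presumably be passed as r'^(?:.*?//)?([^/?&]+)' to REGEXP_EXTRACT, it has not been verified against BigQuery here.

```python
import re

# Assumed pattern (not from the original answer):
#   ^(?:.*?//)?   optionally skip everything up to and including the first "//"
#   ([^/?&]+)     capture up to the first "/", "?" or "&", or the end of the string
FULL_DOMAIN = re.compile(r'^(?:.*?//)?([^/?&]+)')


def extract_full_domain(url):
    """Return the full domain per the question's rules, or None if nothing matches."""
    match = FULL_DOMAIN.match(url)
    return match.group(1) if match else None


# Inputs taken from the question's examples.
examples = [
    'htp://www.google.com',
    'htp://www.google.com/item/',
    'www.google.com?source=google',
    'www.google.com&source=google',
    'http://google.com&source=google',
    'https://www.example-code.com/vb/string.asp',
    'google.it?medium=cpc?cobranded=google&keyword=foo',
]

for url in examples:
    print(url, '-->', extract_full_domain(url))
```

Note that this follows the stated rule literally: a // that first appears later in the path or query string is also treated as the start marker, so such inputs may need a stricter prefix.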

Tags:

regex

google-bigquery

Jonk
