Could you help me build a regular expression, to be used in Google BigQuery with REGEXP_EXTRACT, that extracts the full domain from a given input url?
Parsing conditions:
Start of the domain: if there is a // in the url, right after the first // occurrence; if there is no //, from the beginning of the string.
End of the domain: the first ?, the first / or the first &, or the end of the string if no ?, / or & is found.
Some examples:
http://www.google.com --> www.google.com
http://www.google.com/item/ --> www.google.com
http://www.google.com?source=google --> www.google.com
http://www.google.com&source=google --> www.google.com
www.google.com --> www.google.com
www.google.com/item/ --> www.google.com
www.google.com?source=google --> www.google.com
www.google.com&source=google --> www.google.com
http://google.com&source=google --> google.com
https://www.example-code.com/vb/string.asp --> www.example-code.com
I created this REGEX:
REGEXP_EXTRACT('google.it?medium=cpc?cobranded=google&keyword=foo', r'//([^/|^?|^&]+)')
But it only works for urls that contain //. I can't come up with a regex that also works when there is no // in the url.
BigQuery (legacy SQL) provides the following three URL functions:
HOST() -- Given a URL, returns the hostname as a string.
DOMAIN() -- Given a URL, returns the domain as a string.
TLD() -- Given a URL, returns the top level domain plus any country domain in the URL.
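If a regular expression is still preferred, the parsing conditions from the question can be sketched with an optional prefix that ends at the first //, followed by a capture that stops at the first /, ? or &. The snippet below is only an illustration in Python: the pattern, the name DOMAIN_PATTERN and the helper extract_full_domain are assumptions made for this sketch and do not come from the BigQuery documentation.

```python
import re

# Hypothetical pattern, written for illustration only.
#   ^(?:[^/?&]*//)?  optional prefix: any run without / ? & followed by the first "//"
#   ([^/?&]+)        capture: everything up to the first /, ? or & (or the end of the string)
DOMAIN_PATTERN = re.compile(r'^(?:[^/?&]*//)?([^/?&]+)')


def extract_full_domain(url):
    """Return the captured domain part, or None if nothing matches."""
    match = DOMAIN_PATTERN.match(url)
    return match.group(1) if match else None


# Inputs taken from the question's examples.
examples = [
    'http://www.google.com/item/',
    'www.google.com?source=google',
    'http://google.com&source=google',
    'https://www.example-code.com/vb/string.asp',
    'google.it?medium=cpc?cobranded=google&keyword=foo',
]

for url in examples:
    print(url, '-->', extract_full_domain(url))
```

The pattern uses only anchors, a non-capturing group and negated character classes, so the same string should in principle work as the second argument of REGEXP_EXTRACT, for example r'^(?:[^/?&]*//)?([^/?&]+)'. That has not been checked against BigQuery here.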