Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

bigquery url decode

Is there an easy way to do URL decoding within the BigQuery query language? I'm working with a table that has a column containing URL-encoded strings in some values. For example:

http://xyz.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz

I extract the "url" parameter like so:

SELECT REGEXP_EXTRACT(column_name, "url=([^&]+)") as url 
from [mydataset.mytable]

which gives me:

http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345

What I would like to do is something like:

SELECT URL_DECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) as url 
from [mydataset.mytable]

thereby returning:

http://www.example.com/hello?v=12345

I would like to avoid using multiple REGEXP_REPLACE() statements (replacing %20, %3A, etc...) if possible.

Ideas?

like image 845
David M Smith Avatar asked Dec 12 '12 01:12

David M Smith


Video Answer


4 Answers

Below is built on top of @sigpwned answer, but slightly refactored and wrapped with SQL UDF (which has no limitation that JS UDF has so safe to use)

#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
  SELECT SAFE_CONVERT_BYTES_TO_STRING(
    ARRAY_TO_STRING(ARRAY_AGG(
        IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
      ), b''))
  FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i 
));
SELECT 
  column_name, 
  URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url
FROM `project.dataset.table`

can be tested with example from question as below

#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
  SELECT SAFE_CONVERT_BYTES_TO_STRING(
    ARRAY_TO_STRING(ARRAY_AGG(
        IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
      ), b''))
  FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i 
));
WITH `project.dataset.table` AS (
  SELECT 'http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz' column_name
)
SELECT 
  URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url,
  column_name
FROM `project.dataset.table`    

with result

Row url                                     column_name  
1   http://www.example.com/hello?v=12345    http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz   

Update with further quite optimized SQL UDF

CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
  SELECT STRING_AGG(
    IF(REGEXP_CONTAINS(y, r'^%[0-9a-fA-F]{2}'), 
      SAFE_CONVERT_BYTES_TO_STRING(FROM_HEX(REPLACE(y, '%', ''))), y), '' 
    ORDER BY i
    )
  FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}(?:%[0-9a-fA-F]{2})*|[^%]+")) y
  WITH OFFSET AS i 
));
like image 108
Mikhail Berlyant Avatar answered Sep 30 '22 12:09

Mikhail Berlyant


One more workaround is using a user-defined function.

#standardSQL
CREATE TEMPORARY FUNCTION URL_DECODE(enc STRING)
RETURNS STRING
LANGUAGE js AS """
  try { 
    return decodeURI(enc);;
  } catch (e) { return null }
  return null;
""";

SELECT ven_session, 
  URL_DECODE(REGEXP_EXTRACT(para,r'&kw=(\w|[^&]*)')) AS q
FROM raas_system.weblog_20170327 
WHERE para like '%&kw=%'
LIMIT 10
like image 28
robinsu Avatar answered Sep 30 '22 10:09

robinsu


I agree with everyone here that URLDECODE should be a native function. However, until that happens, it is possible to write a "native" URLDECODE:

SELECT id, SAFE_CONVERT_BYTES_TO_STRING(ARRAY_TO_STRING(ps, b'')) FROM (SELECT
  id,
  ARRAY_AGG(CASE
    WHEN REGEXP_CONTAINS(y, r"^%") THEN FROM_HEX(SUBSTR(y, 2))
    ELSE CAST(y AS bytes)
  END ORDER BY i) AS ps
  FROM (SELECT x AS id, REGEXP_EXTRACT_ALL(x, r"%[0-9a-fA-F]{2}|[^%]+") AS element FROM UNNEST(ARRAY['domodossola%e2%80%93locarno railway', 'gabu%c5%82t%c3%b3w']) AS x) AS x
  CROSS JOIN UNNEST(x.element) AS y WITH OFFSET AS i GROUP BY id);

In this example, I've tried and tested the implementation with a couple of percent-encoded page names from Wikipedia as the input. It should work with your input, too.

Obviously, this is extremely unwieldly! For that reason, I'd suggest building a materialized join table, or wrapping this in a view, rather than using this expression "naked" in your query. However, it does appear to get the job done, and it doesn't hit the UDF limits.

EDIT: @MikhailBerylyant's post below has wrapped this cumbersome implementation into a nice, tidy little SQL UDF. That's a much better way to handle this!

like image 38
sigpwned Avatar answered Sep 30 '22 11:09

sigpwned


It's a good feature request, but currently there is no built in BigQuery function that provides URL decoding.

like image 28
Michael Manoochehri Avatar answered Sep 30 '22 10:09

Michael Manoochehri