Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reshape from wide to long in big query (standard SQL)

Unfortunately reshaping in BQ it's not as easy as in R and I can't export my data for this project.

Here is input

date    country A             B         C      D
20170928    CH  3000.3        121       13     3200
20170929    CH  2800.31       137       23     1614.31

Expected output

date    country Metric  Value  
20170928    CH  A       3000.3  
20170928    CH  B       121     
20170928    CH  C       13     
20170928    CH  D       3200
20170929    CH  A       2800.31 
20170929    CH  B       137       
20170929    CH  C       23     
20170929    CH  D       1614.31

Also my table has many more columns and rows (but I assume a lot of manual will be required)

like image 796
AlienDeg Avatar asked Dec 05 '17 10:12

AlienDeg


People also ask

How do you flatten an array in a large query?

To flatten an entire column of ARRAY s while preserving the values of the other columns in each row, use a correlated cross join to join the table containing the ARRAY column to the UNNEST output of that ARRAY column.

How do you offset in BigQuery?

OFFSET means that the numbering starts at zero, ORDINAL means that the numbering starts at one. A given array can be interpreted as either 0-based or 1-based. When accessing an array element, you must preface the array position with OFFSET or ORDINAL , respectively; there is no default behavior.


3 Answers

Below is for BigQuery Standard SQL and does not require repeating selects depends on number of columns. It will pick as many as you have and transform them into metrics and values

#standardSQL
SELECT DATE, country,
  metric, SAFE_CAST(value AS FLOAT64) value
FROM (
  SELECT DATE, country, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value 
  FROM `project.dataset.yourtable` t, 
  UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')

You can test / play with above using dummy data as in your question

#standardSQL
WITH `project.dataset.yourtable` AS (
  SELECT '20170928' DATE, 'CH' country, 3000.3 A, 121 B, 13 C, 3200 D UNION ALL
  SELECT '20170929', 'CH', 2800.31, 137, 23, 1614.31
)
SELECT DATE, country,
  metric, SAFE_CAST(value AS FLOAT64) value
FROM (
  SELECT DATE, country, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value 
  FROM `project.dataset.yourtable` t, 
  UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')

result is as expected

DATE        country metric  value    
20170928    CH      A       3000.3   
20170928    CH      B       121.0    
20170928    CH      C       13.0     
20170928    CH      D       3200.0   
20170929    CH      A       2800.31  
20170929    CH      B       137.0    
20170929    CH      C       23.0     
20170929    CH      D       1614.31  
like image 169
Mikhail Berlyant Avatar answered Oct 17 '22 02:10

Mikhail Berlyant


You need UNION which is denoted using commas in bigquery

SELECT date, country, Metric, Value
FROM (
  SELECT date, country, 'A' as Metric,  A as Value FROM your_table
), (
  SELECT date, country, 'B' as Metric,  B as Value FROM your_table
), (
  SELECT date, country, 'C' as Metric,  C as Value FROM your_table
) , (
  SELECT date, country, 'D' as Metric,  D as Value FROM your_table
)
like image 36
Radim Bača Avatar answered Oct 17 '22 01:10

Radim Bača


Most answers that I managed to find required specifying the name of EVERY column to be melt. This is not tractable when I have hundreds to thousands of columns in the table. Here is an answer that works for an arbitrarily wide table.

It utilizes dynamic SQL and automatically extract multiple column names from the data schema, collate a command string and then evaluate that string. This is intended to mimic Python pandas.melt() / R reshape2::melt() behavior.

I intentionally did not create user defined functions because of some undesirable properties of UDFs. Depending on how you use this, you may or may not want to do that.

Input:

id0 id1 _2020_05_27 _2020_05_28
1   1   11          12
1   2   13          14
2   1   15          16
2   2   17          18

Output:

id0 id1 date         value
1   2   _2020_05_27  13
1   2   _2020_05_28  14
2   2   _2020_05_27  17
2   2   _2020_05_28  18
1   1   _2020_05_27  11
1   1   _2020_05_28  12
2   1   _2020_05_27  15
2   1   _2020_05_28  16
#standardSQL

-- PANDAS MELT FUNCTION IN GOOGLE BIGQUERY
-- author: Luna Huang
-- email: [email protected]

-- run this script with Google BigQuery Web UI in the Cloud Console

-- this piece of code functions like the pandas melt function
-- pandas.melt(id_vars, value_vars, var_name, value_name, col_level=None)
-- without utilizing user defined functions (UDFs)
-- see below for where to input corresponding arguments

DECLARE cmd STRING;
DECLARE subcmd STRING;
SET cmd = ("""
  WITH original AS (
    -- query to retrieve the original table
    %s
  ),
  nested AS (
    SELECT
    [
      -- sub command to be automatically generated
      %s
    ] as s,
    -- equivalent to id_vars in pandas.melt()
    %s,
    FROM original
  )
  SELECT
    -- equivalent to id_vars in pandas.melt()
    %s,
    -- equivalent to var_name in pandas.melt()
    s.key AS %s,
    -- equivalent to value_name in pandas.melt()
    s.value AS %s,
  FROM nested
  CROSS JOIN UNNEST(nested.s) AS s
""");
SET subcmd = ("""
  WITH
  columns AS (
    -- query to retrieve the column names
    -- equivalent to value_vars in pandas.melt()
    -- the resulting table should have only one column
    -- with the name: column_name
    %s
  ),
  scs AS (
    SELECT FORMAT("STRUCT('%%s' as key, %%s as value)", column_name, column_name) AS sc
    FROM columns
  )
  SELECT ARRAY_TO_STRING(ARRAY (SELECT sc FROM scs), ",\\n")
""");

-- -- -- EXAMPLE BELOW -- -- --

-- SET UP AN EXAMPLE TABLE --
CREATE OR REPLACE TABLE `tmp.example`
(
  id0 INT64,
  id1 INT64,
  _2020_05_27 INT64,
  _2020_05_28 INT64,
);
INSERT INTO `tmp.example` VALUES (1, 1, 11, 12);
INSERT INTO `tmp.example` VALUES (1, 2, 13, 14);
INSERT INTO `tmp.example` VALUES (2, 1, 15, 16);
INSERT INTO `tmp.example` VALUES (2, 2, 17, 18);

-- MELTING STARTS --
-- execute these two command to melt the table

-- the first generates the STRUCT commands
-- and saves a string in subcmd
EXECUTE IMMEDIATE FORMAT(
  -- please do not change this argument
  subcmd,
  -- query to retrieve the column names
  -- equivalent to value_vars in pandas.melt()
  -- the resulting table should have only one column
  -- with the name: column_name
  """
    SELECT column_name
    FROM `tmp.INFORMATION_SCHEMA.COLUMNS`
    WHERE (table_name = "example") AND (column_name NOT IN ("id0", "id1"))
  """
) INTO subcmd;

-- the second implements the melting
EXECUTE IMMEDIATE FORMAT(
  -- please do not change this argument
  cmd,
  -- query to retrieve the original table
  """
    SELECT *
    FROM `tmp.example`
  """,
  -- please do not change this argument
  subcmd,
  -- equivalent to id_vars in pandas.melt()
  -- !!please type these twice!!
  "id0, id1", "id0, id1",
  -- equivalent to var_name in pandas.melt()
  "date",
  -- equivalent to value_name in pandas.melt()
  "value"
);
like image 2
Luna Huang Avatar answered Oct 17 '22 01:10

Luna Huang