Reshape from wide to long in big query (standard SQL)

Tags:

google-bigquery

Unfortunately reshaping in BQ it's not as easy as in R and I can't export my data for this project.

Here is input

date    country A             B         C      D
20170928    CH  3000.3        121       13     3200
20170929    CH  2800.31       137       23     1614.31

Expected output

date    country Metric  Value  
20170928    CH  A       3000.3  
20170928    CH  B       121     
20170928    CH  C       13     
20170928    CH  D       3200
20170929    CH  A       2800.31 
20170929    CH  B       137       
20170929    CH  C       23     
20170929    CH  D       1614.31

Also my table has many more columns and rows (but I assume a lot of manual will be required)

796

asked Dec 05 '17 10:12

3 Answers

Below is for BigQuery Standard SQL and does not require repeating selects depends on number of columns. It will pick as many as you have and transform them into metrics and values

#standardSQL
SELECT DATE, country,
  metric, SAFE_CAST(value AS FLOAT64) value
FROM (
  SELECT DATE, country, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value 
  FROM `project.dataset.yourtable` t, 
  UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')

You can test / play with above using dummy data as in your question

#standardSQL
WITH `project.dataset.yourtable` AS (
  SELECT '20170928' DATE, 'CH' country, 3000.3 A, 121 B, 13 C, 3200 D UNION ALL
  SELECT '20170929', 'CH', 2800.31, 137, 23, 1614.31
)
SELECT DATE, country,
  metric, SAFE_CAST(value AS FLOAT64) value
FROM (
  SELECT DATE, country, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value 
  FROM `project.dataset.yourtable` t, 
  UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')

result is as expected

DATE        country metric  value    
20170928    CH      A       3000.3   
20170928    CH      B       121.0    
20170928    CH      C       13.0     
20170928    CH      D       3200.0   
20170929    CH      A       2800.31  
20170929    CH      B       137.0    
20170929    CH      C       23.0     
20170929    CH      D       1614.31

169

answered Oct 17 '22 02:10

Most answers that I managed to find required specifying the name of EVERY column to be melt. This is not tractable when I have hundreds to thousands of columns in the table. Here is an answer that works for an arbitrarily wide table.

It utilizes dynamic SQL and automatically extract multiple column names from the data schema, collate a command string and then evaluate that string. This is intended to mimic Python pandas.melt() / R reshape2::melt() behavior.

I intentionally did not create user defined functions because of some undesirable properties of UDFs. Depending on how you use this, you may or may not want to do that.

Input:

id0 id1 _2020_05_27 _2020_05_28
1   1   11          12
1   2   13          14
2   1   15          16
2   2   17          18

Output:

id0 id1 date         value
1   2   _2020_05_27  13
1   2   _2020_05_28  14
2   2   _2020_05_27  17
2   2   _2020_05_28  18
1   1   _2020_05_27  11
1   1   _2020_05_28  12
2   1   _2020_05_27  15
2   1   _2020_05_28  16

#standardSQL

-- PANDAS MELT FUNCTION IN GOOGLE BIGQUERY
-- author: Luna Huang
-- email: [email protected]

-- run this script with Google BigQuery Web UI in the Cloud Console

-- this piece of code functions like the pandas melt function
-- pandas.melt(id_vars, value_vars, var_name, value_name, col_level=None)
-- without utilizing user defined functions (UDFs)
-- see below for where to input corresponding arguments

DECLARE cmd STRING;
DECLARE subcmd STRING;
SET cmd = ("""
  WITH original AS (
    -- query to retrieve the original table
    %s
  ),
  nested AS (
    SELECT
    [
      -- sub command to be automatically generated
      %s
    ] as s,
    -- equivalent to id_vars in pandas.melt()
    %s,
    FROM original
  )
  SELECT
    -- equivalent to id_vars in pandas.melt()
    %s,
    -- equivalent to var_name in pandas.melt()
    s.key AS %s,
    -- equivalent to value_name in pandas.melt()
    s.value AS %s,
  FROM nested
  CROSS JOIN UNNEST(nested.s) AS s
""");
SET subcmd = ("""
  WITH
  columns AS (
    -- query to retrieve the column names
    -- equivalent to value_vars in pandas.melt()
    -- the resulting table should have only one column
    -- with the name: column_name
    %s
  ),
  scs AS (
    SELECT FORMAT("STRUCT('%%s' as key, %%s as value)", column_name, column_name) AS sc
    FROM columns
  )
  SELECT ARRAY_TO_STRING(ARRAY (SELECT sc FROM scs), ",\\n")
""");

-- -- -- EXAMPLE BELOW -- -- --

-- SET UP AN EXAMPLE TABLE --
CREATE OR REPLACE TABLE `tmp.example`
(
  id0 INT64,
  id1 INT64,
  _2020_05_27 INT64,
  _2020_05_28 INT64,
);
INSERT INTO `tmp.example` VALUES (1, 1, 11, 12);
INSERT INTO `tmp.example` VALUES (1, 2, 13, 14);
INSERT INTO `tmp.example` VALUES (2, 1, 15, 16);
INSERT INTO `tmp.example` VALUES (2, 2, 17, 18);

-- MELTING STARTS --
-- execute these two command to melt the table

-- the first generates the STRUCT commands
-- and saves a string in subcmd
EXECUTE IMMEDIATE FORMAT(
  -- please do not change this argument
  subcmd,
  -- query to retrieve the column names
  -- equivalent to value_vars in pandas.melt()
  -- the resulting table should have only one column
  -- with the name: column_name
  """
    SELECT column_name
    FROM `tmp.INFORMATION_SCHEMA.COLUMNS`
    WHERE (table_name = "example") AND (column_name NOT IN ("id0", "id1"))
  """
) INTO subcmd;

-- the second implements the melting
EXECUTE IMMEDIATE FORMAT(
  -- please do not change this argument
  cmd,
  -- query to retrieve the original table
  """
    SELECT *
    FROM `tmp.example`
  """,
  -- please do not change this argument
  subcmd,
  -- equivalent to id_vars in pandas.melt()
  -- !!please type these twice!!
  "id0, id1", "id0, id1",
  -- equivalent to var_name in pandas.melt()
  "date",
  -- equivalent to value_name in pandas.melt()
  "value"
);

answered Oct 17 '22 01:10

Luna Huang

Related questions
                            
                                Fastest way to copy sql table
                            
                                TSQL - Invalid Column Name RowNumber [duplicate]
                            
                                NULL vs 0 in integer fields
                            
                                SQL Replace multiple variables from another table in query result
                            
                                Converting from base64 string to varbinary(max) in SQL Server
                            
                                Return rows of a table that actually changed in an UPDATE
                            
                                Output "yes/no" instead of "t/f" for boolean data type in PostgreSQL
                            
                                Check if all three columns are either not null or null
                            
                                What is the equivalent of XML PATH and Stuff in Linq lambda expression (GROUP_CONCAT/STRING_AGG)?
                            
                                Stored procedure: reduce code duplication using temp tables
                            
                                Multiple column values into a single column as comma separated value
                            
                                Running Multiplication in T-SQL
                            
                                Performance of DELETE with NOT IN (SELECT ...)
                            
                                NullHandling.NULLS_LAST not working
                            
                                postgres syntax error at or near "ON"
                            
                                select rows and generate a new column based on condition SQL
                            
                                Postgresql - how to run a query on multiple tables with same schema
                            
                                SQL Server Bulk Insert - 0 row(s) affected
                            
                                Date and Interval Addition in SparkSQL
                            
                                How to extract selected values from json string in Hive

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Reshape from wide to long in big query (standard SQL)

Tags:

sql

google-bigquery

AlienDeg

People also ask

3 Answers

Mikhail Berlyant

Radim Bača

Luna Huang

Recent Activity

Donate For Us