Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pivot on Multiple Columns using Tablefunc

Has anyone used tablefunc to pivot on multiple variables as opposed to only using row name? The documentation notes:

The "extra" columns are expected to be the same for all rows with the same row_name value.

I'm not sure how to do this without combining the columns that I want to pivot on (which I highly doubt will give me the speed I need). One possible way to do this would be to make the entity numeric and add it to the localt as milliseconds, but this seems like a shaky way to proceed.

I've edited the data used in a response to this question: PostgreSQL Crosstab Query.

 CREATE TEMP TABLE t4 (
  timeof   timestamp
 ,entity    character
 ,status    integer
 ,ct        integer);

 INSERT INTO t4 VALUES 
  ('2012-01-01', 'a', 1, 1)
 ,('2012-01-01', 'a', 0, 2)
 ,('2012-01-02', 'b', 1, 3)
 ,('2012-01-02', 'c', 0, 4);

 SELECT * FROM crosstab(
     'SELECT timeof, entity, status, ct
      FROM   t4
      ORDER  BY 1,2,3'
     ,$$VALUES (1::text), (0::text)$$)
 AS ct ("Section" timestamp, "Attribute" character, "1" int, "0" int);

Returns:

 Section                   | Attribute | 1 | 0
---------------------------+-----------+---+---
 2012-01-01 00:00:00       |     a     | 1 | 2
 2012-01-02 00:00:00       |     b     | 3 | 4

So as the documentation states, the extra column aka 'Attribute' is assumed to be the same for each row name aka 'Section'. Thus, it reports b for the second row even though 'entity' also has a 'c' value for that 'timeof' value.

Desired Output:

Section                   | Attribute | 1 | 0
--------------------------+-----------+---+---
2012-01-01 00:00:00       |     a     | 1 | 2
2012-01-02 00:00:00       |     b     | 3 |  
2012-01-02 00:00:00       |     c     |   | 4

Any thoughts or references?

A little more background: I potentially need to do this for billions of rows and I'm testing out storing this data in long and wide formats and seeing if I can use tablefunc to go from long to wide format more efficiently than with regular aggregate functions.
I'll have about 100 measurements made every minute for around 300 entities. Often, we will need to compare the different measurements made for a given second for a given entity, so we will need to go to wide format very often. Also, the measurements made on a particular entity are highly variable.

EDIT: I found a resource on this: http://www.postgresonline.com/journal/categories/24-tablefunc.

like image 852
ideamotor Avatar asked Mar 14 '13 16:03

ideamotor


People also ask

Can you pivot on multiple columns?

To have multiple columns: Click in one of the cells of your pivot table. Click your right mouse button and select Pivot table Options in the context menu, this will open a form with tabs. Click on the tab Display and tag the check box Classic Pivot table layout.

How do you make a pivot chart in Excel with multiple columns?

Add an Additional Row or Column FieldClick any cell in the PivotTable. The PivotTable Fields pane appears. You can also turn on the PivotTable Fields pane by clicking the Field List button on the Analyze tab. Click and drag a field to the Rows or Columns area.

Can we pivot on multiple columns in SQL Server?

You gotta change the name of columns for next Pivot Statement. You can use aggregate of pv3 to sum and group by the column you need. The key point here is that you create new category values by appending 1 or 2 to the end. Without doing this, the pivot query won't work properly.


1 Answers

The problem with your query is that b and c share the same timestamp 2012-01-02 00:00:00, and you have the timestamp column timeof first in your query, so - even though you added bold emphasis - b and c are just extra columns that fall in the same group 2012-01-02 00:00:00. Only the first (b) is returned since (quoting the manual):

The row_name column must be first. The category and value columns must be the last two columns, in that order. Any columns between row_name and category are treated as "extra". The "extra" columns are expected to be the same for all rows with the same row_name value.

Bold emphasis mine.
Just revert the order of the first two columns to make entity the row name and it works as desired:

SELECT * FROM crosstab(
      'SELECT entity, timeof, status, ct
       FROM   t4
       ORDER  BY 1'
      ,'VALUES (1), (0)')
 AS ct (
    "Attribute" character
   ,"Section" timestamp
   ,"status_1" int
   ,"status_0" int);

entity must be unique, of course.

Reiterate

  • row_name first
  • (optional) extra columns next
  • category (as defined by the second parameter) and value last.

Extra columns are filled from the first row from each row_name partition. Values from other rows are ignored, there is only one column per row_name to fill. Typically those would be the same for every row of one row_name, but that's up to you.

For the different setup in your answer:

SELECT localt, entity
     , msrmnt01, msrmnt02, msrmnt03, msrmnt04, msrmnt05  -- , more?
FROM   crosstab(
        'SELECT dense_rank() OVER (ORDER BY localt, entity)::int AS row_name
              , localt, entity -- additional columns
              , msrmnt, val
         FROM   test
         -- WHERE  ???   -- instead of LIMIT at the end
         ORDER  BY localt, entity, msrmnt
         -- LIMIT ???'   -- instead of LIMIT at the end
     , $$SELECT generate_series(1,5)$$)  -- more?
     AS ct (row_name int, localt timestamp, entity int
          , msrmnt01 float8, msrmnt02 float8, msrmnt03 float8, msrmnt04 float8, msrmnt05 float8 -- , more?
            )
LIMIT 1000  -- ??!!

No wonder the queries in your test perform terribly. Your test setup has 14M rows and you process all of them before throwing most of it away with LIMIT 1000. For a reduced result set add WHERE conditions or a LIMIT to the source query!

Plus, the array you work with is needlessly expensive on top of it. I generate a surrogate row name with dense_rank() instead.

db<>fiddle here - with a simpler test setup and fewer rows.

like image 108
Erwin Brandstetter Avatar answered Oct 19 '22 13:10

Erwin Brandstetter