Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Crosstab with a large or undefined number of categories

My real problem has to do with recording which of a very large number of anti-virus products agree that a given sample is a member of a given anti-virus family. The database has millions of samples, with tens of anti-virus products voting on each sample. I want to ask a query like "For the malware containing the name 'XYZ' which sample had the most votes, and which vendors voted for it?" and get results like:

"BadBadVirus"  
                     V1  V2  V3  V4  V5  V6  V7  
Sample 1 - 4 votes    1   0   1   0   0   1   1      
Sample 2 - 5 votes    1   0   1   0   1   1   1   
Sample 3 - 5 votes    1   0   1   0   1   1   1  

 total     14         3       3       2   3   3  

Which might be used to tell me that Vendor 2 and Vendor 4 either don't know how to detect this malware, or that they name it something different.


I'm going to try to generalize my question slightly while hopefully not breaking your ability to help me. Assume that I have five voters (Alex, Bob, Carol, Dave, Ed) who have been asked to look at five photographs (P1, P2, P3, P4, P5) and decide what the "main subject" of the photograph is. For our example, we'll just assume they were limited to "Cat", "Dog", or "Horse". Not every voter votes on every thing.

The data is in the database in this form:

Photo, Voter, Decision
(1, 'Alex', 'Cat')
(1, 'Bob', 'Dog')
(1, 'Carol', 'Cat')
(1, 'Dave', 'Cat')
(1, 'Ed', 'Cat')
(2, 'Alex', 'Cat')
(2, 'Bob', 'Dog')
(2, 'Carol', 'Cat')
(2, 'Dave', 'Cat')
(2, 'Ed', 'Dog')
(3, 'Alex', 'Horse')
(3, 'Bob', 'Horse')
(3, 'Carol', 'Dog')
(3, 'Dave', 'Horse')
(3, 'Ed', 'Horse')
(4, 'Alex', 'Horse')
(4, 'Bob', 'Horse')
(4, 'Carol', 'Cat')
(4, 'Dave', 'Horse')
(4, 'Ed', 'Horse')
(5, 'Alex', 'Dog')
(5, 'Bob', 'Cat')
(5, 'Carol', 'Cat')
(5, 'Dave', 'Cat')
(5, 'Ed', 'Cat')

The objective is that given a photo topic we are looking for, we'd like to know how many voters thought that WAS the main point of that photo, but also list WHICH VOTERS thought that.

Query for: "Cat"
      Total  Alex  Bob Carol Dave Ed
1 -     4      1    0    1     1   1
2 -     3      1    0    1     1   0 
3 -     0      0    0    0     0   0 
4 -     1      0    0    1     0   0 
5 -     4      0    1    1     1   1
------------------------------------
total  12      2    1    4     3   2 

Query for: "Dog"
      Total  Alex  Bob Carol Dave Ed
1 -     1     0      1   0    0    0
2 -     2     0      1   0    0    1
3 -     1     0      0   1    0    0 
4 -     0     0      0   0    0    0 
5 -     1     1      0   0    0    0 
------------------------------------
total   5     1      2   1    0    1 

Is that something I can do with the data in the format that I have it stored?

I'm having difficulty getting a query that does that - although it's simple enough to dump the data out and then write a program to do that, I'd really like to be able to do it IN THE DATABASE if I can.

Thanks for any suggestions.

like image 606
user1761471 Avatar asked Oct 20 '12 12:10

user1761471


People also ask

What is Crosstab example?

Cross tabulation helps you understand how two different variables are related to each other. For example, suppose you wanted to see if there is a relationship between the gender of the survey responder and if sex education in high school is important.

What is a crosstab?

Cross-tabulation (also cross-tabulation or crosstab) is one of the most useful analytical tools and a mainstay of the market research industry. Cross-tabulation analysis, also known as contingency table analysis, is most often used to analyze categorical (nominal measurement scale) data.

What is crosstab in Excel?

Crosstab stands for Cross tabulation, a process by which totals and other calculations are performed based on common values found in a set of data. In Microsoft Excel™ the term "Pivot Table" is used for a Crosstab.


1 Answers

create table vote (Photo integer, Voter text, Decision text);
insert into vote values
(1, 'Alex', 'Cat'),
(1, 'Bob', 'Dog'),
(1, 'Carol', 'Cat'),
(1, 'Dave', 'Cat'),
(1, 'Ed', 'Cat'),
(2, 'Alex', 'Cat'),
(2, 'Bob', 'Dog'),
(2, 'Carol', 'Cat'),
(2, 'Dave', 'Cat'),
(2, 'Ed', 'Dog'),
(3, 'Alex', 'Horse'),
(3, 'Bob', 'Horse'),
(3, 'Carol', 'Dog'),
(3, 'Dave', 'Horse'),
(3, 'Ed', 'Horse'),
(4, 'Alex', 'Horse'),
(4, 'Bob', 'Horse'),
(4, 'Carol', 'Cat'),
(4, 'Dave', 'Horse'),
(4, 'Ed', 'Horse'),
(5, 'Alex', 'Dog'),
(5, 'Bob', 'Cat'),
(5, 'Carol', 'Cat'),
(5, 'Dave', 'Cat'),
(5, 'Ed', 'Cat')
;

The query for the cat:

select photo,
    alex + bob + carol + dave + ed as Total,
    alex, bob, carol, dave, ed
from crosstab($$
    select
        photo, voter,
        case decision when 'Cat' then 1 else 0 end
    from vote
    order by photo
    $$,'
    select distinct voter
    from vote
    order by voter
    '
) as (
    photo integer,
    Alex integer,
    Bob integer,
    Carol integer,
    Dave integer,
    Ed integer
);
 photo | total | alex | bob | carol | dave | ed 
-------+-------+------+-----+-------+------+----
     1 |     4 |    1 |   0 |     1 |    1 |  1
     2 |     3 |    1 |   0 |     1 |    1 |  0
     3 |     0 |    0 |   0 |     0 |    0 |  0
     4 |     1 |    0 |   0 |     1 |    0 |  0
     5 |     4 |    0 |   1 |     1 |    1 |  1

If the number of voters is big or not known then it can be done dynamically:

do $do$
declare
voter_list text;
r record;
begin

drop table if exists pivot;

voter_list := (
    select string_agg(distinct voter, ' ' order by voter) from vote
    );

execute(format('
    create table pivot (
        decision text,
        photo integer,
        Total integer,
        %1$s
    )', (replace(voter_list, ' ', ' integer, ') || ' integer')
));

for r in
select distinct decision from vote
loop
    execute (format($f$
        insert into pivot
        select
            %3$L as decision,
            photo,
            %1$s as Total,
            %2$s
        from crosstab($ct$
            select
                photo, voter,
                case decision when %3$L then 1 else 0 end
            from vote
            order by photo
            $ct$,$ct$
            select distinct voter
            from vote
            order by voter
            $ct$
        ) as (
            photo integer,
            %4$s
        );$f$,
        replace(voter_list, ' ', ' + '),
        replace(voter_list, ' ', ', '),
        r.decision,
        replace(voter_list, ' ', ' integer, ') || ' integer'
    ));
end loop;
end; $do$;

The above code created the table pivot with all the decisions:

select * from pivot where decision = 'Cat';
like image 73
Clodoaldo Neto Avatar answered Sep 22 '22 00:09

Clodoaldo Neto