Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm or SQL : to find where conditions for a set of columns which ensures result set has value in a particular column always > 0

I am working on a java-oracle based project where I stuck with an problem which seems to me requires an analytic solution. I am looking for solution either based on SQL query or any algorithm or any free analytic tool which I can follow to get desired results.

Problem statement: Lets us say I have below table with columnA-D and last column as Score, I want to find an criteria on values for each of the columns which when combined in SQL where clause will always give me positive value for Score column. So basically what combination of columnA-D will always give me positive score?

columnA|columnB|columnC|columnD|Score
  1      40      10       3     -20
  0      40      2        3      10
  0      10      3        3      20
  1      15      3        3     -5
  0      10      2        2     -15
  0      15      6        3     -10

Expected result for above data set:- Visual interpretation of above data set gives me condition: “ColumnA =0 and columnB >10 and columnC <5 will ensure score always >0”. (visually its clear columnD does not have an effect).

Please note above data set is for sake of simplicity. In reality, my project contains around 40 columns with almost 2500 rows. One thing is for sure each of columns have finite range of values.


Following information copied from OPs answer below

Here is an algorithm I started with (need inputs to refine it further if someone thinks I am in right direction):

Preparation: Create an list of all possible expressions like A=0, B>10,C<5 (for 40 columns, I finalized total approx 150 expressions)

Let us call it "expressions" variable.

Alogrithm for 1st run:

  1. set totalPositiveRows= select count(*) from my tables where score>0;

    set totalNegativeRows= select count(*) from my tables where score<0;

  2. For each expr in expressions, calculate following three variables set positivePercentage= find percentage of totalPositiveRows which satisfy this expr; //like if 60 rows out of total 100 rows having score>0 satisfy expr , then positivePercentage=60%

    set negativePercentage= find percentage of totalNegativeRows which satisfy this expr; //like if 40 rows out of total 100 rows  having score<0 satisfy expr , then negativePercentage=40%
    
    set diffPercentage=positivePercentage-negativePercentage;
    
  3. Set initialexpr=Choose expr having maximum value of diffPercentage set initalPositivePercentage=choose corresponding positivePercentage value; set initalNegativePercentage=choose corresponding negativePercentage value; My thinking is that I need to now keep expanding initalexpr until initalNegativePercentage becomes 0.

Alogrithm for subsequent runs until initalNegativePercentage becomes 0:-

  1. For each expr in expressions, calculate following three variables
    set newexpr=initialexpr+" and "+expr; set positivePercentage= find percentage of totalPositiveRows which satisfy newexpr; set negativePercentage= find percentage of totalNegativeRows which satisfy newexpr;

    //calculate how much negative percentage it has reduced? set positiveReduction=initalPositivePercentage-positivePercentage; set negativeReduction=initalNegativePercentage-negativePercentage; if(negativeReduction>=positiveReduction) //note it down else //discard it

  2. Choose the expr which gives maxium negative reduction, that becomes new inital expr. Set initialexpr=Choose expr having maximum value of negativeReduction set initalPositivePercentage=choose corresponding value; set initalNegativePercentage=choose corresponding value;

  3. Repeat the algorithm above.

Please comment.

like image 923
ag112 Avatar asked Jan 29 '15 06:01

ag112


People also ask

Which command is used to see information like columns data type size?

You can list a table's columns with the mysqlshow db_name tbl_name command. The DESCRIBE statement provides information similar to SHOW COLUMNS .

How do you retrieve all the columns from a table?

You can use an asterisk character, *, to retrieve all the columns. In queries where all the data is found in one table, the FROM clause is where we specify the name of the table from which to retrieve rows.

Which method throws NULL data is not present in the database?

By far the simplest and most straightforward method for ensuring a particular column's result set doesn't contain NULL values is to use the IS NOT NULL comparison operator.

Which columns in the movie table have NULL values?

Every movie in the movie has a genre (action, comedy, adventure), but some movies have not yet been classified by genre. These records have NULL values in the genres column. In the genres table, there is a matching ID for NULL. So, the movies table has the following columns: movieid , title , and genres .


2 Answers

Below is a brute-force solution. This might also be a good question for the theoretical computer science site. I think this is an NP-complete problem similar to Boolean satisfiability, but that's just a wild guess. There may be a smarter way to solve this but I don't think I'll find it.

The basic idea is to cross join every expression with every distinct value for a column, and then cross join all the columns. The table will be queried with every expression list, and a count generated for positive and negative scores. If the expression returns the expected number of positive scores and none of the negative scores it is valid.

This assumes that you are only using the expressions >, <, and =. Every new column or expression will make this problem exponentially slower.

Test schema

drop table table1;
create table table1(a number, b number, c number, d number, score number);

insert into table1
select  1,      40,      10,       3,     -20 from dual union all
select  0,      40,      2,        3,      10 from dual union all
select  0,      10,      3,        3,      20 from dual union all
select  1,      15,      3,        3,     -5  from dual union all
select  0,      10,      2,        2,     -15 from dual union all
select  0,      15,      6,        3,     -10 from dual;

Wall of code

declare
    v_inline_view varchar2(32767);
    v_inline_views clob;
    v_inline_view_counter number := 0;
    v_and_expression varchar2(32767);
    v_query clob;

    v_sqls sys.odcivarchar2list;
    v_dynamic_query_counter number := 0;
begin
    --#1: Create inline views.
    --One for every combination of expression and distinct value, per column.
    for inline_views in
    (
        --Inline view for every possible expression for each column.
        select
            replace(q'[
            (
                select *
                from
                (
                    --Possible expressions.
                    select distinct
                        case
                            when operator is null then null
                            else ' AND %%COLUMN%% '||operator||' '||%%COLUMN%%
                        end %%COLUMN%%_expression
                    from
                    --All operators.
                    (
                        select '>'  operator from dual union all
                        select '<'  operator from dual union all
                        select '='  operator from dual union all
                        select null operator from dual
                    )
                    --All distinct values.
                    cross join
                    (
                        select distinct %%COLUMN%% from table1
                    )
                )
                --where %%COLUMN%%_expression is null or %%COLUMN%%_expression %%EXPRESSION_PERFORMANCE_EXCLUSIONS%%
            )
            ]', '%%COLUMN%%', column_name) inline_view
        from user_tab_columns
        where table_name = 'TABLE1'
            and column_name <> 'SCORE'
        order by column_name
    ) loop
        --Assign to temorary so it can be modified.
        v_inline_view := inline_views.inline_view;

        --#1A: Optimize inline view - throw out expressions if they don't return any positive results.
        declare
            v_expressions sys.odcivarchar2list;
            v_expressions_to_ignore varchar2(32767);
            v_has_score_gt_0 number;
        begin
            --Gather expressions for one column.
            execute immediate v_inline_view bulk collect into v_expressions;

            --Loop through and test each expression.
            for i in 1 .. v_expressions.count loop
                --Always keep the null expression.
                if v_expressions(i) is not null then
                    --Count the number of rows with a positive score.
                    execute immediate 'select nvl(max(case when score > 0 then 1 else 0 end), 0) from table1 where '||replace(v_expressions(i), ' AND ', null)
                    into v_has_score_gt_0;

                    --If the expression returns nothing positive then add it to exclusion.
                    if v_has_score_gt_0 = 0 then
                        v_expressions_to_ignore := v_expressions_to_ignore||','''||v_expressions(i)||'''';
                    end if;
                end if;
            end loop;

            --Convert it into an IN clause.
            if v_expressions_to_ignore is not null then
                --Remove comment, replace placeholder with expression exclusions.
                v_inline_view := regexp_replace(v_inline_view, '(.*)(--where)( .* )(%%EXPRESSION_PERFORMANCE_EXCLUSIONS%%)(.*)', '\1where\3 not in ('||substr(v_expressions_to_ignore, 2)||')');
            end if;
        end;

        --Aggregate and count inline views.
        if v_inline_view_counter <> 0 then
            v_inline_views := v_inline_views||'cross join';
        end if;

        v_inline_views := v_inline_views||v_inline_view;

        v_inline_view_counter := v_inline_view_counter + 1;
    end loop;

    --#2: Create an AND expression to combine all column expressions.
    select listagg(column_name||'_expression', '||') within group (order by column_name)
    into v_and_expression
    from user_tab_columns
    where table_name = 'TABLE1'
        and column_name <> 'SCORE';


    --#3: Create a that will create all possible expression combinations.
    v_query := 
    replace(replace(q'[
        --8281 combinations
        select '
            select
                '''||expressions||''' expression,
                nvl(sum(case when score > 0 then 1 else 0 end), 0) gt_0_score_count,
                nvl(sum(case when score <= 0 then 1 else 0 end), 0) le_0_score_count
            from table1
            where 1=1 '||expressions v_sql
        from
        (
            --Combine expressions
            select %%AND_EXPRESSION%% expressions
            from
            %%INLINE_VIEWS%%
        ) combined_expressions
    ]', '%%AND_EXPRESSION%%', v_and_expression), '%%INLINE_VIEWS%%', v_inline_views);

    --TEST: It might be useful to see the full query here.
    --dbms_output.put_line(v_query);

    --#4: Gather expressions.
    --With larger input you'll want to use a LIMIT
    execute immediate v_query
    bulk collect into v_sqls;

    --#5: Test each expression.  
    --Look for any queries that return the right number of rows.
    for i in 1 .. v_sqls.count loop
        declare
            v_expression varchar2(4000);
            v_gt_0_score_count number;
            v_le_0_score_count number;
        begin
            execute immediate v_sqls(i) into v_expression, v_gt_0_score_count, v_le_0_score_count;
            v_dynamic_query_counter := v_dynamic_query_counter + 1;

            --TODO: Dynamically generate "2".
            if v_gt_0_score_count = 2 and v_le_0_score_count = 0 then
                dbms_output.put_line('Expression: '||v_expression);
            end if;

        exception when others then
            dbms_output.put_line('Problem with: '||v_sqls(i));
        end;
    end loop;

    dbms_output.put_line('Queries executed: '||v_dynamic_query_counter);

end;
/

Results

The results appear correct. They are slightly different than yours because "columnB > 10" is incorrect.

Expression:  AND A = 0 AND C < 6 AND D = 3
Expression:  AND A = 0 AND C < 6 AND D > 2
Expression:  AND A < 1 AND C < 6 AND D = 3
Expression:  AND A < 1 AND C < 6 AND D > 2
Queries executed: 441

Problems

This brute-force approach is extremely inefficient in at least two ways. Even for this simple example it requires 6370 queries.. And the results may include duplicates that are non-trivial to reduce. Or perhaps you'll get lucky and there are so few solutions that you can eyeball them.

There are a few things you can do to improve query performance. The easiest one would be to check every condition individually and throw it out if it does not "gain" anything for the counts.

Optimizations

Individual expressions that do not return any positive results are excluded. With the sample data, this reduces the number of query executions from 6370 to 441.

Running the process in parallel may also improve the performance by an order of magnitude. It would probably require a parallel pipelined function.

But even a 100x performance improvement may not help with an NP-complete problem. You may need to find some additional "short cuts" based on your sample data.

It may help to print out the query that generates the test queries, by un-commenting one of the dbms_output.put_line statements. Add a count(*) to see how many queries will be executed and run with a smaller data set to make an estimate for how long it will take.

If the estimate is a billion years, and you can't think of any tricks to make the brute-force method work faster, it may be time to ask this question on https://cstheory.stackexchange.com/

like image 65
Jon Heller Avatar answered Nov 02 '22 23:11

Jon Heller


The idea of the solution is that the columns are independent. So it can be solved column by column.

So you can imagine that you search and build something in multidimensional space. Each column represents a dimension, having values from -inf to +inf. And build the solution dimension by dimension.

  • For the 1st column the solution is: A=1 => false, A=0 => true.
  • Then you add 2nd dimension B. You have 5 values, so dimension on column B be is divided into 6 intervals. Some of the consecutive intervals can be joined. For example <10, 50> and <50,inf> do both imply true.
  • And then you add 3rd dimension.
  • ...

If you want to join dimension intervals on SQL level you can use LAG function. By using partitioning and windowing you order rows by one column. Then you compute a value true/false in a cooked column. And in the next cooked column by using the LAG function you detect whether the true/false flag did change from the previous row.

create table test
(
  b number,
  s number
);

insert into test values(10, -20);
insert into test values(50,  10);
insert into test values(15,  20);
insert into test values(18,  5);

select u.*,
 case when LAG (b, 1, null) OVER (ORDER BY b) = b then 'Y' else 'N' end same_b_value,
 LAG (score_flag, 1, null) OVER (ORDER BY b) AS score_flag_prev,
 case when LAG (score_flag, 1, null) OVER (ORDER BY b) <> score_flag then 'Y' else 'N' end score_flag_changed
from
(
  select t.*,
  case when t.s >= 0 then 'Y' else 'N' end as score_flag
  from test t
)  u
order by b asc;

This query will show that value B=15 is significant because it is where the score_flag changes.

I'm not sure about value B=10 in the question. As this one is linked to both positive and negative score values. Should it be included or excluded then?

like image 2
ibre5041 Avatar answered Nov 03 '22 00:11

ibre5041