I am working on a java-oracle based project where I stuck with an problem which seems to me requires an analytic solution. I am looking for solution either based on SQL query or any algorithm or any free analytic tool which I can follow to get desired results.
Problem statement: Lets us say I have below table with columnA-D and last column as Score, I want to find an criteria on values for each of the columns which when combined in SQL where clause will always give me positive value for Score column. So basically what combination of columnA-D will always give me positive score?
columnA|columnB|columnC|columnD|Score
1 40 10 3 -20
0 40 2 3 10
0 10 3 3 20
1 15 3 3 -5
0 10 2 2 -15
0 15 6 3 -10
Expected result for above data set:- Visual interpretation of above data set gives me condition: “ColumnA =0 and columnB >10 and columnC <5 will ensure score always >0”. (visually its clear columnD does not have an effect).
Please note above data set is for sake of simplicity. In reality, my project contains around 40 columns with almost 2500 rows. One thing is for sure each of columns have finite range of values.
Here is an algorithm I started with (need inputs to refine it further if someone thinks I am in right direction):
Preparation: Create an list of all possible expressions like A=0, B>10,C<5 (for 40 columns, I finalized total approx 150 expressions)
Let us call it "expressions" variable.
Alogrithm for 1st run:
set totalPositiveRows= select count(*) from my tables where score>0;
set totalNegativeRows= select count(*) from my tables where score<0;
For each expr in expressions, calculate following three variables set positivePercentage= find percentage of totalPositiveRows which satisfy this expr; //like if 60 rows out of total 100 rows having score>0 satisfy expr , then positivePercentage=60%
set negativePercentage= find percentage of totalNegativeRows which satisfy this expr; //like if 40 rows out of total 100 rows having score<0 satisfy expr , then negativePercentage=40%
set diffPercentage=positivePercentage-negativePercentage;
Set initialexpr=Choose expr having maximum value of diffPercentage set initalPositivePercentage=choose corresponding positivePercentage value; set initalNegativePercentage=choose corresponding negativePercentage value; My thinking is that I need to now keep expanding initalexpr until initalNegativePercentage becomes 0.
Alogrithm for subsequent runs until initalNegativePercentage becomes 0:-
For each expr in expressions, calculate following three variables
set newexpr=initialexpr+" and "+expr;
set positivePercentage= find percentage of totalPositiveRows which satisfy newexpr;
set negativePercentage= find percentage of totalNegativeRows which satisfy newexpr;
//calculate how much negative percentage it has reduced? set positiveReduction=initalPositivePercentage-positivePercentage; set negativeReduction=initalNegativePercentage-negativePercentage; if(negativeReduction>=positiveReduction) //note it down else //discard it
Choose the expr which gives maxium negative reduction, that becomes new inital expr. Set initialexpr=Choose expr having maximum value of negativeReduction set initalPositivePercentage=choose corresponding value; set initalNegativePercentage=choose corresponding value;
Repeat the algorithm above.
Please comment.
You can list a table's columns with the mysqlshow db_name tbl_name command. The DESCRIBE statement provides information similar to SHOW COLUMNS .
You can use an asterisk character, *, to retrieve all the columns. In queries where all the data is found in one table, the FROM clause is where we specify the name of the table from which to retrieve rows.
By far the simplest and most straightforward method for ensuring a particular column's result set doesn't contain NULL values is to use the IS NOT NULL comparison operator.
Every movie in the movie has a genre (action, comedy, adventure), but some movies have not yet been classified by genre. These records have NULL values in the genres column. In the genres table, there is a matching ID for NULL. So, the movies table has the following columns: movieid , title , and genres .
Below is a brute-force solution. This might also be a good question for the theoretical computer science site. I think this is an NP-complete problem similar to Boolean satisfiability, but that's just a wild guess. There may be a smarter way to solve this but I don't think I'll find it.
The basic idea is to cross join every expression with every distinct value for a column, and then cross join all the columns. The table will be queried with every expression list, and a count generated for positive and negative scores. If the expression returns the expected number of positive scores and none of the negative scores it is valid.
This assumes that you are only using the expressions >
, <
, and =
. Every new column or expression will make this problem exponentially slower.
Test schema
drop table table1;
create table table1(a number, b number, c number, d number, score number);
insert into table1
select 1, 40, 10, 3, -20 from dual union all
select 0, 40, 2, 3, 10 from dual union all
select 0, 10, 3, 3, 20 from dual union all
select 1, 15, 3, 3, -5 from dual union all
select 0, 10, 2, 2, -15 from dual union all
select 0, 15, 6, 3, -10 from dual;
Wall of code
declare
v_inline_view varchar2(32767);
v_inline_views clob;
v_inline_view_counter number := 0;
v_and_expression varchar2(32767);
v_query clob;
v_sqls sys.odcivarchar2list;
v_dynamic_query_counter number := 0;
begin
--#1: Create inline views.
--One for every combination of expression and distinct value, per column.
for inline_views in
(
--Inline view for every possible expression for each column.
select
replace(q'[
(
select *
from
(
--Possible expressions.
select distinct
case
when operator is null then null
else ' AND %%COLUMN%% '||operator||' '||%%COLUMN%%
end %%COLUMN%%_expression
from
--All operators.
(
select '>' operator from dual union all
select '<' operator from dual union all
select '=' operator from dual union all
select null operator from dual
)
--All distinct values.
cross join
(
select distinct %%COLUMN%% from table1
)
)
--where %%COLUMN%%_expression is null or %%COLUMN%%_expression %%EXPRESSION_PERFORMANCE_EXCLUSIONS%%
)
]', '%%COLUMN%%', column_name) inline_view
from user_tab_columns
where table_name = 'TABLE1'
and column_name <> 'SCORE'
order by column_name
) loop
--Assign to temorary so it can be modified.
v_inline_view := inline_views.inline_view;
--#1A: Optimize inline view - throw out expressions if they don't return any positive results.
declare
v_expressions sys.odcivarchar2list;
v_expressions_to_ignore varchar2(32767);
v_has_score_gt_0 number;
begin
--Gather expressions for one column.
execute immediate v_inline_view bulk collect into v_expressions;
--Loop through and test each expression.
for i in 1 .. v_expressions.count loop
--Always keep the null expression.
if v_expressions(i) is not null then
--Count the number of rows with a positive score.
execute immediate 'select nvl(max(case when score > 0 then 1 else 0 end), 0) from table1 where '||replace(v_expressions(i), ' AND ', null)
into v_has_score_gt_0;
--If the expression returns nothing positive then add it to exclusion.
if v_has_score_gt_0 = 0 then
v_expressions_to_ignore := v_expressions_to_ignore||','''||v_expressions(i)||'''';
end if;
end if;
end loop;
--Convert it into an IN clause.
if v_expressions_to_ignore is not null then
--Remove comment, replace placeholder with expression exclusions.
v_inline_view := regexp_replace(v_inline_view, '(.*)(--where)( .* )(%%EXPRESSION_PERFORMANCE_EXCLUSIONS%%)(.*)', '\1where\3 not in ('||substr(v_expressions_to_ignore, 2)||')');
end if;
end;
--Aggregate and count inline views.
if v_inline_view_counter <> 0 then
v_inline_views := v_inline_views||'cross join';
end if;
v_inline_views := v_inline_views||v_inline_view;
v_inline_view_counter := v_inline_view_counter + 1;
end loop;
--#2: Create an AND expression to combine all column expressions.
select listagg(column_name||'_expression', '||') within group (order by column_name)
into v_and_expression
from user_tab_columns
where table_name = 'TABLE1'
and column_name <> 'SCORE';
--#3: Create a that will create all possible expression combinations.
v_query :=
replace(replace(q'[
--8281 combinations
select '
select
'''||expressions||''' expression,
nvl(sum(case when score > 0 then 1 else 0 end), 0) gt_0_score_count,
nvl(sum(case when score <= 0 then 1 else 0 end), 0) le_0_score_count
from table1
where 1=1 '||expressions v_sql
from
(
--Combine expressions
select %%AND_EXPRESSION%% expressions
from
%%INLINE_VIEWS%%
) combined_expressions
]', '%%AND_EXPRESSION%%', v_and_expression), '%%INLINE_VIEWS%%', v_inline_views);
--TEST: It might be useful to see the full query here.
--dbms_output.put_line(v_query);
--#4: Gather expressions.
--With larger input you'll want to use a LIMIT
execute immediate v_query
bulk collect into v_sqls;
--#5: Test each expression.
--Look for any queries that return the right number of rows.
for i in 1 .. v_sqls.count loop
declare
v_expression varchar2(4000);
v_gt_0_score_count number;
v_le_0_score_count number;
begin
execute immediate v_sqls(i) into v_expression, v_gt_0_score_count, v_le_0_score_count;
v_dynamic_query_counter := v_dynamic_query_counter + 1;
--TODO: Dynamically generate "2".
if v_gt_0_score_count = 2 and v_le_0_score_count = 0 then
dbms_output.put_line('Expression: '||v_expression);
end if;
exception when others then
dbms_output.put_line('Problem with: '||v_sqls(i));
end;
end loop;
dbms_output.put_line('Queries executed: '||v_dynamic_query_counter);
end;
/
Results
The results appear correct. They are slightly different than yours because "columnB > 10" is incorrect.
Expression: AND A = 0 AND C < 6 AND D = 3
Expression: AND A = 0 AND C < 6 AND D > 2
Expression: AND A < 1 AND C < 6 AND D = 3
Expression: AND A < 1 AND C < 6 AND D > 2
Queries executed: 441
Problems
This brute-force approach is extremely inefficient in at least two ways. Even for this simple example it requires 6370 queries.. And the results may include duplicates that are non-trivial to reduce. Or perhaps you'll get lucky and there are so few solutions that you can eyeball them.
There are a few things you can do to improve query performance. The easiest one would be to check every condition individually and throw it out if it does not "gain" anything for the counts.
Optimizations
Individual expressions that do not return any positive results are excluded. With the sample data, this reduces the number of query executions from 6370 to 441.
Running the process in parallel may also improve the performance by an order of magnitude. It would probably require a parallel pipelined function.
But even a 100x performance improvement may not help with an NP-complete problem. You may need to find some additional "short cuts" based on your sample data.
It may help to print out the query that generates the test queries, by un-commenting one of the dbms_output.put_line
statements. Add a count(*)
to see how many queries will be executed and run with a smaller data set to make an estimate for how long it will take.
If the estimate is a billion years, and you can't think of any tricks to make the brute-force method work faster, it may be time to ask this question on https://cstheory.stackexchange.com/
The idea of the solution is that the columns are independent. So it can be solved column by column.
So you can imagine that you search and build something in multidimensional space. Each column represents a dimension, having values from -inf
to +inf
. And build the solution dimension by dimension.
A=1 => false, A=0 => true.
If you want to join dimension intervals on SQL level you can use LAG function. By using partitioning and windowing you order rows by one column. Then you compute a value true/false in a cooked column. And in the next cooked column by using the LAG function you detect whether the true/false flag did change from the previous row.
create table test
(
b number,
s number
);
insert into test values(10, -20);
insert into test values(50, 10);
insert into test values(15, 20);
insert into test values(18, 5);
select u.*,
case when LAG (b, 1, null) OVER (ORDER BY b) = b then 'Y' else 'N' end same_b_value,
LAG (score_flag, 1, null) OVER (ORDER BY b) AS score_flag_prev,
case when LAG (score_flag, 1, null) OVER (ORDER BY b) <> score_flag then 'Y' else 'N' end score_flag_changed
from
(
select t.*,
case when t.s >= 0 then 'Y' else 'N' end as score_flag
from test t
) u
order by b asc;
This query will show that value B=15
is significant because it is where the score_flag changes.
I'm not sure about value B=10
in the question. As this one is linked to both positive and negative score values. Should it be included or excluded then?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With