Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Correlating varchar values

Is there a built-in way in Oracle 11 to check correlation of values in a varchar2 field? For example, given a simple table such as this:

MEAL_NUM  INGREDIENT
--------------------
1         BEEF
1         CHEESE
1         PASTA
2         CHEESE
2         PASTA
2         FISH
3         CHEESE
3         CHICKEN

I want to get a numerical indication that based on MEAL_NUM, CHEESE is paired mostly with PASTA and to lessening degrees with BEEF, CHICKEN, and FISH.

My first inclination is to use the CORR function and transform the strings into a number perhaps by either enumerating them beforehand or grabbing the rownum from a unique select.

Any suggestions how to go about this?

like image 946
owook Avatar asked Jul 26 '11 18:07

owook


3 Answers

You won't want to use CORR -- if you create a "food number" and assign Beef = 1, Chicken = 2, and Pasta = 3, then a correlation coefficient will tell you whether increased cheese correlates with increased "food number." But the "food number" being higher or lower doesn't mean anything since you made it up. So, don't use CORR unless your foods are actually ordered in some way, like numbers are.

The way statisticians talk about this is with levels of measurement. In the language of the linked article, MEAL_NUM is a nominal measure -- or maybe an ordinal measure if the meals happened in order, but either way, it's a really bad idea to use correlation coefficients on it.

You'll probably instead want to find something like "what percentage of Beef meals also have Cheese?" The following will return, for each ingredient, the number of meals containing it and also the number of meals containing it AND cheese. The trick is that COUNT only counts non-null values.

SELECT Other.Ingredient, 
       COUNT(*) AS TotalMeals, 
       COUNT(Cheese.Ingredient) AS CheesyMeals
     FROM table Other
LEFT JOIN table Cheese
      ON (Cheese.Ingredient = 'Cheese' 
      AND Cheese.Meal_Num = Other.Meal_Num)
GROUP BY Other.Ingredient

Warning: returns wrong results if you include an ingredient twice in any one meal.

Edit: It turns out you aren't interested in Cheese specifically. You really want all the pairs of "correlations." So, we can abstract "Cheese" out and call them just the First and Second ingredients. I've added a "PossibleScore" to this one which tries to act like a percentage-of-meals but doesn't give a strong score if there are very few instances of the ingredient.

SELECT First.Ingredient, 
       Second.Ingredient, 
       COUNT(*) AS MealsWithFirst, 
       COUNT(First.Ingredient) AS MealsWithBoth,
       COUNT(First.Ingredient) / (COUNT(*) + 3) AS PossibleScore,
     FROM table First
LEFT JOIN table Second
      ON (First.Meal_Num = Second.Meal_Num)
GROUP BY First.Ingredient, Second.Ingredient

When sorted by score, this should return

PASTA    CHEESE    2    2    0.400
CHEESE   PASTA     3    2    0.333
BEEF     CHEESE    1    1    0.250
BEEF     PASTA     1    1    0.250
FISH     CHEESE    1    1    0.250
FISH     PASTA     1    1    0.250
CHICKEN  CHEESE    1    1    0.250
PASTA    BEEF      2    1    0.200
PASTA    FISH      2    1    0.200
CHEESE   BEEF      3    1    0.167
CHEESE   FISH      3    1    0.167
CHEESE   CHICKEN   3    1    0.167
like image 153
Chris Cunningham Avatar answered Sep 22 '22 16:09

Chris Cunningham


Do a self join to get all the in ingredient combinations, then corr by the two meal_nums

SELECT t1.INGREDIENT, t2.INGREDIENT, CORR(t1.MEAL_NUM, t2.MEAL_NUM)
FROM TheTable t1, TheTable t2
WHERE t1.INGREDIENT < t2.INGREDIENT
GROUP BY t1.INGREDIENT, t2.INGREDIENT

Should give you something like:

BEEF    CHEESE  0.999
BEEF    PASTA   0.998
CHEESE  PASTA   0.977

UPDATE: as Chris points out, this won't work as is. What I was hoping is that there might be some way to fudge a mapping from the ordinal meal_num to an interval (@Chris, thanks for the link) value. That may not be possible, in which case this answer wouldn't help.

like image 34
James Scriven Avatar answered Sep 20 '22 16:09

James Scriven


Try DBMS_FREQUENT_ITEMSET:

--Create sample data
create table meals(meal_num number, ingredient varchar2(10));

insert into meals
select 1, 'BEEF' from dual union all
select 1, 'CHEESE' from dual union all
select 1, 'PASTA' from dual union all
select 2, 'CHEESE' from dual union all
select 2, 'PASTA' from dual union all
select 2, 'FISH' from dual union all
select 3, 'CHEESE' from dual union all
select 3, 'CHICKEN' from dual;

commit;

--Create nested table type to hold results
CREATE OR REPLACE TYPE fi_varchar_nt AS TABLE OF VARCHAR2(10);
/

--Find the items most frequently combined with CHEESE.
select bt.setid, nt.column_value, support occurances_of_itemset
    ,length, total_tranx
from
(
    select
        cast(itemset as fi_varchar_nt) itemset, rownum setid
        ,support, length, total_tranx
    from table(dbms_frequent_itemset.fi_transactional(
        tranx_cursor => cursor(select meal_num, ingredient from meals),
        support_threshold => 0,
        itemset_length_min => 2,
        itemset_length_max => 2,
        including_items => cursor(select 'CHEESE' from dual),
        excluding_items => null))
) bt,
table(bt.itemset) nt
where column_value <> 'CHEESE'
order by 3 desc;


     SETID COLUMN_VAL OCCURANCES_OF_ITEMSET     LENGTH TOTAL_TRANX
---------- ---------- --------------------- ---------- -----------
         4 PASTA                          2          2           3
         3 FISH                           1          2           3
         1 BEEF                           1          2           3
         2 CHICKEN                        1          2           3
like image 35
Jon Heller Avatar answered Sep 20 '22 16:09

Jon Heller