Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding 1-itemset, 2-itemset...n-itemset from xml document

<?xml version="1.0" encoding="utf-8"?>
<transaction>
    <itemset id="1">
        <item>pen</item>
        <item>notebook</item>
        <item>pencile box</item>
    </itemset>
<itemset id="2">
    <item>pen</item>
    <item>notebook</item>
    <item>pencile box</item>
    <item>ink</item>
</itemset>
<itemset id="3">
    <item>ink</item>
    <item>milk</item>
</itemset>
<itemset id="4">
    <item>pen</item>
    <item>notebook</item>
    <item>pencile box</item>
    <item>ink</item>
    <item>milk</item>
    <item>paper</item>
</itemset>
<itemset id="5">
    <item>pen</item>
    <item>ink</item>
    <item>notebook</item>
</itemset>
</transaction>

I am new in XQuery. I am using BaseX to run XQuery. My target is to find 1- frequent 2-frequent itemset... n-frequent itemset based on very popular Apriori algorithm from each transaction and want to count them.

I have already done to find 1-frequent itemset and their support count. But I can't find the 2-itemset and their support count using XQuery. Could anyone please help me to do that? Here is the code I've tried...

let $src:= doc('XML/test.xml')/transaction/itemset
let $minsup:= 3
let $C:=distinct-values($src/*)
let $l:=for $itemset in $C
        let $items:=(for $item in $src/*
                     where $itemset=$item
                     return $item)
                     let $sup := count($items)
        where $sup>=$minsup
        order by $sup descending
return<itemset>
      <item> {$itemset} </item>
      <support> {$sup} </support>
      </itemset>

let $C1:=distinct-values($l/item)
let $C2:=distinct-values($l/item)
let $cnt:=count(distinct-values($l/item))
let $counter:=0
for $a in $C1
    for $b in $C2
    where $a>$b
return <items>
       <item>{concat($a,',',$b)}</item>
       </items>

Expected output looks like

<itemset>
  <item>pen</item>
  <support>4</support>
</itemset>
<itemset>
  <item>notebook</item>
  <support>4</support>
</itemset>
...............
<itemset>
  <item>pen</item>
  <item>notebook</item>
  <support>4</support>
</itemset>
<itemset>
  <item>pen</item>
  <item>pencile box</item>
  <support>3</support>
</itemset>
................
<itemset>
  <item>pen</item>
  <item>notebook</item>
  <item>pencile box</item>
  <support>3</support>
</itemset>
<itemset>
  <item>pen</item>
  <item>notebook</item>
  <item>ink</item>
  <support>3</support>
</itemset>
...............
like image 240
utij2004 Avatar asked Nov 02 '22 10:11

utij2004


1 Answers

It's not entirely clear what you are trying to do. What I say here is based on the belief that you want is a frequency list, first of individual item values, then pairs of items, then triples, then n-tuples for n > 3. Or, more abstractly: for $n = 1 to the number of distinct item values in the input, find each selection of $n distinct values from the set of distinct item values that occurs more than $minsup times.

(If that's not a correct description of the problem, then nothing I say from here on out is going to be relevant. If it is a correct description, then I congratulate you: it's a nice problem. It feels a bit like homework intended to force you to think about meta-programming or recursion; if so, I think it's good form to mention the fact.)

You have more than one challenge here. If it were my problem, I'd start with the case of $n = 1. The first part of your code calculates the result correctly for $n = 1, as the value of $l, so you're there already.

Then I'd write code by hand for $n = 2, $n = 3, etc., until I felt comfortable that I had understood the pattern. (In practice, I stopped after n = 4. Cleverer programmers than I would stop earlier.) You ask specifically about the case of $n = 2. The following code is similar to yours in some ways, but differs in detail. I've renamed some variables to make it easier for me to read.

let $src:= doc('.../utij.xml')/transaction
let $minsup:= 3
let $allitems := $src/itemset/item
let $C:=distinct-values($allitems)

let $n1 := for $value in $C
           let $itemsets := $src/itemset[item = $value]
           let $sup := count($itemsets)
           where $sup>=$minsup
           order by $sup descending
           return <itemset>
                    <item> {$value} </item>
                    <support> {$sup} </support>
                  </itemset>

let $n2 := for $v1 in $C, $v2 in $C
           let $itemsets := $src/itemset
                            [ item = $v1 
                              and item = $v2 
                              and $v1 gt $v2
                            ]
           let $sup := count($itemsets)
           where $sup ge $minsup
           order by $sup descending, $v1 ascending, $v2 ascending
           return <itemset>
                    <item>{$v1}</item>
                    <item>{$v2}</item>
                    <support>{$sup}</support>
                  </itemset>

return ($n1, '&#xA;........&#xA;', $n2)

Note that I've re-cast the $n=1 case to count itemset elements containing an item with the appropriate value, not item elements themselves. This makes the two blocks of code more similar, and should make the extension to $n greater than 2 obvious.

Then I'd stop and consider whether my code for $n = 1, 2, 3, ... can be generalized. Since (as your sketch already reflects) the straightforward way to write the code involves one variable ranging across all values for $n = 1, two variables for $n = 2, and i variables for $n = i, the answer is: this is going to be hard to generalize in the naive way. So I'd think about a function which accepts two inputs: first a sequence of itemset elements (the $src variable in my example) and second a sequence of n-tuples. As output, my function will return a sequence of n+1-tuples. (At this point, I'd add a third argument, for $C, to avoid having to recalculate it each time.) A second function can be set to call the first function for each value between 1 and the number of distinct item values in the input, and return a concatenation of the results.

like image 67
C. M. Sperberg-McQueen Avatar answered Nov 13 '22 22:11

C. M. Sperberg-McQueen