How to get items for both LHS and RHS for only specific columns in arules?

Tags:

r

arules

apriori

Within the apriori function, I want the LHS of the resulting rules to contain only these two items: HouseOwnerFlag=0 and HouseOwnerFlag=1. The RHS should contain only items from the Product column. For instance:

#   lhs                   rhs                                          support confidence     lift
# 1 {HouseOwnerFlag=0}    => {Product=SV 16xDVD M360 Black}            0.2500000  0.2500000 1.000000
# 2 {HouseOwnerFlag=1}    => {Product=Adventure Works 26" 720p}        0.2500000  0.2500000 1.000000
# 3 {HouseOwnerFlag=0}    => {Product=Litware Wall Lamp E3015 Silver}  0.1666667  0.3333333 1.333333
# 4 {HouseOwnerFlag=1}    => {Product=Contoso Coffee Maker 5C E0900}   0.1666667  0.3333333 1.333333

Part of this is already answered in another question: R arules, mine only rules from specific column

So now I use the following:
rules <- apriori(sales, parameter = list(support = 0.01, confidence = 0.8, minlen = 2), appearance = list(lhs = c("HouseOwnerFlag=0", "HouseOwnerFlag=1")))

Then I use this from that other SO question to ensure that only the Product column is on the RHS:
inspect( subset( rules, subset = rhs %pin% "Product=" ) )

The outcome is like this:

#   lhs                                                                  rhs                                          support confidence     lift
# 1 {ProductKey=153, IncomeGroup=Moderate, BrandName=Adventure Works }    => {Product=SV 16xDVD M360 Black}            0.2500000  0.2500000 1.000000
# 2 {ProductKey=176, MaritalStatus=M, ProductCategoryName=TV and Video }  => {Product=Adventure Works 26" 720p}        0.2500000  0.2500000 1.000000
# 3 {BrandName=Southridge Video, NumberChildrenAtHome=0 }                 => {Product=Litware Wall Lamp E3015 Silver}  0.1666667  0.3333333 1.333333
# 4 {HouseOwnerFlag=1, BrandName=Southridge Video, ProductKey=170 }       => {Product=Contoso Coffee Maker 5C E0900}   0.1666667  0.3333333 1.333333

So apparently the LHS can still contain any column, not just HouseOwnerFlag as I specified. From other Stack Overflow questions, I see that I can put default="rhs" in the apriori call, like so:
rules <- apriori(sales, parameter = list(support = 0.001, confidence = 0.5, minlen = 2), appearance = list(lhs = c("HouseOwnerFlag=0", "HouseOwnerFlag=1"), default = "rhs"))

Then upon inspecting (without the subset part, just inspect(rules)), there are far fewer rules (7) than before, but the LHS does indeed contain only HouseOwnerFlag:

#   lhs                   rhs                           support    confidence lift
# 1 {HouseOwnerFlag=0}    => {MaritalStatus=S}          0.2500000  0.2500000  1.000000
# 2 {HouseOwnerFlag=1}    => {Gender=M}                 0.2500000  0.2500000  1.000000
# 3 {HouseOwnerFlag=0}    => {NumberChildrenAtHome=0}   0.1666667  0.3333333  1.333333
# 4 {HouseOwnerFlag=1}    => {Gender=M}                 0.1666667  0.3333333  1.333333

However, nothing from the Product column appears on the RHS. So there is no point in inspecting it with subset, as of course it would return nothing. I tried several different support values to see whether Product would appear, but the same 7 rules come back every time.

So my question is, how can I specify both the LHS (HouseOwnerFlag) and RHS (Product)? What am I doing wrong?

EDIT: You can reproduce this problem by downloading this test dataset from https://www.dropbox.com/s/tax5xalac5xgxtf/testdf.txt?dl=0. Mind you, I only took the first 20 rows from a huge dataset, so unfortunately the output here won't have the same product names as the example I displayed above. But the problem remains the same: I want to get only HouseOwnerFlag=0 and/or HouseOwnerFlag=1 on the LHS and only the Product column on the RHS.

Kim asked Jan 13 '15 15:01



2 Answers

It seems that one can't constrain lhs and rhs at once (I had not realised this before playing with your data), but you can use subset. EDIT: I was wrong; you can also constrain lhs and rhs at once, see Solution 2 below. I am keeping Solution 1 because in some cases it might be useful to compute a bigger rule set and then split it by the left-hand side.

Solution 1:

rules_sales <- apriori(sales, 
                   parameter=list(support =0.001, confidence =0.5, minlen=2, maxlen=2), 
                   appearance = list(lhs=c("HouseOwnerFlag=0", "HouseOwnerFlag=1"), 
                                     default="rhs"))

rules_subset <- subset(rules_sales, (rhs %in% paste0("Product=", unique(sales$Product))))
inspect(rules_subset)

gives:

  lhs                   rhs                                                support confidence lift
1 {HouseOwnerFlag=0} => {Product=SV DVD Movies E100 Yellow}                   0.05        0.5   10
2 {HouseOwnerFlag=0} => {Product=Fabrikam Refrigerator 4.6CuFt E2800 Grey}    0.05        0.5    5
3 {HouseOwnerFlag=1} => {Product=Contoso SLR Camera M144 Gold}                0.10        0.5    5

But you should be careful about your low support:

Warning in apriori(sales, parameter = list(support = 0.001, confidence = 0.5,  :
  You chose a very low absolute support count of 0. You might run out of memory! Increase minimum support.
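
To see where the warning comes from, the relative support can be converted into an approximate absolute transaction count. This is a minimal sketch in base R, assuming the 20-row sample from the question; the variable names are illustrative, and the exact rounding that arules applies internally is not reproduced here.

```r
# Assumption: the 20-row test data set from the question.
n_transactions <- 20
min_support    <- 0.001

# Approximate number of transactions a rule must cover to reach the threshold.
min_count <- min_support * n_transactions
print(min_count)

# Smallest relative support that still demands at least one full transaction.
one_transaction_support <- 1 / n_transactions
print(one_transaction_support)
```

With only 20 transactions, a threshold of 0.001 corresponds to far less than one transaction, so it filters nothing; a single row already contributes a support of 1/20, which matches the smallest support values in the output above.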

Solution 2:

I was tricked by the definition of the default parameter. Listing items under lhs and rhs tells apriori that each of those items may only appear on the side it is assigned to. The default parameter is set to "both" automatically, so all other items not listed under lhs or rhs can still appear on either side. (The appearance parameter as implemented in the R package is explained at http://www.inside-r.org/node/86290; I realised this must be possible when reading the manual of the original C implementation: http://www.borgelt.net/doc/apriori/apriori.html#appearin.) You have to set default="none"; then you can constrain lhs and rhs without using a subset afterwards.

rules_sales <- apriori(sales, 
                       parameter = list(support = 0.001, confidence = 0.5, minlen = 2, maxlen = 2), 
                       appearance = list(lhs = c("HouseOwnerFlag=0", "HouseOwnerFlag=1"), 
                                         rhs = paste0("Product=", unique(sales$Product)), 
                                         default = "none"))
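
For readers without the original data, the same idea can be tried on a small made-up data frame. This is only a sketch: sales_toy and its values are hypothetical stand-ins for sales, the thresholds are chosen to suit six rows, and it assumes the arules package is installed.

```r
library(arules)

# Hypothetical toy data standing in for `sales`; every column is a factor,
# so each column/level pair becomes an item such as "Product=Lamp".
sales_toy <- data.frame(
  HouseOwnerFlag = factor(c("0", "0", "1", "1", "1", "0")),
  Gender         = factor(c("M", "F", "M", "F", "M", "F")),
  Product        = factor(c("Lamp", "Lamp", "Camera", "Camera", "Camera", "TV"))
)

# Build the item labels for each side from the factor levels.
lhs_items <- paste0("HouseOwnerFlag=", levels(sales_toy$HouseOwnerFlag))
rhs_items <- paste0("Product=", levels(sales_toy$Product))

# default = "none" keeps every unlisted item (here: Gender) out of the rules.
rules_toy <- apriori(
  sales_toy,
  parameter  = list(support = 0.2, confidence = 0.5, minlen = 2, maxlen = 2),
  appearance = list(lhs = lhs_items, rhs = rhs_items, default = "none"),
  control    = list(verbose = FALSE)
)

inspect(rules_toy)
```

Every rule returned this way should have a HouseOwnerFlag item on the left and a Product item on the right, so no subset step is needed afterwards.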
Verena Haunschmid answered Nov 14 '22 21:11



I am very late to the party, but as I am also playing with the package now, let me add my thoughts in case they are helpful to someone.

The rules included in the output are the ones that satisfy the support and confidence parameters. So, if you don't get any rules of the form you expect, try relaxing these constraints: lower the support and lower the confidence. The rhs, as far as I have found, can only contain one item, so you could restrict that side to the items you want to appear (Product) in order to speed up rule generation. I haven't tried this on your specific dataset, but I think this is general advice that should work in all cases.
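
Before relaxing the thresholds, it can help to compute the measures for one candidate rule by hand and see whether it could ever pass them. Here is a minimal sketch in base R using the standard definitions of support, confidence and lift; the data frame and its values are made up for illustration.

```r
# Hypothetical toy data; column names follow the question, values are invented.
sales_toy <- data.frame(
  HouseOwnerFlag = c("0", "0", "1", "1", "1", "0"),
  Product        = c("Lamp", "Lamp", "Camera", "Camera", "Camera", "TV")
)

# Candidate rule: {HouseOwnerFlag=0} => {Product=Lamp}
has_lhs <- sales_toy$HouseOwnerFlag == "0"
has_rhs <- sales_toy$Product == "Lamp"

support    <- mean(has_lhs & has_rhs)     # share of rows containing both sides
confidence <- support / mean(has_lhs)     # share of LHS rows that also contain the RHS
lift       <- confidence / mean(has_rhs)  # confidence relative to the RHS base rate

print(c(support = support, confidence = confidence, lift = lift))
```

If the confidence computed this way is below the confidence passed to apriori, the rule is dropped no matter how low the support threshold is set.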

Picarus answered Nov 14 '22 22:11
