I have gathered a set of transactions in a CSV file of the format:
{Pierre, lait, oeuf, beurre, pain}
{Paul, mange du pain,jambon, lait}
{Jacques, oeuf, va chez la crémière, pain, voiture}
I plan to do a simple association rule analysis, but first I want to exclude from each transaction any items that do not belong to ReferenceSet = {lait, oeuf, beurre, pain}.
Thus my resulting dataset would be, in my example:
{Pierre, lait, oeuf, beurre, pain}
{Paul, lait}
{Jacques, oeuf, pain}
I'm sure this is quite simple, but I would love to read suggestions or answers to help me along.
Another answer uses %in%, but in this case intersect is even handier (you may want to look at match too; it is documented on the same help page as %in%). With lapply and intersect, the answer becomes a one-liner:
Data:
> L <- list(pierre=c("lait","oeuf","beurre","pain") ,
+           paul=c("mange du pain", "jambon", "lait"),
+           jacques=c("oeuf","va chez la crémière", "pain", "voiture"))
> reference <- c("lait", "oeuf", "beurre", "pain")
Answer:
> lapply(L,intersect,reference)
$pierre
[1] "lait"   "oeuf"   "beurre" "pain"  
$paul
[1] "lait"
$jacques
[1] "oeuf" "pain"
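To get from the CSV file to a list like L in the first place, something along these lines might work. This is only a sketch: it assumes the file has no header row and that the first field of each line is the customer name, and the `lines` vector is an inline stand-in for what `readLines("info.csv")` would return (the file name is an assumption borrowed from the other answer). The names `fields` and `filtered` are introduced here for illustration.

```r
# Stand-in for readLines("info.csv"); the file name is an assumption.
lines <- c("Pierre, lait, oeuf, beurre, pain",
           "Paul, mange du pain,jambon, lait",
           "Jacques, oeuf, va chez la crémière, pain, voiture")
reference <- c("lait", "oeuf", "beurre", "pain")

# Split each line on commas and trim the surrounding whitespace.
fields <- lapply(strsplit(lines, ",", fixed = TRUE), trimws)

# The first field is the name; the remaining fields are the items.
L <- lapply(fields, function(f) f[-1])
names(L) <- vapply(fields, function(f) f[1], character(1))

# Keep only the items that appear in the reference set.
filtered <- lapply(L, intersect, reference)
```

Note that intersect also drops duplicated items within a transaction. If duplicates should be kept, `lapply(L, function(x) x[x %in% reference])` is the equivalent filter without deduplication.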
One way is as follows. Because I keep the structure as a matrix, NAs are left where data has been removed (these could be dropped when exporting back to CSV). I'm sure it's possible to do this without loops, which would make it faster (but, IMHO, less readable), and there is probably a more efficient way to do the logic too. I'd be interested in seeing someone else's view on this.
ref <- c("lait","oeuf","beurre","pain")
input <- read.csv("info.csv",sep=",",header=FALSE,strip.white=TRUE)
> input
   V1            V2                  V3     V4      V5
1  Pierre          lait                oeuf beurre    pain
2    Paul mange du pain              jambon   lait        
3 Jacques          oeuf va chez la crémière   pain voiture
input <- as.matrix(input)
# Same shape as input, initially filled with NA
output <- matrix(nrow=nrow(input),ncol=ncol(input))
for(i in seq_len(nrow(input))) {
  j <- 2                      # next free column in row i of output
  output[i,1] <- input[i,1]   # column 1 holds the customer name
  for(k in 2:ncol(input)) {
    if(input[i,k] %in% ref){  # keep only items in the reference set
      output[i,j] <- input[i,k]
      j <- j+1
    }
  }
}
> output
     [,1]      [,2]   [,3]   [,4]     [,5]  
[1,] "Pierre"  "lait" "oeuf" "beurre" "pain"
[2,] "Paul"    "lait" NA     NA       NA    
[3,] "Jacques" "oeuf" "pain" NA       NA    
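As for doing this without explicit loops, one possible version uses apply over the rows. It is a sketch rather than a benchmarked improvement: the `input` matrix is built inline as a stand-in for the result of read.csv plus as.matrix, and `filter_row` is a helper name made up for this example.

```r
ref <- c("lait", "oeuf", "beurre", "pain")

# Inline stand-in for as.matrix(read.csv("info.csv", ...)).
input <- rbind(c("Pierre", "lait", "oeuf", "beurre", "pain"),
               c("Paul", "mange du pain", "jambon", "lait", ""),
               c("Jacques", "oeuf", "va chez la crémière", "pain", "voiture"))

# Hypothetical helper: keep the name, keep the items found in ref,
# then pad with NA so that every row has the same length.
filter_row <- function(row) {
  items <- row[-1]
  kept <- items[items %in% ref]
  c(row[1], kept, rep(NA_character_, length(items) - length(kept)))
}

# apply() returns one column per input row, so transpose the result back.
output <- t(apply(input, 1, filter_row))
```

The result should have the same shape as the loop version: the name in column 1, the retained items packed to the left, and NA in the remaining cells.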