I've got a two-column dataset with about 30000 clusters and 10 factors like this:
cluster-1 Factor1
cluster-1 Factor2
...
cluster-2 Factor2
cluster-2 Factor3
...
And I would like to represent the co-occurrence of factors in the cluster set. Something like "Factor1+Factor3+Factor5 in 1234 clusters", and so on for the different combinations. I thought I could do something like a pie chart, but with 10 factors, I take it there would be too many combinations.
What would be a good way of representing this?
There is one good programming question in here that should be addressed:
How do I count the number of co-occurrences of factors in the different clusters?
First simulate some data:
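# Simulate n observations spread evenly over n.clusters clusters
n = 1000
set.seed(12345)
n.clusters = 100
clusters = rep(1:n.clusters, length.out=n)
# Draw factor IDs from a rounded normal, then clamp them to 1..n.factors
n.factors = 10
factors = round(rnorm(n, n.factors/2, n.factors/5))
factors[factors > n.factors] = n.factors
factors[factors < 1] = 1
data = data.frame(cluster=clusters, factor=factors)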
> data
  cluster factor
1       1      6
2       2      6
3       3      5
4       4      4
5       5      6
6       6      1
...
Then here is the code that could be used to tabulate the number of times each combination of factors occurs in the clusters:
counts = with(data, table(tapply(factor, cluster, function(x) paste(as.character(sort(unique(x))), collapse=''))))
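To see what that one-liner does, it can help to split it into steps on a tiny example. This is only a sketch: the `toy` data frame and the names `combo` and `toy.counts` are made up for illustration, and I've used '+' as the separator instead of '' so the labels read like "1+3" (my assumption about what is convenient, not a requirement).

```r
# Toy data (hypothetical, not from the question): 4 clusters, factors 1 to 3
toy = data.frame(cluster = c(1, 1, 2, 2, 3, 4, 4, 4),
                 factor  = c(1, 3, 3, 1, 2, 1, 2, 2))

# Step 1: reduce each cluster's factors to one sorted, de-duplicated label
combo = tapply(toy$factor, toy$cluster,
               function(x) paste(sort(unique(x)), collapse = "+"))

# Step 2: count how many clusters share each label
toy.counts = table(combo)
print(toy.counts)
```

Here clusters 1 and 2 both contain factors 1 and 3, so the label "1+3" is counted twice, while "2" and "1+2" are each counted once.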
This can be represented as a simple pie chart, for example,
dev.new(width=5, height=5)
pie(counts[counts>1])
but simple counts like this are often displayed most efficiently as a sorted table. For more on this, check out Edward Tufte's work on presenting quantitative data.
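A sorted table takes only a couple of lines. A minimal sketch, assuming you already have one combination label per cluster (the `labels` values below are invented, and `sorted.counts` and `sorted.df` are just illustrative names):

```r
# Hypothetical combination labels, one per cluster
labels = c("1+3", "2", "1+3", "1+2", "1+3", "2")

# Count each combination, then sort from most to least frequent
sorted.counts = sort(table(labels), decreasing = TRUE)

# Convert to a two-column data frame for printing or export
sorted.df = as.data.frame(sorted.counts, responseName = "clusters",
                          stringsAsFactors = FALSE)
print(sorted.df)
```

With the real data, the `counts` table computed earlier would take the place of `table(labels)`.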