Combining different similarities to build one final similarity

Tags: distance, cluster-analysis, similarity, data-mining

I'm pretty new to data mining and recommendation systems, and I'm trying to build a recommendation system for users who have the following parameters:

city
education
interest

To calculate the similarity between users, I'm going to apply cosine similarity and a discrete similarity. For example:

city: if x = y then d(x,y) = 0. Otherwise, d(x,y) = 1.
education: here I will use cosine similarity over the words that appear in the name of the department or bachelor's degree.
interest: there will be a hardcoded number of interests a user can choose from, and cosine similarity will be calculated on two vectors like this:

1 0 0 1 0 0 ... n
1 1 1 0 1 0 ... n

where 1 means the presence of the interest and n is the total number of all interests.

My question is: how should I combine these three similarities in an appropriate way? Just summing them doesn't sound very smart, does it? I would also like to hear comments on my "newbie similarity system", hah.

asked Nov 20 '11 13:11

Leg0

2 Answers

There are no hard-and-fast answers, since they depend greatly on your input and problem domain. For this reason, a lot of the work of machine learning is the art (not science) of preparing your input. I can give you some general ideas to think about. You have two issues: making a meaningful similarity out of each of these items, and then combining them.

The city similarity sounds reasonable but really depends on your domain. Is it really the case that being in the same city means everything, and being in neighboring cities means nothing? For example, does being in similarly-sized cities count for anything? Being in the same state? If they do, your similarity should reflect that.
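As a rough sketch of what a graded city similarity could look like, the snippet below gives full credit for the same city and partial credit for the same state. The `CITY_STATE` table, the function name, and the 0.5 partial score are all made-up assumptions for illustration; your own domain decides what counts as "close".

```python
# Hypothetical lookup from city to state; real data would come from your own source.
CITY_STATE = {
    "Austin": "TX",
    "Dallas": "TX",
    "Boston": "MA",
}


def city_similarity(a: str, b: str, same_state_score: float = 0.5) -> float:
    """Return 1.0 for the same city, a partial score for the same state, else 0.0."""
    if a == b:
        return 1.0
    state_a = CITY_STATE.get(a)
    state_b = CITY_STATE.get(b)
    # Unknown cities never earn the partial score.
    if state_a is not None and state_a == state_b:
        return same_state_score
    return 0.0
```

The same pattern extends to other notions of closeness (city size bucket, distance in km), as long as the result stays in [0,1].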

Education: I understand why you might use cosine similarity but that is not going to address the real problem here, which is handling different tokens that mean the same thing. You need "eng" and "engineering" to match, and "ba" and "bachelors", things like that. Once you prepare the tokens that way it might give good results.
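A minimal sketch of that preparation step, assuming a small hand-written alias table: tokens are lowercased, mapped to a canonical form, and only then compared with cosine similarity over word counts. The `ALIASES` entries and the function names are hypothetical; a real table would be built from your own data.

```python
import math
import re
from collections import Counter

# Hypothetical alias table mapping variants to one canonical token.
ALIASES = {
    "eng": "engineering",
    "ba": "bachelors",
    "bachelor": "bachelors",
}


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-letters, and replace known aliases."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [ALIASES.get(t, t) for t in tokens]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words count vectors of two strings."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty input
    return dot / (norm_a * norm_b)


# "BA Eng" and "Bachelors of Engineering" now share two tokens.
score = cosine_similarity("BA Eng", "Bachelors of Engineering")
```

Without the alias step the two strings above would share no tokens at all and score 0.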

Interest: I don't think cosine will be the best choice here. Try a simple Tanimoto coefficient similarity (just the size of the intersection over the size of the union).
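For binary interest vectors like the ones in the question, the Tanimoto coefficient can be sketched as follows. The function name and the choice to return 0.0 when neither user has any interest are assumptions made for this example.

```python
def tanimoto(a: list[int], b: list[int]) -> float:
    """Size of the intersection over size of the union for two 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)     # interests shared by both users
    either = sum(1 for x, y in zip(a, b) if x or y)    # interests held by at least one user
    # Convention chosen here: two users with no interests at all score 0.0.
    return both / either if either else 0.0


u = [1, 0, 0, 1, 0, 0]
v = [1, 1, 1, 0, 1, 0]
similarity = tanimoto(u, v)  # 1 shared interest out of 5 distinct ones
```

Here the two users share one interest out of five distinct ones, so the similarity is 1/5.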

You can't just sum them, as I assume you still want a value in the range [0,1]. You could average them. That assumes the outputs of the three measures are directly comparable, that they're in the same "units" if you will. They aren't here; for example, it's not as if they are probabilities.

It might still work OK in practice to average them, perhaps with weights. For example, a plain average makes being in the same city as important as having exactly the same interests. Is that true, or should it be less important?
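A weighted average along these lines might look like the sketch below. The scores and weights are invented numbers for illustration, not recommendations; the weights are exactly the thing you would tune against your own data.

```python
def combine(similarities: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-feature similarities; stays in [0,1] if the inputs do."""
    total = sum(weights[name] for name in similarities)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[name] * s for name, s in similarities.items()) / total


# Made-up per-feature scores for one pair of users.
sims = {"city": 1.0, "education": 0.8, "interest": 0.2}

# Equal weights reproduce the plain average.
plain = combine(sims, {"city": 1.0, "education": 1.0, "interest": 1.0})

# Hypothetical weights that make shared interests count three times as much as city.
weighted = combine(sims, {"city": 1.0, "education": 2.0, "interest": 3.0})
```

Dividing by the sum of the weights is what keeps the result in [0,1] regardless of how the weights are scaled.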

You can try and test different variations and weights as hopefully you have some scheme for testing against historical data. I would point you at our project, Mahout, as it has a complete framework for recommenders and evaluation.

However, all these sorts of solutions are hacky and heuristic. I think you might want to take a more formal approach to feature encoding and similarities. If you're willing to buy a book and like Mahout, Mahout in Action has good coverage in the clustering chapters of how to select and encode features and then how to make one similarity out of them.

answered Oct 22 '22 16:10

Sean Owen

Here's the usual trick in machine learning.

> city: if x = y then d(x,y) = 0. Otherwise, d(x,y) = 1.

I take this to mean you use a one-of-K coding. That's good.

> education: here I will use cosine similarity over the words that appear in the name of the department or bachelor's degree.

You can also use a one-of-K coding here, to produce a vector of size |V| where V is the vocabulary, i.e. all words in your training data.

If you now normalize the interest values so that they always fall in the range [0,1] (binary indicators already do), then you can concatenate the three encodings and use ordinary L1 (Manhattan) or L2 (Euclidean) distance metrics between your final vectors. On vectors scaled to unit length, the latter gives the same ranking as the cosine similarity metric of information retrieval.

Experiment with L1 and L2 to decide which is best.
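Here is a small sketch of this encoding using only the standard library. The city list, vocabulary, and the two example users are hypothetical; in practice they would come from your training data.

```python
import math

CITIES = ["Austin", "Boston", "Dallas"]           # hypothetical set of known cities
VOCAB = ["bachelors", "engineering", "physics"]   # hypothetical education vocabulary


def encode(city: str, education_tokens: set[str], interests: list[int]) -> list[float]:
    """Concatenate one-of-K city, word-presence education, and 0/1 interest parts."""
    city_part = [1.0 if city == c else 0.0 for c in CITIES]
    edu_part = [1.0 if w in education_tokens else 0.0 for w in VOCAB]
    return city_part + edu_part + [float(x) for x in interests]


def l1(a: list[float], b: list[float]) -> float:
    """Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2(a: list[float], b: list[float]) -> float:
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def unit(v: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, computed as the dot product of the unit vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


alice = encode("Austin", {"bachelors", "engineering"}, [1, 0, 0, 1])
bob = encode("Boston", {"bachelors", "physics"}, [1, 1, 0, 0])

d1 = l1(alice, bob)
d2 = l2(alice, bob)
# For unit-length vectors: squared L2 distance equals 2 - 2 * cosine similarity.
gap = abs(l2(unit(alice), unit(bob)) ** 2 - (2 - 2 * cosine(alice, bob)))
```

Because the squared Euclidean distance between unit vectors is 2 - 2 * cosine, ranking neighbors by L2 on normalized vectors and ranking them by cosine similarity give the same order.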

answered Oct 22 '22 17:10

Fred Foo
