Document Clustering in python using SciKit

I recently started working on document clustering using the scikit-learn module in Python. However, I am having a hard time understanding the basics of document clustering.

What I know:

  • Document clustering is typically done using TF-IDF, which essentially converts the words in the documents into a vector space model that is then fed to the algorithm.
  • There are many algorithms, like k-means, neural networks, and hierarchical clustering, to accomplish this.

My Data:

  • I am experimenting with LinkedIn data; each document is a LinkedIn profile summary. I would like to see whether profiles with similar jobs get clustered together.

Current Challenges:

  • My data has huge summary descriptions, which end up producing tens of thousands of features when I apply TF-IDF. Is there a proper way to handle this high-dimensional data?
  • K-means and other algorithms require that I specify the number of clusters (centroids), but in my case I do not know the number of clusters upfront. I believe this makes it a completely unsupervised learning problem. Are there algorithms that can determine the number of clusters themselves?
  • I've never worked with document clustering before; if you are aware of tutorials, textbooks, or articles that address this topic, please feel free to suggest them.

I went through the code on the scikit-learn web page, but it contains too many technical terms that I do not understand. If you have any code with good explanations or comments, please share. Thanks in advance.

asked Feb 10 '23 by Siddarth


1 Answer

My data has huge summary descriptions, which end up producing tens of thousands of features when I apply TF-IDF. Is there a proper way to handle this high-dimensional data?

My first suggestion is that you don't, unless you absolutely have to because of memory or execution-time problems.

If you must handle it, you should use dimensionality reduction (PCA, for example) or feature selection (probably better in your case; see chi2, for example).
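As a rough illustration, here is a minimal sketch of the dimensionality-reduction route using scikit-learn's TruncatedSVD, which (unlike plain PCA) works directly on sparse TF-IDF matrices. The documents and the component count are placeholders you would tune; note also that chi2 feature selection is supervised, so it only applies if you have labels for at least some of the data.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    # Placeholder documents; the real input would be the profile summaries.
    docs = [
        "software engineer working on backend systems",
        "data scientist with a focus on machine learning",
        "recruiter specializing in technical hiring",
    ]

    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(docs)      # sparse matrix: one row per document

    # TruncatedSVD (LSA) accepts sparse input, unlike PCA, so it is the
    # usual choice for TF-IDF. Use roughly 100-300 components on real data;
    # n_components must stay below the number of features, hence 2 here.
    svd = TruncatedSVD(n_components=2, random_state=0)
    X_reduced = svd.fit_transform(X)   # dense array: documents x components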

K-means and other algorithms require that I specify the number of clusters (centroids), but in my case I do not know the number of clusters upfront. I believe this makes it a completely unsupervised learning problem. Are there algorithms that can determine the number of clusters themselves?

If you look at the clustering algorithms available in scikit-learn, you'll see that not all of them require you to specify the number of clusters; DBSCAN and MeanShift, for example, infer it from the data.
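For instance, a minimal sketch with DBSCAN, reusing the TF-IDF matrix X from the snippet above; the eps and min_samples values are placeholders you would have to tune for your data.

    from sklearn.cluster import DBSCAN

    # Cosine distance is a common choice for TF-IDF vectors.
    db = DBSCAN(eps=0.7, min_samples=5, metric="cosine")
    labels = db.fit_predict(X)         # the label -1 marks noise points

    # Number of clusters found, excluding noise.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(n_clusters)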

Another one that does not is hierarchical clustering, implemented in scipy. Also see this answer.
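A minimal sketch of that route, assuming the reduced matrix X_reduced from the earlier snippet; the distance threshold t is data-dependent and would need tuning.

    from scipy.cluster.hierarchy import linkage, fcluster

    # Build the full merge tree (Ward linkage expects a dense matrix,
    # which X_reduced is after TruncatedSVD).
    Z = linkage(X_reduced, method="ward")

    # Cut the tree at a distance threshold instead of fixing a cluster count.
    labels = fcluster(Z, t=2.0, criterion="distance")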

I would also suggest that you use KMeans and try to manually tweak the number of clusters until you are satisfied with the results.
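One way to make that tweaking less manual is to scan a range of k values and compare silhouette scores (higher is better). A sketch, again assuming X_reduced from above and enough real documents (the placeholder set is too small for this range):

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # silhouette_score needs at least 2 clusters and fewer clusters than samples.
    for k in range(2, 10):
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
        labels = km.fit_predict(X_reduced)
        print(k, silhouette_score(X_reduced, labels))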

I've never worked with document clustering before; if you are aware of tutorials, textbooks, or articles that address this topic, please feel free to suggest them.

scikit-learn has plenty of tutorials for working with text data; just use the "text data" search query on their site. One is for KMeans; others are for supervised learning, but I suggest you go over those too to get more familiar with the library. From a coding, style, and syntax point of view, unsupervised and supervised learning are pretty similar in scikit-learn, in my opinion.

Document clustering is typically done using TF-IDF, which essentially converts the words in the documents into a vector space model that is then fed to the algorithm.

Minor correction here: TF-IDF has nothing to do with clustering itself. It is simply a method for turning text data into numerical data; it does not care what you do with that data afterwards (clustering, classification, regression, search-engine tasks, etc.).

I understand the message you were trying to get across, but it is incorrect to say that "clustering is done using TF-IDF". Clustering is done by a clustering algorithm; TF-IDF only plays a preprocessing role in document clustering.
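To make the separation concrete, a minimal sketch (with hypothetical documents) in which the vectorization step and the clustering step are independent, swappable pieces:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "software engineer at a startup",
        "backend developer, distributed systems",
        "data scientist, machine learning",
        "ml researcher working on deep learning",
    ]

    # Preprocessing: turn text into numbers. Nothing cluster-specific here.
    X = TfidfVectorizer().fit_transform(docs)

    # Clustering proper: any clusterer could consume X instead of KMeans.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)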

answered Feb 13 '23 by IVlad