
Best Python clustering library to use for product data analysis [closed]

I have a collection of alphanumeric product codes for various products. Similar products have no intrinsic similarity in their codes, i.e. product code "A123" might mean "Harry Potter Volume 1 DVD" and "B123" might mean "Kellogg's Corn Flakes". I also do not have the description or identity of the product. All I have is an "owner" of each code. My data, therefore, looks (in a non-normalised way) something like this:

Owner1: ProductCodes A123,B124,W555,M221,M556,127,102

Owner2: ProductCode D103,Z552,K112,L3254,223,112

Owner3: ProductCode G123

....

I have huge (i.e. terabytes) sets of this data.

I assume that an owner would, for the most part, have an undetermined number of groups of similar products. For example, an owner might have just 2 groups: all the DVDs and books of Harry Potter, and a collection of "Iron Maiden" CDs. I would like to analyse this data and determine distance functions between product codes, so that I can start making assumptions about how close product codes are to each other and also cluster product codes (which would also tell me how many groups an owner has). I have started doing some research on textual clustering algorithms, but there are numerous ones to choose from and I'm not sure which one(s) work best in this scenario.

Can someone point me towards the most appropriate Python-based clustering functions / libraries to use, please?

Richard Green asked Feb 15 '11 10:02




2 Answers

What you have is a bipartite graph. As an initial stab, it sounds like you are going to treat neighbour lists as zero-one vectors between which you define some kind of similarity or correlation. This could be a normalised Hamming distance, for example. Depending on which way round you do that, you will obtain a graph on a single domain: either product codes or owners. It will shortly become clear why I've cast everything in the language of graphs, so bear with me.

Now, why do you insist on a Python implementation? Clustering large-scale data is time- and memory-consuming. To let the cat out of the bag, I have written and still maintain a graph clustering algorithm that is used quite widely in bioinformatics. It is threaded, accepts weighted graphs, and has been used for graphs with millions of nodes and close to a billion edges. Refer to http://micans.org/mcl/ for more information.

Of course, if you trawl Stack Overflow and Stack Exchange, there are quite a few threads that may be of interest to you. I would recommend the Louvain method as well, except that I am not sure whether it accepts weighted networks, which you will probably produce.
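To make the first step concrete, here is a minimal sketch of turning owner lists into zero-one vectors per product code and deriving a weighted graph on product codes from the normalised Hamming distance. The toy `owners` data and the function names `code_vectors`, `normalised_hamming` and `weighted_edges` are made up for illustration. It is an all-pairs, in-memory version, so it would need a sparse or streaming rewrite before it could cope with terabytes.

```python
from itertools import combinations

# Hypothetical toy data in the shape of the question: owner -> set of product codes.
owners = {
    "Owner1": {"A123", "B124", "W555"},
    "Owner2": {"A123", "B124", "K112"},
    "Owner3": {"W555", "K112"},
    "Owner4": {"A123", "B124"},
}


def code_vectors(owners):
    """Map each product code to a zero-one tuple over the sorted owner list."""
    owner_order = sorted(owners)
    codes = sorted(set().union(*owners.values()))
    return {
        code: tuple(1 if code in owners[o] else 0 for o in owner_order)
        for code in codes
    }


def normalised_hamming(u, v):
    """Fraction of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


def weighted_edges(vectors):
    """Yield (code_a, code_b, weight), with weight = 1 - normalised Hamming distance."""
    for a, b in combinations(sorted(vectors), 2):
        weight = 1.0 - normalised_hamming(vectors[a], vectors[b])
        if weight > 0:  # drop pairs that disagree on every owner
            yield a, b, weight


vectors = code_vectors(owners)
edges = list(weighted_edges(vectors))

# Print a tab-separated weighted edge list, one edge per line.
for a, b, w in edges:
    print(f"{a}\t{b}\t{w:.2f}")
```

One caveat with this choice of measure: with many owners and sparse lists, the Hamming distance counts shared absences as agreement, so almost every pair of codes looks similar. A measure that ignores shared absences, such as the Jaccard index, may be a better fit. Whichever measure is used, the resulting weighted edge list is the kind of input a graph clustering tool works on; check the tool's documentation for the exact file format it expects.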

micans answered Sep 23 '22 19:09


The R language has many packages for finding groups in data, and there are Python bindings to R, called RPy. R provides several of the algorithms already mentioned here, which are also known to perform well on large datasets.

eGlyph answered Sep 21 '22 19:09