I have a collection of alphanumeric product codes for various products. Similar products have no intrinsic similarity in their codes, i.e. product code "A123" might mean "Harry Potter Volume 1 DVD" and "B123" might mean "Kellogg's Corn Flakes". I also do not actually have the description or identity of the product. All I have is an "owner" of each code. My data, therefore, looks (in denormalized form) something like this:
Owner1: ProductCodes A123,B124,W555,M221,M556,127,102
Owner2: ProductCodes D103,Z552,K112,L3254,223,112
Owner3: ProductCodes G123
....
I have huge (i.e. terabytes) sets of this data.
I assume that most owners have an undetermined number of groups of similar products; e.g. an owner might have just two groups: all the Harry Potter DVDs and books, plus a collection of "Iron Maiden" CDs. I would like to analyse this data and derive distance functions between product codes, so I can start making assumptions about how close product codes are to each other, and also cluster product codes (which would also let me identify how many groups an owner has). I have started doing some research on textual clustering algorithms, but there are numerous ones to choose from and I'm not sure which work best for this scenario.
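To make concrete what I mean by a distance function, here is a toy sketch using Jaccard similarity over shared owners (just an illustration; I am not settled on this measure):

```python
from itertools import combinations

# The data from above, as owner -> set of product codes
owners = {
    "Owner1": {"A123", "B124", "W555", "M221", "M556", "127", "102"},
    "Owner2": {"D103", "Z552", "K112", "L3254", "223", "112"},
    "Owner3": {"G123"},
}

# Invert to: product code -> set of owners holding it
code_owners = {}
for owner, codes in owners.items():
    for code in codes:
        code_owners.setdefault(code, set()).add(owner)

def jaccard(a, b):
    """Similarity of two codes = overlap of the owner sets holding them."""
    oa, ob = code_owners[a], code_owners[b]
    return len(oa & ob) / len(oa | ob)

# With only three owners every weight here is 0 or 1; on the real data
# (many owners per code) these become fractional.
for a, b in combinations(sorted(code_owners), 2):
    if jaccard(a, b) > 0:
        print(a, b, jaccard(a, b))
```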
Can someone point me towards the most appropriate Python-based clustering functions/libraries to use, please?
The scikit-learn library provides a suite of different clustering algorithms to choose from. Ten of the more popular ones are Affinity Propagation, Agglomerative Clustering, BIRCH, DBSCAN, K-Means, Mini-Batch K-Means, Mean Shift, OPTICS, Spectral Clustering, and Gaussian Mixture models; a short example follows below.
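As a minimal sketch of one of these, assuming scikit-learn >= 1.2 (where the keyword is `metric`; older releases call it `affinity`), you could run Agglomerative Clustering on a precomputed Jaccard distance matrix over owner-membership vectors. The tiny 0/1 matrix here is made up purely for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering

# Rows: product codes; columns: owners; True if that owner holds the code.
codes = ["A123", "B124", "D103", "G123"]
X = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=bool)

# Pairwise Jaccard distances between the owner-vectors of the codes.
D = squareform(pdist(X, metric="jaccard"))

# Average-linkage agglomerative clustering on the precomputed matrix.
model = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                linkage="average")
labels = model.fit_predict(D)
print(dict(zip(codes, labels)))   # e.g. {'A123': 0, 'B124': 0, ...}
```

DBSCAN with metric="jaccard" directly on the boolean matrix is a similar option that avoids materialising the full distance matrix.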
What you have is a bipartite graph. As an initial stab, it sounds like you are going to treat neighbour lists as zero-one vectors between which you define some kind of similarity/correlation; this could be a normalised Hamming distance, for example. Depending on which way you do that, you will obtain a graph on a single domain: either product codes or owners. It will shortly become clear why I've cast everything in the language of graphs, so bear with me.

Now, why do you insist on a Python implementation? Clustering large-scale data is time- and memory-consuming. To let the cat out of the bag: I have written, and still maintain, a graph clustering algorithm that is used quite widely in bioinformatics. It is threaded, accepts weighted graphs, and has been used on graphs with millions of nodes and towards a billion edges. Refer to http://micans.org/mcl/ for more information; a sketch of preparing input for it follows below.

Of course, if you trawl Stack Overflow and Stack Exchange, there are quite a few threads that may be of interest to you. I would recommend the Louvain method as well, except that I am not sure whether it accepts weighted networks, which you will probably produce.
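To illustrate the pipeline suggested above, here is a minimal sketch (the file name, the Jaccard weighting, and the inflation value are my own choices, not prescribed by MCL) that writes product-code similarities in MCL's label ("ABC") input format and then hands them to the mcl command line:

```python
from itertools import combinations

# Map each product code to the set of owners holding it
# (tiny made-up sample; in practice this is built from your real data).
code_owners = {
    "A123": {"Owner1", "Owner4"},
    "B124": {"Owner1"},
    "D103": {"Owner2"},
}

# Write one "codeA<TAB>codeB<TAB>weight" line per edge: MCL's label
# ("ABC") input format. Jaccard overlap of owner sets is one choice
# of weight; with real data most weights will be fractional.
with open("codes.abc", "w") as fh:
    for a, b in combinations(sorted(code_owners), 2):
        oa, ob = code_owners[a], code_owners[b]
        w = len(oa & ob) / len(oa | ob)
        if w > 0:
            fh.write(f"{a}\t{b}\t{w}\n")

# Then, on the command line (the inflation flag -I tunes granularity):
#   mcl codes.abc --abc -I 2.0 -o codes.clusters
# Each line of codes.clusters is one cluster of product codes.
```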
The R language has many packages for finding groups in data, and there are Python bindings to R called RPy. R provides several of the algorithms already mentioned here and is also known for good performance on large datasets.
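RPy has since been superseded by rpy2; purely as a sketch (assuming R and rpy2 are installed, and using R's built-in hclust rather than any particular package), calling R from Python could look like:

```python
import rpy2.robjects as ro

# Toy 4x2 numeric matrix, filled column-major as in R.
m = ro.r['matrix'](ro.FloatVector([0, 0, 5, 5, 0, 1, 5, 6]), nrow=4)
d = ro.r['dist'](m)                 # R dist(): pairwise Euclidean distances
tree = ro.r['hclust'](d)            # R hclust(): agglomerative clustering
groups = ro.r['cutree'](tree, k=2)  # cut the dendrogram into 2 groups
print(list(groups))                 # e.g. [1, 1, 2, 2]
```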