I have a collection of alphanumeric product codes for various products. Similar products have no intrinsic similarity in their codes, i.e. product code "A123" might mean "Harry Potter Volume 1 DVD" and "B123" might mean "Kellogg's Corn Flakes". I also do not actually have the description or identity of the product. All I have is an "owner" of each code. My data, therefore, looks (in denormalized form) something like this:
Owner1: ProductCodes A123,B124,W555,M221,M556,127,102
Owner2: ProductCodes D103,Z552,K112,L3254,223,112
Owner3: ProductCodes G123
....
I have huge (i.e. terabytes) sets of this data.
I assume that most owners have an undetermined number of groups of similar products; e.g. an owner might have just two groups: all the Harry Potter DVDs and books, plus a collection of "Iron Maiden" CDs. I would like to analyse this data and derive distance functions between product codes, so I can start making assumptions about how close product codes are to each other, and also cluster product codes (which would also let me identify how many groups an owner has). I have started doing some research on textual clustering algorithms, but there are numerous ones to choose from and I'm not sure which work best for this scenario.
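To make concrete what I mean by a distance function, here is a toy sketch using Jaccard similarity over shared owners (just an illustration; I am not settled on this measure):

```python
from itertools import combinations

# The data from above, as owner -> set of product codes
owners = {
    "Owner1": {"A123", "B124", "W555", "M221", "M556", "127", "102"},
    "Owner2": {"D103", "Z552", "K112", "L3254", "223", "112"},
    "Owner3": {"G123"},
}

# Invert to: product code -> set of owners holding it
code_owners = {}
for owner, codes in owners.items():
    for code in codes:
        code_owners.setdefault(code, set()).add(owner)

def jaccard(a, b):
    """Similarity of two codes = overlap of the owner sets holding them."""
    oa, ob = code_owners[a], code_owners[b]
    return len(oa & ob) / len(oa | ob)

# With only three owners every weight here is 0 or 1; on the real data
# (many owners per code) these become fractional.
for a, b in combinations(sorted(code_owners), 2):
    if jaccard(a, b) > 0:
        print(a, b, jaccard(a, b))
```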
Can someone point me towards the most appropriate Python-based clustering functions/libraries to use, please?
The scikit-learn library provides a suite of different clustering algorithms to choose from. Ten of the more popular ones are Affinity Propagation, Agglomerative Clustering, BIRCH, DBSCAN, K-Means, Mini-Batch K-Means, Mean Shift, OPTICS, Spectral Clustering, and Gaussian Mixture models; a short example follows below.
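As a minimal sketch of one of these, assuming scikit-learn >= 1.2 (where the keyword is `metric`; older releases call it `affinity`), you could run Agglomerative Clustering on a precomputed Jaccard distance matrix over owner-membership vectors. The tiny 0/1 matrix here is made up purely for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering

# Rows: product codes; columns: owners; True if that owner holds the code.
codes = ["A123", "B124", "D103", "G123"]
X = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=bool)

# Pairwise Jaccard distances between the owner-vectors of the codes.
D = squareform(pdist(X, metric="jaccard"))

# Average-linkage agglomerative clustering on the precomputed matrix.
model = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                linkage="average")
labels = model.fit_predict(D)
print(dict(zip(codes, labels)))   # e.g. {'A123': 0, 'B124': 0, ...}
```

DBSCAN with metric="jaccard" directly on the boolean matrix is a similar option that avoids materialising the full distance matrix.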
What you have is a bipartite graph. As an initial stab, it sounds like you are going to treat neighbour lists as zero-one vectors between which you define some kind of similarity/correlation; this could be a normalised Hamming distance, for example. Depending on which way you do that, you will obtain a graph on a single domain: either product codes or owners. It will shortly become clear why I've cast everything in the language of graphs, so bear with me.

Now, why do you insist on a Python implementation? Clustering large-scale data is time- and memory-consuming. To let the cat out of the bag: I have written, and still maintain, a graph clustering algorithm that is used quite widely in bioinformatics. It is threaded, accepts weighted graphs, and has been used on graphs with millions of nodes and towards a billion edges. Refer to http://micans.org/mcl/ for more information; a sketch of preparing input for it follows below.

Of course, if you trawl Stack Overflow and Stack Exchange, there are quite a few threads that may be of interest to you. I would recommend the Louvain method as well, except that I am not sure whether it accepts weighted networks, which you will probably produce.
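To illustrate the pipeline suggested above, here is a minimal sketch (the file name, the Jaccard weighting, and the inflation value are my own choices, not prescribed by MCL) that writes product-code similarities in MCL's label ("ABC") input format and then hands them to the mcl command line:

```python
from itertools import combinations

# Map each product code to the set of owners holding it
# (tiny made-up sample; in practice this is built from your real data).
code_owners = {
    "A123": {"Owner1", "Owner4"},
    "B124": {"Owner1"},
    "D103": {"Owner2"},
}

# Write one "codeA<TAB>codeB<TAB>weight" line per edge: MCL's label
# ("ABC") input format. Jaccard overlap of owner sets is one choice
# of weight; with real data most weights will be fractional.
with open("codes.abc", "w") as fh:
    for a, b in combinations(sorted(code_owners), 2):
        oa, ob = code_owners[a], code_owners[b]
        w = len(oa & ob) / len(oa | ob)
        if w > 0:
            fh.write(f"{a}\t{b}\t{w}\n")

# Then, on the command line (the inflation flag -I tunes granularity):
#   mcl codes.abc --abc -I 2.0 -o codes.clusters
# Each line of codes.clusters is one cluster of product codes.
```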
The R language has many packages for finding groups in data, and there are Python bindings to R called RPy. R provides several of the algorithms already mentioned here and is also known for good performance on large datasets.
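RPy has since been superseded by rpy2; purely as a sketch (assuming R and rpy2 are installed, and using R's built-in hclust rather than any particular package), calling R from Python could look like:

```python
import rpy2.robjects as ro

# Toy 4x2 numeric matrix, filled column-major as in R.
m = ro.r['matrix'](ro.FloatVector([0, 0, 5, 5, 0, 1, 5, 6]), nrow=4)
d = ro.r['dist'](m)                 # R dist(): pairwise Euclidean distances
tree = ro.r['hclust'](d)            # R hclust(): agglomerative clustering
groups = ro.r['cutree'](tree, k=2)  # cut the dendrogram into 2 groups
print(list(groups))                 # e.g. [1, 1, 2, 2]
```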