
Have there been any recent breakthroughs in similarity-based text stream clustering algorithms?

I need a lightweight tool for text stream clustering. Lightweight in the sense that it keeps no memory of previous text entries. A text stream here means a continuous feed of alphanumeric, semi-structured sentences or phrases, e.g. the logs of any application. Similarity-based clustering means that the algorithm should group texts that share a similar pattern. For example, text1 = 'aaababac' and text2 = 'aaaaabac' should be grouped together, since only one character differs between them.

The scenario is: text1 arrives first and the algorithm assigns it an index. Then text2 arrives and the algorithm uses the same method to assign it an index. The condition is that the two indexes should be near each other, even though the algorithm has no idea what came earlier while it is processing text2. It is a sort of pattern-similarity-based hashing.

So far I can't find anything useful. The best solution I have found is simhash: http://matpalm.com/resemblance/simhash/
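To make the idea concrete, here is a minimal sketch of a simhash-style fingerprint in Python. It is not taken from the linked page: the character trigram features, the use of MD5 as the per-feature hash, the 64-bit width, and the function names `shingles`, `simhash` and `hamming` are all assumptions chosen for illustration. Each text is fingerprinted on its own, with no state kept between calls, and "near" is measured as the Hamming distance between fingerprints rather than as numeric closeness. Similar texts tend to get fingerprints with a small Hamming distance, but this is a tendency and not a guarantee, especially for very short strings.

```python
import hashlib


def shingles(text, n=3):
    """Return the character n-grams of text (the whole text if it is too short)."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text, n=3, bits=64):
    """Compute a simhash-style fingerprint of text without any stored state.

    Assumptions for this sketch: character n-grams as features, MD5 as the
    per-feature hash, every feature weighted equally.
    """
    counts = [0] * bits
    for gram in shingles(text, n):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit of the fingerprint is set when the votes for it are positive.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    f1 = simhash("aaababac")
    f2 = simhash("aaaaabac")
    print(f"{f1:016x} {f2:016x} distance={hamming(f1, f2)}")
```

Note that grouping fingerprints by Hamming distance still requires comparing a new fingerprint against something, so this only covers the stateless hashing step, not the clustering itself.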

Abinash Koirala asked Dec 31 '25 05:12



1 Answer

The problem is a bit underspecified. If you cannot remember previous entries, how are you going to remember the clusters you have seen? In particular, things are usually only considered a cluster once you have seen a significant number of "similar" items. You cannot do this without at least some "memory" of what is frequent and what isn't. Therefore, there is no reasonable clustering algorithm that truly has no memory. It might not memorize the literal objects, but memorizing summaries is not really that different. Hashing means memorizing at least parts of the previously seen data. But is memorizing a statistically significant random part of the data that much of a benefit over remembering it exactly?

Much of what is published pretends not to memorize anything, but in fact it just memorizes the data differently. As long as it gets published, though, it is considered a success, even if it doesn't work in practice.

Has QUIT--Anony-Mousse answered Jan 03 '26 14:01
