I have a large number (XXM-XXXM) of strings that look like this (a small sample):
I have no idea of all the possible error strings, nor of the permutations thereof. I want to group all similar errors together and generate some statistics showing an error count for each error-string group.
So, essentially, I'd like to group the most similar strings together, and a string can belong to multiple groups.
Thanks!
Disclaimer: I have never solved a problem like this before.
I can think of a couple of ways to approach your problem. One is to compare lines by the words they share:

    set(line1.split()) & set(line2.split())

The number of elements in the resulting set is an indicator of how close these two lines are. A bit of Python code could look like this:
    import fileinput

    CLUSTER_COUNT = 5
    MAX_DISTANCE = 5

    def main():
        clusters = [Cluster() for i in range(CLUSTER_COUNT)]
        for line in fileinput.input():
            words = set(line.split())
            # assign the line to the closest cluster
            cluster = min(clusters, key=lambda c: c.distanceTo(words))
            cluster.addLine(words, line)

        # print out results (FIXME: write clusters to separate files)
        for cluster in clusters:
            print("CLUSTER:", cluster.intersection)
            for line in cluster.lines:
                print(line, end="")
            print("-" * 80)
            print()

    class Cluster(object):
        def __init__(self):
            self.intersection = set()
            self.lines = []

        def distanceTo(self, words):
            # an empty cluster is "maximally far" so it can absorb a new line
            if len(self.intersection) == 0:
                return MAX_DISTANCE
            return len(words) - len(self.intersection & words)

        def addLine(self, words, line):
            self.lines.append(line)
            if len(self.intersection) == 0:
                self.intersection = words
            else:
                # keep only the words shared by every line in the cluster
                self.intersection = self.intersection & words

    if __name__ == '__main__':
        main()
If you run it on your main data, you should end up with a couple of clusters. Note: alter the code to write the clusters to separate files. I think you will want to run the clusters through the code again, recursively, until you find the subsets you're interested in.
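To address the FIXME above, the reporting loop could be replaced with something that writes each cluster's lines to its own file; each file can then be fed back through the script for a recursive, finer-grained pass. This is only a sketch of that idea, and the `cluster_<n>.txt` naming scheme is my own assumption, not from the original code:

```python
def write_clusters(clusters, prefix="cluster"):
    # Sketch: dump each cluster to its own file so the file can be re-run
    # through the clustering script. The "cluster_<n>.txt" naming is an
    # assumed convention, not part of the original answer.
    paths = []
    for i, cluster in enumerate(clusters):
        path = "%s_%d.txt" % (prefix, i)
        with open(path, "w") as f:
            f.write("CLUSTER: %s\n" % sorted(cluster.intersection))
            f.writelines(cluster.lines)  # lines still carry their newlines
        paths.append(path)
    return paths
```

You would call this in place of the print loop, e.g. `write_clusters(clusters)`, and then re-run the script with each `cluster_<n>.txt` as input.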