Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Grouping millions of strings with substitution

Tags:

string

I have a large amount(XXM-XXXM) strings that look like (a small sample):

I have no idea of all the possible error strings, nor of the permutations thereof. I want to group all similar errors together, and generate some statistics showing an error count for each error string group.

So, essentially, I'd like to group the most similar strings together, and strings can belong to multiple groups.

Thanks!

like image 797
John Wu Avatar asked Jun 01 '11 07:06

John Wu


1 Answers

Disclaimer: I have never solved a problem like this before.

I can think of a couple of ways to think of your problem:

  • you are trying to cluster each line to a set of clusters
    • check into datamining algorithms
  • the diff between each line in a cluster will be small, between lines from two different clusters will be rather bigger
  • you could quickly gather similar lines with, by comparing the set intersection of two lines: set(line1.split) & set(line2.split) - the element count in the resulting set is an indicator of how close these two lines are.

A bit of python code could look like this:

import fileinput

CLUSTER_COUNT = 5
MAX_DISTANCE = 5

def main():
    clusters = [Cluster() for i in range(CLUSTER_COUNT)]
    MAXDISTANCE = 3
    for line in  fileinput.input():
        words = set(line.split())
        cluster = sorted(clusters, key=lambda c: c.distanceTo(words))[0]
        cluster.addLine(words, line)

    # print out results (FIXME: write clusters to separate files)
    for cluster in clusters:
        print "CLUSTER:", cluster.intersection
        for line in cluster.lines:
            print line
        print "-" * 80
        print

class Cluster(object):
    def __init__(self):
        self.intersection = set()
        self.lines = []
    def distanceTo(self, words):
        if len(self.intersection) == 0:
            return MAX_DISTANCE 
        return len(words) - len(self.intersection & words)
    def addLine(self, words, line):
        self.lines.append(line) 
        if len(self.intersection) == 0:
            self.intersection = words
        else:
            self.intersection = self.intersection & words

if __name__ == '__main__':
    main()

If you run it on your main data, you should end up with a couple of clusters. Note: alter the code to write the clusters to separate files. I think you will want to run the clusters through the code again, recursively, until you find the subsets you're interested in.

like image 137
Daren Thomas Avatar answered Oct 07 '22 01:10

Daren Thomas