Grouping millions of strings with substitution

Question

I have a large amount(XXM-XXXM) strings that look like (a small sample):

I have no idea of all the possible error strings, nor of the permutations thereof. I want to group all similar errors together, and generate some statistics showing an error count for each error string group.

So, essentially, I'd like to group the most similar strings together, and strings can belong to multiple groups.

Thanks!

Daren Thomas · Accepted Answer

Disclaimer: I have never solved a problem like this before.

I can think of a couple of ways to think of your problem:

you are trying to cluster each line to a set of clusters
- check into datamining algorithms
the diff between each line in a cluster will be small, between lines from two different clusters will be rather bigger
you could quickly gather similar lines with, by comparing the set intersection of two lines: set(line1.split) & set(line2.split) - the element count in the resulting set is an indicator of how close these two lines are.

A bit of python code could look like this:

import fileinput

CLUSTER_COUNT = 5
MAX_DISTANCE = 5

def main():
    clusters = [Cluster() for i in range(CLUSTER_COUNT)]
    MAXDISTANCE = 3
    for line in  fileinput.input():
        words = set(line.split())
        cluster = sorted(clusters, key=lambda c: c.distanceTo(words))[0]
        cluster.addLine(words, line)

    # print out results (FIXME: write clusters to separate files)
    for cluster in clusters:
        print "CLUSTER:", cluster.intersection
        for line in cluster.lines:
            print line
        print "-" * 80
        print

class Cluster(object):
    def __init__(self):
        self.intersection = set()
        self.lines = []
    def distanceTo(self, words):
        if len(self.intersection) == 0:
            return MAX_DISTANCE 
        return len(words) - len(self.intersection & words)
    def addLine(self, words, line):
        self.lines.append(line) 
        if len(self.intersection) == 0:
            self.intersection = words
        else:
            self.intersection = self.intersection & words

if __name__ == '__main__':
    main()

If you run it on your main data, you should end up with a couple of clusters. Note: alter the code to write the clusters to separate files. I think you will want to run the clusters through the code again, recursively, until you find the subsets you're interested in.

Grouping millions of strings with substitution

Tags:

string

John Wu

1 Answers

Daren Thomas

Recent Activity

Donate For Us

Grouping millions of strings with substitution

Tags:

string

John Wu

1 Answers

Daren Thomas

Related questions

Recent Activity

Donate For Us