Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I detect common substrings in a list of strings

Given a set of strings, for example:

EFgreen EFgrey EntireS1 EntireS2 J27RedP1 J27GreenP1 J27RedP2 J27GreenP2 JournalP1Black JournalP1Blue JournalP1Green JournalP1Red JournalP2Black JournalP2Blue JournalP2Green 

I want to be able to detect that these are three sets of files:

  • EntireS[1,2]
  • J27[Red,Green]P[1,2]
  • JournalP[1,2][Red,Green,Blue]

Are there any known ways of approaching this problem - any published papers I can read on this?

The approach I am considering is for each string look at all other strings and find the common characters and where differing characters are, trying to find sets of strings that have the most in common, but I fear that this is not very efficient and may give false positives.

Note that this is not the same as 'How do I detect groups of common strings in filenames' because that assumes that a string will always have a series of digits following it.

like image 629
danio Avatar asked Sep 11 '09 13:09

danio


People also ask

How do you find common substrings?

For every character in string 1 we increment vector index of that character eg: v[s1[i]-'a']++, for every character of string 2 we check vector for the common characters if v[s2[i]-'a'] > 0 then set flag = true and v[s2[i]-'a']– such that one character of string 2 is compared with only one character of string 1.

How do you find the common substring in a list of strings in Python?

To find common substrings between two strings with Python, we can use the difflib module. We have 2 strings string1 and string2 that we want to find the common substring that's in both strings. To do that, we use the SequenceMatcher class with string1 and string2 .

How do you check if a string contains a list of substring?

The in Operator It returns a Boolean (either True or False ). To check if a string contains a substring in Python using the in operator, we simply invoke it on the superstring: fullstring = "StackAbuse" substring = "tack" if substring in fullstring: print("Found!") else: print("Not found!")


2 Answers

I would start here: http://en.wikipedia.org/wiki/Longest_common_substring_problem

There are links to supplemental information in the external links, including Perl implementations of the two algorithms explained in the article.

Edited to add:

Based on the discussion, I still think Longest Common Substring could be at the heart of this problem. Even in the Journal example you reference in your comment, the defining characteristic of that set is the substring 'Journal'.

I would first consider what defines a set as separate from the other sets. That gives you your partition to divide up the data, and then the problem is in measuring how much commonality exists within a set. If the defining characteristic is a common substring, then Longest Common Substring would be a logical starting point.

To automate the process of set detection, in general, you will need a pairwise measure of commonality which you can use to measure the 'difference' between all possible pairs. Then you need an algorithm to compute the partition that results in the overall lowest total difference. If the difference measure is not Longest Common Substring, that's fine, but then you need to determine what it will be. Obviously it needs to be something concrete that you can measure.

Bear in mind also that the properties of your difference measurement will bear on the algorithms that can be used to make the partition. For example, assume diff(X,Y) gives the measure of difference between X and Y. Then it would probably be useful if your measure of distance was such that diff(A,C) <= diff(A,B) + diff(B,C). And obviously diff(A,C) should be the same as diff(C,A).

In thinking about this, I also begin to wonder whether we could conceive of the 'difference' as a distance between any two strings, and, with a rigorous definition of the distance, could we then attempt some kind of cluster analysis on the input strings. Just a thought.

like image 189
Jeremy Bourque Avatar answered Oct 09 '22 22:10

Jeremy Bourque


Great question! The steps to a solution are:

  1. tokenizing input
  2. using tokens to build an appropriate data structure. a DAWG is ideal, but a Trie is simpler and a decent starting point.
  3. optional post-processing of the data structure for simplification or clustering of subtrees into separate outputs
  4. serialization of the data structure to a regular expression or similar syntax

I've implemented this approach in regroup.py. Here's an example:

$ cat | ./regroup.py --cluster-prefix-len=2 EFgreen EFgrey EntireS1 EntireS2 J27RedP1 J27GreenP1 J27RedP2 J27GreenP2 JournalP1Black JournalP1Blue JournalP1Green JournalP1Red JournalP2Black JournalP2Blue JournalP2Green ^D EFgre(en|y) EntireS[12] J27(Green|Red)P[12] JournalP[12](Bl(ack|ue)|(Green|Red)) 
like image 32
Ryan Flynn Avatar answered Oct 09 '22 22:10

Ryan Flynn