Finding repeated character combinations in string

I have a string that holds a very long sentence without any whitespace.

mystring = "abcdthisisatextwithsampletextforasampleabcd"

I would like to find all of the repeated substrings that contain at least 4 characters.

So I would like to achieve something like this:

'text' 2 times
'sample' 2 times
'abcd' 2 times

Since abcd, text and sample each occur twice in mystring, they count as properly matched substrings of at least 4 characters. Note that I am looking for repeated substrings; finding only existing English words is not a requirement.

The answers I found are helpful for finding duplicates in texts with whitespace, but I couldn't find a proper resource that covers strings without any spaces or other whitespace. How can this be done in the most efficient way?

asked Mar 11 '18 17:03 by rihe


1 Answer

Let's go through this step by step. There are several sub-tasks you should take care of:

  1. Identify all substrings of length 4 or more.
  2. Count the occurrence of these substrings.
  3. Filter all substrings with 2 occurrences or more.

You can actually put all of them into a few statements. For understanding, it is easier to go through them one at a time.

The following examples all use

mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4

1. Substrings of a given length

You can easily get substrings by slicing - for example, mystring[4:4+6] gives you the substring from position 4 of length 6: 'thisis'. More generically, you want substrings of the form mystring[start:start+length].
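
As a quick illustration of the slicing rule above, here is a minimal sketch. The variable names piece and short are only for illustration. It also shows why the start and length values need limits: a slice that runs past the end does not fail, it just comes back shorter than requested.

```python
mystring = "abcdthisisatextwithsampletextforasampleabcd"

start, length = 4, 6
piece = mystring[start:start + length]
print(piece)  # thisis

# Slicing past the end of the string does not raise an error;
# it silently returns fewer characters than asked for.
short = mystring[40:40 + 6]
print(short)  # bcd
```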

So what values do you need for start and length?

  • start must...
    • cover all substrings, so it must include the first character: start in range(0, ...).
    • not map to short substrings, so it can stop min_length characters before the end: start in range(..., len(mystring) - min_length + 1).
  • length must...
    • cover the shortest substring of length 4: length in range(min_length, ...).
    • not exceed the remaining string after start: length in range(..., len(mystring) - start + 1).

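The +1 terms are needed because range excludes its upper bound: the last valid start is len(mystring) - min_length, and the longest valid length at a given start is len(mystring) - start. You can put this all together into a single comprehension, with i as start and j as length: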

substrings = [
    mystring[i:i+j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
]
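
To check that the range bounds do what they should, you could run a small sanity check like the sketch below. The helper variable n is introduced here only for the check. With n valid start positions, the comprehension should yield n * (n + 1) // 2 substrings, none shorter than min_length.

```python
mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4

substrings = [
    mystring[i:i + j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
]

n = len(mystring) - min_length + 1    # number of valid start positions
print(n)                              # 40
print(len(substrings))                # 820, i.e. n * (n + 1) // 2
print(min(map(len, substrings)))      # 4
print(substrings[0], substrings[-1])  # abcd abcd
```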

2. Count substrings

Trivially, you want to keep a count for each substring. Keeping anything for each specific object is what dicts are made for. So you should use substrings as keys and counts as values in a dict. In essence, this corresponds to this:

counts = {}
for substring in substrings:
    try:  # increase the count for existing keys, set it for new keys
        counts[substring] += 1
    except KeyError:
        counts[substring] = 1

You can simply feed your substrings to collections.Counter, and it produces something like the above.

>>> import collections
>>> counts = collections.Counter(substrings)
>>> print(counts)
Counter({'abcd': 2, 'abcdt': 1, 'abcdth': 1, 'abcdthi': 1, 'abcdthis': 1, ...})

Notice how the duplicate 'abcd' maps to the count of 2.
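
If you want to convince yourself that both approaches agree, a sketch along these lines compares a hand-written count with collections.Counter. The name manual is just for this example, and dict.get is used here instead of try/except as an equivalent way to write the loop.

```python
import collections

mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4
substrings = [
    mystring[i:i + j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
]

# Hand-written counting with a plain dict.
manual = {}
for substring in substrings:
    manual[substring] = manual.get(substring, 0) + 1

counts = collections.Counter(substrings)

print(dict(counts) == manual)           # True
print(counts["abcd"], counts["abcdt"])  # 2 1
print(counts["nope"])                   # 0, a Counter returns 0 for missing keys
```

One practical difference: looking up a missing key in a Counter gives 0, while the plain dict raises a KeyError.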

3. Filtering duplicate substrings

So now you have your substrings and the count for each. You need to remove the non-duplicate substrings - those with a count of 1.

Python offers several constructs for filtering, depending on the output you want. These also work if counts is a regular dict:

>>> list(filter(lambda key: counts[key] > 1, counts))
['abcd', 'text', 'samp', 'sampl', 'sample', 'ampl', 'ample', 'mple']
>>> {key: value for key, value in counts.items() if value > 1}
{'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}
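
Note that the result also lists pieces of 'sample' such as 'samp' or 'ampl'. If you only want the longest repeats, as in the expected output of the question, one possible extra step is sketched below. This step is an assumption about what is wanted and not part of the three sub-tasks above; the names repeated and maximal are made up for the example.

```python
import collections

mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4

counts = collections.Counter(
    mystring[i:i + j]
    for i in range(len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
)
repeated = {key: value for key, value in counts.items() if value > 1}

# Hypothetical extra step: keep a substring only if no longer
# repeated substring contains it.
maximal = {
    key: value
    for key, value in repeated.items()
    if not any(key != other and key in other for other in repeated)
}
print(maximal)  # {'abcd': 2, 'text': 2, 'sample': 2}
```

Be aware that this simple rule also drops a short substring that occurs more often than the longer one containing it, so it may need adjusting for other inputs.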

Using Python primitives

Python ships with primitives that allow you to do this more efficiently.

  1. Use a generator to build substrings. A generator builds its members on the fly, so you never actually have them all in memory. For your use case, you can use a generator expression:

     substrings = (
         mystring[i:i+j]
         for i in range(0, len(mystring) - min_length + 1)
         for j in range(min_length, len(mystring) - i + 1)
     )
    
  2. Use a pre-existing Counter implementation. Python comes with a dict-like container that counts its members: collections.Counter can directly digest your substring generator. Especially in newer Python versions, this is much more efficient.

     counts = collections.Counter(substrings)
    
  3. You can exploit Python's lazy filters to only ever inspect one substring at a time. The filter builtin or another generator expression can produce one result at a time without storing them all in memory.

     for substring in filter(lambda key: counts[key] > 1, counts):
         print(substring, 'occurs', counts[substring], 'times')
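
Put together, the three primitives could be wrapped in one function, roughly like the sketch below. The function name repeated_substrings and its signature are made up for this example.

```python
import collections


def repeated_substrings(text, min_length=4):
    """Return an iterator of (substring, count) pairs for repeated substrings."""
    # Lazily generate every substring of at least min_length characters.
    substrings = (
        text[i:i + j]
        for i in range(len(text) - min_length + 1)
        for j in range(min_length, len(text) - i + 1)
    )
    counts = collections.Counter(substrings)
    # Lazily keep only the substrings that occur more than once.
    return ((key, value) for key, value in counts.items() if value > 1)


mystring = "abcdthisisatextwithsampletextforasampleabcd"
for substring, count in repeated_substrings(mystring):
    print(substring, "occurs", count, "times")
```

Keep in mind that the generator only avoids the intermediate list; the Counter still holds every distinct substring, so memory use grows quickly for very long strings. Overlapping occurrences are counted as well, for example 'aaaa' occurs twice in 'aaaaa'.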
    
answered Nov 15 '22 09:11 by MisterMiyagi