Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a fast algorithm to remove repeated substrings in a string?

There is a string like it

"dxabcabcyyyydxycxcxz"

and I want to merge it into

"dxabcydxycxz"

Other examples: "ddxddx" -> "dxdx" , "abbab" -> "abab".

The rule is that :

if (adjacent and same): merge

# Such as 'abc', they are same, so delete one of them
# Although 'dx' is same as 'dx', they are nonadjacent, so do not delete any of them
# If one character has been deleted, don't delete any substring, include it

I've done it in Python, but it's slow when applied to a long string.

# Original string
mystr = "dxabcabcyyyydxycxcxz"
str_len = len(mystr)
vis = [1] * str_len  # Use a list to mark which char is deleted

# Enumerate the size of substring
for i in range(1,str_len):
    # Enumerate the begin of the substring
    for j in range(0, str_len):
        offset = 2 #the size of sub-str + 1
        current_sub_str = mystr[j:j+i]
        s_begin = j+i*(offset-1)
        s_end = j+(i*offset)
        # Delete all of the same char
        while((j+(i*offset) <= str_len) and current_sub_str == mystr[s_begin:s_end]
              and 0  not in vis[s_begin:s_end] and 0  not in vis[j:j+i]):
            vis[s_begin:s_end] = [0] * (s_end - s_begin)  # If it was deleted, mark it as 0
            offset += 1
            s_begin = j + i * (offset - 1)
            s_end = j + (i * offset)

res = []
for i in range(0,str_len):
    if(vis[i]!=0): res.append(mystr[i])

print "".join(res)
   

Is there any faster way to solve it?

Update April 29, 2017

Sorry, it seems to be like an XY problem. On the other hand, it maybe not. There is the content I was coding for a web spider and got many 'tag-path's like those:

ul/li/a
ul/li/div/div/div/a/span
ul/li/div/div/div/a/span 
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a

As you see, some of the 'tag-path's are identical, so I wanted to collapse them to find out if there is any other 'tag-path's with the same structure.

After collapsing, I get the 'tag-path' like this.

ul/li/a
ul/li/div/div/div/a/span
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a

This is only my idea and I didn't know whether it is suitable to do this way. (After trying, I chose another way to do it).

However there is an interesting question like an ACM question.

So, I simplify one 'tag-path' to a character and ask for help. Because I didn't do a fast way by myself. Actually, the question has many corner cases that I don't mind and thank all for helping me to complete it.

Thanks all.


like image 353
haomao Avatar asked Apr 28 '17 09:04

haomao


1 Answers

Behold the power of regex:

>>> import re

>>> re.sub(r"(.+?)\1+", r"\1", "dxabcabcyyyydxycxcxz")
'dxabcydxycxz'

>>> re.sub(r"(.+?)\1+", r"\1", "ddxddx")
'dxdx'

>>> re.sub(r"(.+?)\1+", r"\1", "abbab")
'abab'

This looks for a sequence of 1 or more arbitrary characters (.+?) (as a non-greedy match, so that it tries shorter sequences first), followed by 1 or more repetitions of the matched sequence \1+, and replaces it all with just the matched sequence \1.

like image 116
jasonharper Avatar answered Oct 12 '22 15:10

jasonharper