I need to use the Sequitur algorithm for a bioinformatics project but so far I haven't found any Python implementations of the algorithm. This is what I've got so far:
def new_rule(sequence,rules,x,y,z):
y = list(y)
if max(y) > 1:
ind = y.index(max(y))
locations = np.zeros(max(y))
counter = 0
for i, j in enumerate(z):
if j == x[ind]:
locations[counter] = i
counter+=1
new_rule = 'A'+str(len(rules)+1)
rules.append([new_rule,x[ind]])
return sequence.replace(x[ind],new_rule), rules
def word_count(sequence):
return len(sequence)-sequence.count(' ')
def recursive_sequitur(sequence,rules):
count_down = word_count(sequence)-1
x,y,z = n_grams(sequence,count_down)
while count_down>1:
if len(y) > 1 and max(y) > 1:
sequence, rules = new_rule(sequence,rules,x,y,z)
x,y,z = n_grams(sequence,count_down)
else:
count_down-=1
x,y,z = n_grams(sequence,count_down)
return sequence, rules
Have you seen Parallel Sequitur for Python?
We implemented serial and parallel versions of the Sequitur compression algorithm. The Sequitur algorithm uses hierarchical structure and sequences of discrete symbols to compress files by exploiting repetative structures found in strings.
concat_parallel.py: simple parallel implementation of the Sequitur algorithm. Text is split among workers which run a version of the serial algorithm before the main process merges all results in order.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With