Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String analysis

Given a sequence of operations:

a*b*a*b*a*a*b*a*b

is there a way to get the optimal subdivision to enable reusage of substring.

making

a*b*a*b*a*a*b*a*b => c*a*c, where c = a*b*a*b

and then seeing that

a*b*a*b => d*d, where d = a*b

all in all reducing the 8 initial operations into the 4 described here?

(c = (d = a*b)*d)*a*c

The goal of course is to minimize the number of operations

I'm considering a suffixtree of sorts.

I'm especially interested in linear time heuristics or solutions. The '*' operations are actually matrix multiplications.

like image 671
Martin Kristiansen Avatar asked May 12 '11 09:05

Martin Kristiansen


People also ask

What is a STRING analysis?

STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases.

What is STRING analysis used for?

In molecular biology, STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a biological database and web resource of known and predicted protein–protein interactions.

What is STRING used for bioinformatics?

STRING is a proteomic database focusing on the networks and interactions of proteins in a wide array of species. STRING allows for the searching of one or multiple proteins at a time with the ability to additionally limit the search to the desired species.


Video Answer


1 Answers

This whole problem is known as "Common Subexpression Elimination" or CSE. It is a slightly smaller version of the problem called "Graph Reduction" faced by the implementer of compilers for functional programming languages. Googling "Common Subexpression elimination algorithm" gives lots of solutions, though none that I can see especially for the constraints given by matrix multiplication.

The pages linked to give a lot of references.

My old answer is below. However, having researched a bit more, the solution is simply building a suffix tree. This can be done in O(N) time (lots of references on the wikipedia page). Having done this, the sub-expressions (c, d etc. in your question) are just nodes in the suffix tree - just pull them out.


However, I think MarcoS is on to something with the suggestion of Longest repeating Substring, as graph reduction precedence might not allow optimisations that can be allowed here.

sketch of algorithm:

optimise(s):     let sub = longestRepeatingSubstring(s).     optimisedSub = optimise(sub)     return s with sub replaced by optimisedSub 

Each run of longest repeating substring takes time N. You can probably re-use the suffix tree you build to solve the whole thing in time N.

like image 75
Nick Fortescue Avatar answered Sep 22 '22 04:09

Nick Fortescue