Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to perform a sorting according to rules but with repetition of items to solve circular references?

To explain in a clearer way my question I will start by explaining the real-life case I am facing.

I am building a physical panel with many words on it that can be selectively lit, in order to compose sentences. This is my situation:

  1. I know all the sentences that I want to display
  2. I want to find out [one of] the shortest set of ORDERED words that allows me to display all the sentences

Example:

 SENTENCES:
 "A dog is on the table"
 "A cat is on the table"
 SOLUTIONS:
 "A dog cat is on the table"
 "A cat dog is on the table"

I tried to approach this problem with "positional rules" finding for each UNIQUE word in the set of ALL the words used in ALL the sentences, what words should be at the left or at the right of it. In the example above, the ruleset for the "on" word would be "left(A, dog, cat, is) + right(the, table).

This approach worked for trivial cases, but my real-life situation has two additional difficulties that got me stuck and that have both to do with the need for repeating words:

  1. In-sentence repetitions: "the cat is on the table" has two "the".
  2. Circular references: In a set of three sentences "A red cat" + "My cat is on the table" + "That table is red", the rules would state that RED should be at the left of CAT, CAT should be at the left of TABLE and TABLE should be at the left of RED.

MY QUESTION THEREFORE IS:

What is the class of algorithms (or even better: what is the specific algorithm) that studies and solves this kind of problems? Could you post some reference or a code example of it?

EDIT: Level of complexity

From the first round of answers it appears the actual level of complexity (i.e. how different are the sentences one from the other) is an important factor. So, here comes some info on that:

  1. I have about 1500 sentences I want to represent.
  2. All of the sentences are essentially modifications of a restricted pool of ~10 sentences where only a few words change. Building on the previous example, it's a bit like all my sentences would speak about either "somebody's pet's position relative to a piece of furniture" or "a physical description of somebody's furniture".
  3. The number of unique words used to build all the sentences is <100.
  4. Sentences are 8 words long at most.

For this project I am using python, but any language reasonably readable (eg: NOT obfuscated perl!) will be fine.

Thank you in advance for your time!

like image 456
mac Avatar asked Apr 26 '11 01:04

mac


People also ask

How do I iterate a circular reference in Excel?

For circular formulas to work, you must enable iterative calculations in your Excel workbook. In Excel 2019, Excel 2016, Excel 2013, and Excel 2010, click File > Options, go to Formulas, and select the Enable iterative calculation check box under the Calculation options section.

How can I find a circular reference in multiple Excel sheets?

Manually detect Circular References in ExcelGo to tab 'Formulas', choose 'Error-checking' and 'Circular References'. Excel will show you exactly in which cell(s) circular references are detected.


2 Answers

If I understand you correctly, this is equivalent to the shortest common supersequence problem. This problem is NP-complete, but there exists approximation algorithms. Google turns up a few papers, including this one.

The problem can be solved with a simple DP algorithm in the case of two input sequences, but this doesn't generalize to multiple sequences since each sequence essentially requires you to add a dimension to the DP table which results in the exponential blow-up.

like image 114
hammar Avatar answered Sep 22 '22 15:09

hammar


I'm a bioinformatician, and this sounds like it could be solved by doing a global multiple sequence alignment of all the sentences with infinite mismatch penalties (i.e. disallow mismatches entirely) and modest gap penalties (i.e. allow gaps, but prefer fewer gaps), and then reading off the gapless consensus sequence.

If this formulation is equivalent to your problem, then that means your problem is indeed NP-complete, since multiple sequence alignment is NP-complete, although there are many heuristic algorithms that run in reasonable time. Unfortunately, most MSA algorithms are designed to work on characters of DNA or protein sequences, not words of English.

Example

Here is an example of the kind of alignment that I describe, using the set of three sentences given by the OP. I don't know if the alignment that I give is optimal, but it is one possible solution. Gaps are indicated by a series of dashes.

Sentence 1: ---- -- A red cat -- -- --- ----- -- ---
Sentence 2: ---- My - --- cat is on the table -- ---
Sentence 3: That -- - --- --- -- -- --- table is red
Consensus:  That My A red cat is on the table is red

One advantage of this method is that the alignment not only gives you the full sequence of words, but shows which words belong in which sentences.

like image 45
Ryan C. Thompson Avatar answered Sep 24 '22 15:09

Ryan C. Thompson