How to perform a sorting according to rules but with repetition of items to solve circular references?

Tags: python, language-agnostic, algorithm, sorting, circular-reference

To make my question clearer, I will start by describing the real-life case I am facing.

I am building a physical panel with many words on it that can be selectively lit, in order to compose sentences. This is my situation:

1. I know all the sentences that I want to display.
2. I want to find [one of] the shortest ORDERED lists of words that allows me to display all the sentences.

Example:

 SENTENCES:
 "A dog is on the table"
 "A cat is on the table"
 SOLUTIONS:
 "A dog cat is on the table"
 "A cat dog is on the table"
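
A candidate solution can be checked mechanically: a panel works if every sentence is a subsequence of it (the same words, in the same order, with gaps allowed). A minimal sketch, where the function name `displays` is only illustrative:

```python
def displays(panel, sentence):
    """Return True if `sentence` is a subsequence of `panel` (both are lists of words)."""
    remaining = iter(panel)
    # `word in remaining` advances the iterator until it finds `word`,
    # so each match has to lie to the right of the previous one.
    return all(word in remaining for word in sentence)


sentences = ["A dog is on the table", "A cat is on the table"]
panel = "A dog cat is on the table".split()

# True only if every sentence can be lit on this panel.
print(all(displays(panel, s.split()) for s in sentences))
```

The check is case-sensitive and treats each whitespace-separated token as a word; both are assumptions of this sketch.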

I tried to approach this problem with "positional rules": for each UNIQUE word in the set of ALL the words used in ALL the sentences, I work out which words must be to its left and which to its right. In the example above, the ruleset for the word "on" would be "left(A, dog, cat, is) + right(the, table)".

This approach worked for trivial cases, but my real-life situation has two additional difficulties that got me stuck, both of which come from the need to repeat words:

1. In-sentence repetitions: "the cat is on the table" has two "the".
2. Circular references: In a set of three sentences "A red cat" + "My cat is on the table" + "That table is red", the rules would state that RED should be at the left of CAT, CAT should be at the left of TABLE and TABLE should be at the left of RED.
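
Both difficulties can be reproduced with a small sketch of the positional-rule idea. It assumes Python 3.9+ for the standard-library `graphlib` module; the helper names `precedence_rules` and `order_without_repeats` are made up for illustration. A topological sort yields an order when no word needs repeating and fails with a cycle otherwise:

```python
from graphlib import CycleError, TopologicalSorter


def precedence_rules(sentences):
    """Map each word to the set of words that must appear somewhere to its left."""
    must_precede = {}
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            must_precede.setdefault(word, set()).update(words[:i])
    return must_precede


def order_without_repeats(sentences):
    """Return one word order that satisfies every rule, or None if the rules conflict."""
    try:
        return list(TopologicalSorter(precedence_rules(sentences)).static_order())
    except CycleError:
        # Some word would have to sit to the left of itself: a repetition is needed.
        return None


# Works: no word has to be repeated.
print(order_without_repeats(["A dog is on the table", "A cat is on the table"]))

# Fails: red < cat < table < red is a cycle.
print(order_without_repeats(["A red cat", "My cat is on the table", "That table is red"]))

# Fails: the second "the" makes "the" its own predecessor.
print(order_without_repeats(["the cat is on the table"]))
```

This only detects that repetition is required; it does not say which word to repeat or where.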

MY QUESTION THEREFORE IS:

What is the class of algorithms (or even better: what is the specific algorithm) that studies and solves this kind of problem? Could you post a reference or a code example?

EDIT: Level of complexity

From the first round of answers it appears that the actual level of complexity (i.e. how different the sentences are from one another) is an important factor. So, here is some information on that:

1. I have about 1500 sentences I want to represent.
2. All of the sentences are essentially modifications of a restricted pool of ~10 sentences where only a few words change. Building on the previous example, it is as if all my sentences were about either "somebody's pet's position relative to a piece of furniture" or "a physical description of somebody's furniture".
3. The number of unique words used to build all the sentences is <100.
4. Sentences are 8 words long at most.

For this project I am using Python, but any reasonably readable language (e.g. NOT obfuscated Perl!) will be fine.

Thank you in advance for your time!

asked Apr 26 '11 01:04

mac

2 Answers

If I understand you correctly, this is equivalent to the shortest common supersequence problem. This problem is NP-complete, but there exist approximation algorithms. Google turns up a few papers, including this one.
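
As a rough illustration of the heuristic route (not the algorithm from any particular paper), here is a sketch of a greedy "majority merge" style heuristic, weighted by how many words each sentence still has left. The function name `majority_merge` and the tie-breaking rule are assumptions of this sketch. The result is always a valid common supersequence, but it is not guaranteed to be the shortest one:

```python
def majority_merge(sentences):
    """Greedily build a common supersequence (a list of words) for all sentences."""
    remaining = [s.split() for s in sentences]
    remaining = [words for words in remaining if words]
    panel = []
    while remaining:
        # Weight each candidate front word by the total number of words
        # still pending in the sentences that are waiting for it.
        weights = {}
        for words in remaining:
            weights[words[0]] = weights.get(words[0], 0) + len(words)
        word = max(weights, key=weights.get)  # ties: first candidate seen wins
        panel.append(word)
        # Consume the chosen word from every sentence that starts with it.
        remaining = [w[1:] if w[0] == word else w for w in remaining]
        remaining = [w for w in remaining if w]
    return panel


print(" ".join(majority_merge(["A dog is on the table", "A cat is on the table"])))
print(" ".join(majority_merge(["A red cat", "My cat is on the table", "That table is red"])))
```

With sentences that are small variations of a few templates, a greedy pass like this may already come close, and its output can serve as an upper bound for a more careful search.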

The problem can be solved with a simple DP algorithm in the case of two input sequences, but this doesn't generalize to multiple sequences, since each additional sequence essentially adds a dimension to the DP table, which results in an exponential blow-up.
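
For reference, the two-sequence case can be sketched as below. This is the standard textbook dynamic programme, written over lists of words rather than characters; the function name is illustrative. The table has (len(a) + 1) × (len(b) + 1) cells, which is exactly the part that grows by one dimension per extra sentence:

```python
def shortest_common_supersequence(a, b):
    """Return one shortest common supersequence of the word lists `a` and `b`."""
    n, m = len(a), len(b)
    # length[i][j] = length of the shortest common supersequence of a[i:] and b[j:]
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n:
                length[i][j] = m - j          # only b is left
            elif j == m:
                length[i][j] = n - i          # only a is left
            elif a[i] == b[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = 1 + min(length[i + 1][j], length[i][j + 1])

    # Walk the table from the top-left corner to rebuild one optimal answer.
    result, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            result.append(a[i])
            i, j = i + 1, j + 1
        elif length[i + 1][j] <= length[i][j + 1]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    return result + a[i:] + b[j:]


panel = shortest_common_supersequence(
    "A dog is on the table".split(), "A cat is on the table".split()
)
print(" ".join(panel))
```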

answered Sep 22 '22 15:09

hammar

I'm a bioinformatician, and this sounds like it could be solved by doing a global multiple sequence alignment of all the sentences with infinite mismatch penalties (i.e. disallow mismatches entirely) and modest gap penalties (i.e. allow gaps, but prefer fewer gaps), and then reading off the gapless consensus sequence.

If this formulation is equivalent to your problem, then that means your problem is indeed NP-complete, since multiple sequence alignment is NP-complete, although there are many heuristic algorithms that run in reasonable time. Unfortunately, most MSA algorithms are designed to work on characters of DNA or protein sequences, not words of English.

Example

Here is an example of the kind of alignment that I describe, using the set of three sentences given by the OP. I don't know if the alignment that I give is optimal, but it is one possible solution. Gaps are indicated by a series of dashes.

Sentence 1: ---- -- A red cat -- -- --- ----- -- ---
Sentence 2: ---- My - --- cat is on the table -- ---
Sentence 3: That -- - --- --- -- -- --- table is red
Consensus:  That My A red cat is on the table is red

One advantage of this method is that the alignment not only gives you the full sequence of words, but shows which words belong in which sentences.
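
Once a consensus sequence is available, the rows of such an alignment can be recovered by matching each sentence against it from left to right. A minimal sketch, assuming the consensus really does contain every sentence as a subsequence; `lit_positions` and `alignment_row` are illustrative names:

```python
def lit_positions(panel, sentence):
    """Indices of the panel words to light for `sentence`, or None if it does not fit."""
    positions, start = [], 0
    for word in sentence:
        try:
            # Leftmost occurrence of `word` at or after `start`.
            index = panel.index(word, start)
        except ValueError:
            return None
        positions.append(index)
        start = index + 1
    return positions


def alignment_row(panel, sentence):
    """Render one alignment row: lit words as text, unlit words as dashes."""
    positions = lit_positions(panel, sentence)
    if positions is None:
        raise ValueError("sentence cannot be displayed on this panel")
    lit = set(positions)
    return " ".join(w if i in lit else "-" * len(w) for i, w in enumerate(panel))


consensus = "That My A red cat is on the table is red".split()
for s in ["A red cat", "My cat is on the table", "That table is red"]:
    print(alignment_row(consensus, s.split()))
```

On a physical panel, the indices returned by `lit_positions` are what would drive the individual lights.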

answered Sep 24 '22 15:09

Ryan C. Thompson
