Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

remove fragments in a sentence [puzzle]

Question:

Write a program to remove fragment that occur in "all" strings,where a fragment is 3 or more consecutive word.

Example:

Input::

s1 = "It is raining and I want to drive home.";

s2 = "It is raining and I want to go skiing.";

s3 = "It is hot and I want to go swimming.";

Output::

s1 = "It is raining drive home.";

s2 = "It is raining go skiing.";

s3 = "It is hot go swimming.";

Removed fragment = "and i want to"

The program will be tested again large files. Efficiency will be taken into consideration.

Assumptions: Ignore capitalization ,punctuation. but preserve in output.

Note: Take care of cases like

a a a a a b c b c b c b c where removing would create more fragments.

My Solution: (which i think is not the most efficient)

  1. Hash three word phrases into an int and store them in an array, for all strings. reduces to array of numbers like

    1 2 3 4 5
    3 5 7 9 8
    9 3 1 7 9

Problem reduces to intersection of arrays.

sort the arrays. (k * nlogn)

keep k pointers. if all equal match found. else increment the pointer pointing to least value. To solve for the Note above. I was thinking of doing a lazy delete, i.e mark phrases for deletion and delete at the end.

Are there cases where my solution might not work? Can we optimize my solution/ find the best solution ?

like image 869
Kshitij Banerjee Avatar asked Feb 02 '12 14:02

Kshitij Banerjee


2 Answers

First observation: replace each word with a single "letter" in a big alphabet(i.e. hash the worlds in some way), remove whitespaces and punctuation.

Now you have the problem reduced to remove the longest letter sequence that appears in all words of a given list. So you have to compute the longest common substring for a set of "words". You find it using a generalized suffix tree as this is the most efficient algorithm. This should do the trick and I believe has the best possible complexity.

like image 130
Ivaylo Strandjev Avatar answered Sep 19 '22 10:09

Ivaylo Strandjev


The first step is as already suggested by izomorphius:

Replace each word with a single "letter" in a big alphabet(i.e. hash the worlds in some way), remove whitespaces and punctuation.

For the second you don't need to know the longest common substring - you just want to erase it from all the strings. Note that this is equivalent to erasing all common substrings of length exactly 3, because if you have a longer commmon substring, then its substrings with length 3 are also common. To do that you can use a hash table (storing key value pairs).

Just iterate over the first string and put all it's 3-substrings into the hash table as keys with values equal to 1. Then iterate over the second string and for each 3-substring x if x is in the hash table and its value is 1, then set the value to 2. Then iterate over the third string and for each 3-substring x, if x is in the hash table and its value is 2, then set the value to 3. ...and so on. At the end the keys that have the value of k are the common 3-substrings.

Now just iterate once more over all the strings and remove those 3-substrings that are common.

like image 42
Petar Ivanov Avatar answered Sep 19 '22 10:09

Petar Ivanov