Find combination of string sentences - Combinations of frequency tables to target frequency table

The problem is as follows.

I have a list of sentences, for example a list of 1000 sentences.

I would like to find a combination of sentences that matches (or matches as closely as possible) a certain frequency table:

[a:100, b:80, c:90, d:150, e:100, f:100, g:47, h:10 ..... z:900]

I thought about generating all possible combinations from the sentence list (so comb(1000, 1) through comb(1000, 1000)) and then comparing every combination with the target frequency table, so that the distance is minimal. That is: sum the frequency tables of all sentences in a candidate combination, compare this sum with the target, and record the combination with the smallest difference from the target. There could be multiple combinations that match equally closely.

The problem is that calculating all combinations takes far too long to complete, apparently a couple of days. Is there a known algorithm that could solve this efficiently? Ideally within a couple of minutes at most?

Input sentences:

More RVs were seen in the storage lot than at the campground.

She did her best to help him. There have been days when I wished to be separated from my body, but today wasn’t one of those days.

The swirled lollipop had issues with the pop rock candy.

The two walked down the slot canyon oblivious to the sound of thunder in the distance.

Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.

He is no James Bond; his name is Roger Moore.

The tumbleweed refused to tumble but was more than willing to prance.

She was disgusted he couldn’t tell the difference between lemonade and limeade.

He didn’t want to go to the dentist, yet he went anyway.

Find combination of sentences that match the following frequency table closest:

[a:5, b:5, c:5, d:5, e:5, f:5, g:5, h:5 ..... z:5]

Example:

The frequency table of the sixth sentence

He is no James Bond; his name is Roger Moore.

is [a:2, b:1, d:1, e:5, g:1, h:2, i:3, j:1, m:3, n:3, o:5, r:3, s:4]

The frequency table treats upper- and lower-case letters as equal and excludes special characters.
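As a sanity check, here is one way such a table can be computed in Python (note that 'h' occurs in both "He" and "his"):

```python
from collections import Counter

sentence = 'He is no James Bond; his name is Roger Moore.'

# Keep letters only, ignore case, then count frequencies
freqs = Counter(c for c in sentence.lower() if c.isalpha())

print(dict(sorted(freqs.items())))
# {'a': 2, 'b': 1, 'd': 1, 'e': 5, 'g': 1, 'h': 2, 'i': 3, 'j': 1, 'm': 3, 'n': 3, 'o': 5, 'r': 3, 's': 4}
```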

BigChief asked Dec 04 '21 11:12

2 Answers

The problem could only be considered solved if someone could find, from sentences like the following, a combination whose totals are 3 c, 3 a, 3 b, 3 d (or 30 c, 30 a, 30 b, 30 d), within 5% above or below:

S1: aaaaaaaaaaaaaaaaaa bbbbbb c
S2: aaaaaaaa bbbbbbbb d
S3: aaaaaaaaaaa bbbbbbbbb c dd
S4: aaaaaaaaaa bbbbbbbb 

Be realistic: in general there is no exact solution. This is not a matter of NP-hard or NP-complete; there is simply no combination that hits the target exactly. The number of occurrences of some letters in a sentence (for example vowels like i or a) is not comparable to that of others (like x or w). We can only find best matches, as in the code provided here, or change the requirement. I tried to solve this with the knapsack algorithm, with Euclidean distance, and with standard deviation, but none of them gives an exact answer, since no combination of sentences has exactly the required letter counts.
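As a minimal sketch of the Euclidean-distance scoring mentioned above (the function names and toy data are made up for illustration), one can measure how far a combination's letter counts are from the target:

```python
from collections import Counter
from math import sqrt
from string import ascii_lowercase

def letter_counts(text):
    """Count letters only, case-insensitively."""
    return Counter(c for c in text.lower() if c.isalpha())

def euclidean_distance(counts, target):
    """Euclidean distance between two frequency tables over a-z."""
    return sqrt(sum((counts[c] - target[c]) ** 2 for c in ascii_lowercase))

# Hypothetical toy target: 3 a, 3 b, 3 c, 3 d
target = Counter({'a': 3, 'b': 3, 'c': 3, 'd': 3})

combo1 = letter_counts('aaa bbb c d')      # close, but short on c and d
combo2 = letter_counts('aaa bbb ccc ddd')  # exact match

print(euclidean_distance(combo1, target))  # sqrt(4 + 4) = 2.828...
print(euclidean_distance(combo2, target))  # 0.0
```

A distance of 0 means a perfect match; for real sentence lists, the best reachable distance is usually strictly positive, which is the point of this answer.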

Majid Hajibaba answered Oct 29 '22 14:10


A greedy algorithm

Your first idea, testing all the possible combinations of sentences, is too slow. If you have n sentences, then there are 2**n (2 to the power of n) possible combinations of sentences. For instance with n=1000, there are 2**1000 ≈ 10**301 possible combinations. That's a 1 followed by 301 zeroes: more than the number of particles in the universe, and more than the number of different possible games of chess!
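The count can be checked directly: summing comb(n, k) over all subset sizes k gives exactly 2**n - 1 non-empty combinations.

```python
from math import comb

n = 1000

# Number of non-empty subsets of n sentences: sum of C(n, k) for k = 1..n
total = sum(comb(n, k) for k in range(1, n + 1))
assert total == 2 ** n - 1

# 2**1000 has 302 decimal digits, i.e. roughly 10**301
print(len(str(2 ** 1000)))  # 302
```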

Here is a suggestion for a greedy algorithm. It's not particularly optimised, and its running time is O(k * n**2), where n is the number of sentences and k is the length of the longest sentence.

The idea is the following:

  • Assign to each sentence the score (number of useful characters) - (number of superfluous characters). For instance, if a sentence contains 20 'a' and the target requires only 15 'a', we count 15 useful 'a' and 5 superfluous 'a', so the character 'a' contributes 15 - 5 = 10 to the score of that sentence;
  • Add the sentence with the highest score to the result;
  • Update the target to remove the characters that are already in the result;
  • Update the score of every sentence to reflect the updated target;
  • Loop until no sentence has a positive score.

I was too lazy to implement it in C++, so here it is in Python, using a max-heap and a Counter. After the code I wrote a quick explanation to help you translate it into C++.

from collections import Counter
import heapq

sentences = ['More RVs were seen in the storage lot than at the campground.', 'She did her best to help him.', 'There have been days when I wished to be separated from my body, but today wasn’t one of those days.', 'The swirled lollipop had issues with the pop rock candy.', 'The two walked down the slot canyon oblivious to the sound of thunder in the distance.', 'Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.', 'He is no James Bond; his name is Roger Moore.', 'The tumbleweed refused to tumble but was more than willing to prance.', 'She was disgusted he couldn’t tell the difference between lemonade and limeade.', 'He didn’t want to go to the dentist, yet he went anyway.']

target = Counter('abcdefghijklmnopqrstuvwxyz' * 10)
print(target)
# Counter({'a': 10, 'b': 10, 'c': 10, 'd': 10, 'e': 10, 'f': 10, 'g': 10, 'h': 10, 'i': 10, 'j': 10, 'k': 10, 'l': 10, 'm': 10, 'n': 10, 'o': 10, 'p': 10, 'q': 10, 'r': 10, 's': 10, 't': 10, 'u': 10, 'v': 10, 'w': 10, 'x': 10, 'y': 10, 'z': 10})

counts = [Counter(''.join(filter(str.isalpha, s)).lower()) for s in sentences]  # remove punctuation, spaces, uncapitalize, then count frequencies

def get_score(sentence_count, target):
    return sum((sentence_count & target).values()) - sum((sentence_count - target).values())

candidates = []
for sentence, count in zip(sentences, counts):
    score = get_score(count, target)
    candidates.append((-score, sentence, count))

heapq.heapify(candidates)    # order candidates by score
                             # python's heapq only handles min-heap
                             # but we need a max-heap
                             # so I added a minus sign in front of every score

selection = []
while candidates and candidates[0][0] < 0:  # while there is a candidate with positive score
    score, sentence, count = heapq.heappop(candidates)  # greedily selecting best candidate
    selection.append(sentence)
    target = target - count                             # update target by removing characters already accounted for
    candidates = [(-get_score(c,target), s, c) for _,s,c in candidates]  # update scores of remaining candidates
    heapq.heapify(candidates)                       # reorder candidates according to new scores

# HERE ARE THE SELECTED SENTENCES:
print(selection)
# ['Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.', 'There have been days when I wished to be separated from my body, but today wasn’t one of those days.']

# HERE ARE THE TOTAL FREQUENCIES FOR THE SELECTED SENTENCES:
final_frequencies = Counter(filter(str.isalpha, ''.join(selection).lower()))
print(final_frequencies)
# Counter({'e': 22, 't': 15, 'a': 12, 'h': 11, 's': 10, 'o': 10, 'n': 10, 'd': 10, 'i': 9, 'r': 8, 'y': 7, 'm': 5, 'w': 5, 'c': 4, 'b': 4, 'f': 3, 'l': 3, 'g': 2, 'p': 2, 'v': 2, 'u': 2, 'z': 1})

# CHARACTERS IN EXCESS:
target = Counter('abcdefghijklmnopqrstuvwxyz' * 10)
print(final_frequencies - target)
# Counter({'e': 12, 't': 5, 'a': 2, 'h': 1})

# CHARACTERS IN DEFICIT:
print(target - final_frequencies)
# Counter({'j': 10, 'k': 10, 'q': 10, 'x': 10, 'z': 9, 'g': 8, 'p': 8, 'u': 8, 'v': 8, 'f': 7, 'l': 7, 'b': 6, 'c': 6, 'm': 5, 'w': 5, 'y': 3, 'r': 2, 'i': 1})

Explanations:

  • Python's Counter() transforms a sentence into a map character -> frequency;
  • For two Counters a and b, a & b is multiset-intersection, and a - b is multiset-difference;
  • For a Counter a, sum(a.values()) is the total count (the sum of all frequencies);
  • heapq.heapify transforms a list into a min-heap, which is a data structure that allows easy access to the element with minimum score. We actually want the sentence with maximum score, not minimum, so I replaced all the scores with negative numbers.
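The Counter operations used above can be illustrated on a toy example:

```python
from collections import Counter

a = Counter('banana')   # {'a': 3, 'n': 2, 'b': 1}
b = Counter('bandana')  # {'a': 3, 'n': 2, 'b': 1, 'd': 1}

print(a & b)            # multiset intersection: Counter({'a': 3, 'n': 2, 'b': 1})
print(a - b)            # multiset difference: Counter() (negative counts are dropped)
print(b - a)            # Counter({'d': 1})
print(sum(a.values()))  # 6, the total number of characters in 'banana'
```

Note that Counter subtraction never produces negative counts, which is exactly why `target - count` in the main loop cleanly removes the characters already accounted for.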

Non-optimality of the greedy algorithm

I should mention that this greedy algorithm is an approximation algorithm. At every iteration, it chooses the sentence with the highest score; but there is no guarantee that the optimal solution actually contains that sentence.

It is easy to build an example where the greedy algorithm fails to find the optimal solution:

target = Counter('abcdefghijklmnopqrstuvwxyz')
print(target)
# Counter({'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1, 'f': 1, 'g': 1, 'h': 1, 'i': 1, 'j': 1, 'k': 1, 'l': 1, 'm': 1, 'n': 1, 'o': 1, 'p': 1, 'q': 1, 'r': 1, 's': 1, 't': 1, 'u': 1, 'v': 1, 'w': 1, 'x': 1, 'y': 1, 'z': 1})

sentences = [
    'The quick brown fox jumps over the lazy dog.',
    'abcdefghijklm',
    'nopqrstuvwxyz'
]

With this target, the scores are as follows:

[
    (17, 'The quick brown fox jumps over the lazy dog.'),
    (13, 'abcdefghijklm'),
    (13, 'nopqrstuvwxyz')
]

The two "half-alphabets" have a score of 13 each, because they contain 13 letters of the alphabet. The sentence "The quick brown fox..." has a score of 17 = 26 - 9, because it contains the 26 letters of the alphabet, plus 9 excess letters (for instance, there are 3 excess 'o' and 2 excess 'e').

The optimal solution, obviously, is to cover the target perfectly with the two halves of the alphabet. But our greedy algorithm will select the "quick brown fox" sentence first, because it has a higher score.
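This can be verified by recomputing the scores with the same scoring function as above:

```python
from collections import Counter

target = Counter('abcdefghijklmnopqrstuvwxyz')

sentences = [
    'The quick brown fox jumps over the lazy dog.',
    'abcdefghijklm',
    'nopqrstuvwxyz',
]

def get_score(sentence_count, target):
    # useful characters minus superfluous characters
    return sum((sentence_count & target).values()) - sum((sentence_count - target).values())

counts = [Counter(c for c in s.lower() if c.isalpha()) for s in sentences]
scores = [get_score(c, target) for c in counts]
print(scores)  # [17, 13, 13] -> the greedy algorithm picks the pangram first
```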

Stef answered Oct 29 '22 14:10