I have a set, sentences, which contains sentences from the English language in the form of strings. I wish to create a subset of sentences, sentences2, which contains sentences containing only 20 unique words. Of course, there are many, many such subsets, but I'm looking for the "best" one, and by "best" I mean the subset where all words have the highest possible representation in sentences2.
The following example will further clarify what I mean by "best":
If I were to filter sentences for this set of words:
(i,you,do,think,yes,dont,can,it,good,cant,but,am,why,where,now,no,know,here,feel,are)
I would get the following:
sentences2 = set(("where are you now", "here i am", "can you do it", "yes i can", "but can i do it", "no you cant", "do you feel good", "yes i do", "why are you here", "i dont know", "i think i know why", "you dont think", "yes i do", "no you dont", "i dont think you think", "i feel good", "but i am good", "i cant do it now", "yes you can", "but i cant", "where do you think i am"))
and here each word is represented at least twice, as we can see if we use a counter on sentences2:
c = collections.Counter({'i': 13, 'you': 10, 'do': 6, 'think': 5, 'dont': 4, 'can': 4, 'good': 3, 'but': 3, 'am': 3, 'it': 3, 'cant': 3, 'yes': 3, 'know': 2, 'no': 2, 'here': 2, 'why': 2, 'feel': 2, 'are': 2, 'now': 2, 'where': 2})
If each word is represented at least twice we can say that this set of 20 words has a score of 2.
score = min(c.values())
However, the following set:
(i,you,he,do,think,yes,dont,can,it,good,cant,but,am,why,where,now,no,here,she,are)
has a score of 5, since if I use it to filter sentences, I get a sentences2 where each word is represented at least five times.
So I'm after the highest possible score for all possible 20 word combinations.
Here is my attempt at solving this problem:
import itertools
from collections import Counter

sentences = ...     # all the sentences in my text (each assumed to be a sequence of words)
common_words = ...  # the hundred most common words in the text
result_size = 20

highest_score = 0
for sample in itertools.combinations(common_words, result_size):
    sentences2 = list(filter(lambda s: set(s).issubset(sample), sentences))
    c = Counter([j for i in sentences2 for j in i])
    if len(c.values()) and min(c.values()) > highest_score:
        # this is the set with the highest score to date
        print(c)
        highest_score = min(c.values())
However, this algorithm will take forever to compute, with 5.3598337040381E+20 combinations if I'm not mistaken. Can you suggest how I might go about solving this with a much faster algorithm?
Please note that the resulting set can contain fewer than 20 words, and that this is completely fine. For example, c.values() in my algorithm does not have to match the size of result_size.
Also note that I'm expecting the words in the resulting set to be found in the top one hundred words (common_words contains 100 values). This is also by design.
Disclaimer: You have not specified the data characteristics, so my answer will assume that it is not too large (no more than 1,000,000 sentences, each with at most 1,000 words). Also, the description is a bit complicated and I might not have understood the problem fully.
Solution:
Instead of focusing on different combinations, why don't you create a hash map (a dict in Python) for your 100 most frequently used words, then traverse the array of sentences and, for each word in each sentence, increase its corresponding value (if it is already inside the dict)?
In the end, just sort this hash map according to the number of occurrences (value) of each word (key), then use the most frequent 20.
Complexity:
A quick look at the algorithm gives:
Traversing N sentences, traversing each of its M words, and increasing the hash-map value. At the end, sorting an array of (word, occurrences) pairs, which is negligible (the hash-map size is constant, 100 frequently used words), and extracting the first 20.
Time Complexity : O(N*M)
Space complexity : O(1) (we don't need to store the sentences, we just have the hashMap)
Sample Code:
Here is some quick pseudo-code:
word_occur_dict = {...}  #initialized with frequent words as keys, and zero as value for all
for sentence in sentences:               #for each sentence
    sentence_words = sentence.split(" ") #construct the word list
    for word in sentence_words:          #for each word
        if word in word_occur_dict:      #if it is a frequent word, increase value
            word_occur_dict[word] += 1
final_result = sort_dict(word_occur_dict)[:20] #returns list of tuples
Python Code:
import operator

common_words = ["do","think","yes","dont","can","it","good","cant","but","am","why","where","now","no","know","here","feel","are","i","you","he","she"]
common_words_dict = {}
sentences = ["where are you now", "here i am", "can you do it", "yes i can", "but can i do it", "no you cant", "do you feel good", "yes i do", "why are you here", "i dont know", "i think i know why", "you dont think", "yes i do", "no you dont", "i dont think you think", "i feel good", "but i am good", "i cant do it now", "yes you can", "but i cant", "where do you think i am"]

for w in common_words:  #initialize the dict
    common_words_dict[w] = 0

for sentence in sentences:               #for each sentence
    sentence_words = sentence.split(" ") #construct the word list
    for word in sentence_words:          #for each word
        if word in common_words_dict:    #if it is a frequent word, increase value
            common_words_dict[word] = common_words_dict[word] + 1

sorted_word_dict = sorted(common_words_dict.items(), key=operator.itemgetter(1))
print(sorted_word_dict[::-1][:20])
By the way, 'he' and 'she' do not appear anywhere in the sentences, but you said the following word combination has a score of 5:
(i,you,he,do,think,yes,dont,can,it,good,cant,but,am,why,where,now,no,here,she,are)
Have I misunderstood the problem?
Credit where it is due: StackOverflow: Sort a Python dictionary by value
STEP 1 should be to create a data structure that has only the words in sentences that appear in common_words. The structure can also have the number of times the word appears and a set of integers that references which sentences the word is in.
counts = [..., {
    word: string,
    count: number,
    ids: Set<number>
}, ...]
Some pseudo code
countsMap = Map()
For i = 0 To sentences.Size - 1
    sentence = sentences[i]
    For Each word in sentence
        If Not countsMap.Contains(word) Then
            countsMap.Add(word, {word:word, count:0, ids:Set()})
        End If
        value = countsMap.Get(word)
        If Not value.ids.Contains(i) Then
            value.count++
            value.ids.Add(i)
            countsMap[word] = value
        End If
    Next
Next
counts = countsMap.Values
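For reference, a minimal Python sketch of this step might look like the following, assuming sentences is a list of strings and common_words is the collection of words worth tracking (the names are illustrative):

from collections import defaultdict

def build_counts(sentences, common_words):
    common = set(common_words)
    ids = defaultdict(set)                   # word -> ids of sentences containing it
    for i, sentence in enumerate(sentences):
        for word in set(sentence.split()):   # unique words per sentence
            if word in common:
                ids[word].add(i)
    # count is simply the number of distinct sentences the word appears in
    return [{"word": w, "count": len(s), "ids": s} for w, s in ids.items()]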
Idealistic STEP 2: If you're lucky and your counts data structure contains fewer than 40 entries, you can do an exhaustive search of the C(n, 20) combinations in a reasonable amount of time on a single computer (C(38, 20) ~= 33 billion). This would involve iterating over the combinations and intersecting the ids sets together; the final set size is your minimum score.
Some pseudo code
bestScore = 0
bestCombo = null
For Each combo in Combinations(counts, 20)
    score = combo.Map(c => c.ids).Reduce((prev, curr) => prev.Intersect(curr)).Size
    If bestScore < score Then
        bestScore = score
        bestCombo = combo
    End If
Next
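A hedged Python sketch of this exhaustive step, assuming the list-of-dicts counts structure from the sketch after STEP 1:

from itertools import combinations

def best_combination(counts, size=20):
    best_score, best = 0, None
    for combo in combinations(counts, size):
        shared = set.intersection(*(entry["ids"] for entry in combo))
        score = len(shared)              # sentences containing all `size` words
        if score > best_score:
            best_score, best = score, combo
    return best_score, best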
Realistic STEP 2: In most cases your counts will contain many more than 40 unique words, in which case you'll have to settle for a best guess / approximation. I would probably do something like the following: use the above code, but instead of Pick 20 use Pick 2, sort your results descending by the score and take 10.
Some pseudo code
list = []
For Each combo in Combinations(counts, 2)
    score = combo[0].ids.Intersect(combo[1].ids).Size
    list.Add( { score:score, words:[ combo[0].word, combo[1].word ] } )
Next

// sort descending by score
list.Sort((a, b) => b.score - a.score)

// grab the 20 best words
result = Set()
i = 0
While result.Size < 20
    result.Add(list[i].words[0])
    result.Add(list[i].words[1])
    i = i + 1
End While
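And a Python sketch of this pair-based approximation, again reusing the counts structure from STEP 1 (names are illustrative):

from itertools import combinations

def top_words_from_pairs(counts, limit=20):
    pairs = []
    for a, b in combinations(counts, 2):
        score = len(a["ids"] & b["ids"])     # sentences the two words share
        pairs.append((score, a["word"], b["word"]))
    pairs.sort(reverse=True)                 # best-scoring pairs first
    result, i = set(), 0
    while len(result) < limit and i < len(pairs):
        result.update(pairs[i][1:])          # add both words of the next best pair
        i += 1
    return result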
Will you get a final score greater than 1? Statistically that would depend on how many unique words and sentences there are, but probably not.
EDIT: An implementation note and correction. Intersecting the sets of sentence ids that the words appear in will give you the minimum score minus 1 (zero indexed). For example, "Dog" is in sentences 1 and 2; "Cat" is in sentences 2 and 3; "Frog" is in sentence 4. The intersection [1,2] /\ [2,3] /\ [4] = [], but the minimum score is 1, i.e. result.Size() + 1. In the same way, just "Dog" and "Cat": [1,2] /\ [2,3] = [2] has a set size of 1, but the minimum score is 2.
May I suggest you try the approach below. I have no mathematical proof that it produces the absolutely best "score", or indeed some other "measure of goodness", but I do believe it may help you out, unless, of course, the case requires that you actually prove and find the absolutely best.
I don't speak python, but the strategy and ensuing implementation are simple enough to not even require pseudo-code to explain. I will use quite a few words though, not because it is very complicated, but to be clear (I hope).
You will need a method to diagonalize a matrix, or at least to find the eigenvector corresponding to the largest eigenvalue. Not every language provides natively a method for that. In C++ you might use the Eigen library. In C#, which I used to test, I hooked up MathNet. In python, I've no idea. The requirements for the diagonalization/eigenvector facility are limited: the matrix is always real, symmetric, all elements are positive. However, although the diagonal elements are at least as big as any other in their row/column, the matrix is certainly not "diagonal-dominant".
The problem you present can in abstracto be formulated for integers, not words, which makes it more convenient (for me...) to formulate and implement (but otherwise, it is quite irrelevant). In fact, we only need your "common_words", hence 100 integers [0..99], and you set up once the map between words and integers simply by putting the words in an array or list and using their index.
As a side note: the strategy is likely (much) more adequate for your language application than for a completely generalized problem with integers where you might construct "maliciously" difficult input. The reason is that we'll essentially exploit pair-wise correlations, if there are any, between the items (words within the same sentence), whereas dominantly strong correlations for triples, quadruples, ... might make it less efficient (but we'll do something for the triples). Obviously, in well-formed language you'd expect correlations: "am" is often to show up together with "I", and whereas "have" may in total be more frequent than "has", as soon as you find "he" you might be more likely to also get a "has" than "have" in the same sentence. And so on.
A few definitions are in order.
NCommon is the number of elements in your common_words (in this case, 100). The common_words "are" the integers [0..NCommon-1].
NSetMax is your problem limit (in this case, 20): the maximum number of 'words' you are willing to use in the end.
F is a filter: the collection of 'words' that you're using to define which sentences are included in some collection. In the end you must have adapted your filter such that F.Count does not exceed NSetMax.
M(N) is a square NxN matrix; row and column indexes are in [0..N-1]
S(F) is the collection of sentences (from the problem input) that satisfy a filter F. A "sentence" is here always a collection of integers in the range [0..NCommon-1], see above. Duplicate words in the same sentence are irrelevant (problem description) and such duplications are removed from our integer-sentences.
Here we go.
I. Preparation.
Initialize a matrix MCommon = M(NCommon). Create the FCommon filter that contains all the common_words.
Filter the input, removing duplicate words in sentences. This produces a sentences set S0 = S(FCommon), which is your "real" input: everything that's irrelevant has been removed.
On the fly, while accepting sentences into S0, fill the matrix MCommon: {for (j in sentence): for (k in sentence): M(j,k) += 1}. The matrix is symmetric, so you may just fill the upper-right triangle and mirror it into the lower-left triangle at the end.
Having completed the scan, M[j,k] is the number of k-occurrences in 'sentences' that contain j (and vice versa: the matrix is symmetric). M[k,k] is the total count of k in all sentences of S0 together. M is the (pair-wise) correlation matrix, telling you how likely it is that a {j,k} combination occurs in the underlying sentence set S0.
(Always, when dealing with such matrices: if there happen to be, at the end, empty columns (and hence rows), remove these columns and rows, simultaneously removing the corresponding entry in the filter that underlies the matrix, as it obviously doesn't play a role. Henceforth we'll assume this has been done (unless stated otherwise), i.e. no column/row in the matrix is empty.)
II. Compute result (main approach, we'll refine below):
Compute the eigenvector E of MCommon that corresponds to the highest eigenvalue: E is an array (vector) with NCommon coefficients.
Set NTarget=NSetMax
Determine which are the NTarget biggest coefficients in E. We're not interested in their values, but in their indexes. Put these indexes together: they define a new filter F(NTarget).
Run S0 through the new filter to produce STarget. Count all 'word'-occurrences, find their minimum: that's your "set-quality" value. You may do so, for instance, by also computing the associated MTarget and scanning the diagonal values MTarget[k,k]. That may seem to involve unnecessary additional effort, as you're computing also the off-diagonal elements, but we'll see that MTarget may be handy in the follow-up refinements.
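In Python, a rough sketch of the preparation and of this main approach might look like the following (NumPy's eigh handles the real, symmetric matrix; the data layout and names are assumptions for illustration, not prescriptions):

import numpy as np

def build_matrix(int_sentences, n_common):
    # int_sentences: list of sets of word indexes in [0, n_common)
    M = np.zeros((n_common, n_common))
    for s in int_sentences:
        for j in s:
            for k in s:
                M[j, k] += 1
    return M

def top_eigenvector(M):
    # eigh handles real symmetric matrices; eigenvalues come back in ascending order
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -1]

def score(int_sentences, fltr):
    # sentences whose words all lie in the filter; score = least-represented word
    kept = [s for s in int_sentences if s <= fltr]
    if not kept:
        return 0
    return min(sum(w in s for s in kept) for w in fltr)

def main_approach(int_sentences, n_common, n_set_max):
    M = build_matrix(int_sentences, n_common)
    E = np.abs(top_eigenvector(M))          # overall sign of the eigenvector is arbitrary
    fltr = {int(i) for i in np.argsort(E)[-n_set_max:]}  # indexes of the biggest coefficients
    return fltr, score(int_sentences, fltr)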
III. Refinements.
A) We need to verify whether, by removing one or a few more items from the filter F(NSetMax), reducing it to some NTarget smaller than NSetMax, we would get a better score value. This requires some care: it is very well possible that removing TWO (or 3, ...) items would improve the score, but that removing either of them alone would deteriorate it.
Let's take a first (and fairly reasonable) shot at this issue.
While producing STarget you also fill a new matrix MTarget (you see, it is handy...), as you did with MCommon before. The size is NTarget. Get its largest-eigenvalue eigenvector. Determine the SMALLEST coefficient. Remove the corresponding index from the filter.
Filter again (as input you can, of course, now use the STarget collection) and compute the score. If it's better: mark/remember. In all cases (improvement or not) continue in the same fashion, reducing the filter one-by-one until you've nothing left.
B) One may expect that the reason, as briefly explained, for the "cautious" approach in further reducing the filter below NSetMax - one at a time - might also apply to some extent to the first step, where we reduce F(NCommon) to F(NTarget = NSetMax) in one big step.
To accommodate this, we may want to go from NCommon down to NSetMax in steps. In order not to incur too much computational overhead, we won't take steps of size 1, but rather each time a reduction by, for instance, some 10%, or something similar.
So, in II above, do not set NTarget immediately to NSetMax, but (for instance) set NTarget = 90. Construct the corresponding filter, apply the filter to S0, giving S1, produce also the matrix M1, get its eigenvector (largest eigenvalue). Repeat: set NTarget=80, 70, 60 and so on. In later stages you may get even more cautious, lowering 40 to 35 to 30 to 28 to 26 ... In each step you build upon and use the results from the previous, to optimally exploit the reduction in size and computational effort.
All the time, you want to monitor whether or not the largest NSetMax (the final value in this part of the algorithm) coefficients always occur with the same indexes. This will give some information regarding whether the final result is reliable: the stability, or not, of the expected final filter against the algorithmic path towards it.
C) Consider the situation where we have reduced to NSetMax and investigate whether or not it should be reduced further in a one-at-a-time approach. (The same may apply in the final stages of (B), if, when approaching NSetMax from above, you go "one-at-a-time" as soon as you get close to NSetMax.)
Suppose that, during this phase of the algorithm, there are in your (first or a later) STarget collection certain PAIRS of 'words' such that removing such a specific pair from the filter would improve things, while neither of their individual coefficients in the eigenvector is the smallest. I'm not sure whether this is likely or even possible, but let's see how we might handle it. If (your) tests show that it's irrelevant, you may remove this feature from the algorithm in a final implementation.
Assume we have some NTarget, an associated filter FTarget and matrix MTarget. (The order of the items ('words') in the filter always equals (of course) the order of the rows and columns in the matrix.)
From MTarget we can directly deduce some information about what would happen if we were to remove the j-th item from the filter. Of course, the j-th row and j-th column in the matrix become empty (all zeros). More interestingly, M[j,k] tells how many times item k occurs in all sentences that contain item j. So, when eliminating j (by removing it from the filter), we know in advance that in the resulting new matrix MReduced, element MReduced[k,k] will diminish precisely by that M[j,k] value. (By the way, this provides an alternative way of determining which item to remove, if you opt to remove only one: the minimum over j of M[j,j] is the score of the associated set S, and from M we can compute how ALL diagonal elements would change upon removing a particular item j, enabling a precomputation of the resulting score.)
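A minimal sketch of that precomputation, assuming the current matrix is held as a NumPy array indexed by the current filter (an illustration, not part of the original description):

import numpy as np

def predicted_scores_after_removal(M):
    # Removing item j drops every diagonal element M[k, k] by M[j, k], so the
    # score after removal can be read off without re-filtering the sentences.
    n = M.shape[0]
    keep = np.arange(n)
    return {j: (np.diag(M) - M[j])[keep != j].min() for j in range(n)}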
Here, however, we would like to know how the score would be affected upon removal of some PAIR of items, j and k. We now need to determine how M[p,p] is affected for all p that are not j or k. This you cannot directly compute from M: removing j affects row k, and whereas we know how it changes [k,k] and any other [p,p], we don't know how it would change [k,p], which is needed to compute how ALSO ('subsequently') removing k will change [p,p]. Note, by the way, that it must be immaterial for the final outcome whether you remove first j and then k, or vice versa, or both simultaneously.
For this we need some kind of 'derivative'. Fortunately we can compute and process that without too much computational effort (given that NTarget is already rather small, 20 or so).
Consider a 'reduction filter' G(F; j) related to the current filter F, defined simply as processing F but ignoring item j. With G we compute, in the same manner as always, a 'reduction matrix' N(G), where for convenience in the discussion (and implementation) we keep the same size (not removing the empty column/row). Of course, in such an N matrix, the j-th row and j-th column are empty. Its diagonal elements will have the values explained above (we could have computed these from M directly). However, now we also have all off-diagonal elements that result from removing j.
From N(G(F;j)) we can now determine what would happen if we ALSO remove item k; see the elaboration above about how to get the expected score from a current matrix. So, this allows computing the score upon removal of the pair {j,k}.
If the set size is 20 we have to compute 20 N(G(F;j)) matrices, a relatively small effort, I assume (your sentence collection will by now also have become a lot smaller than it was originally). Then, having all N, compute for each of the 20*19/2 unique pairs the resulting pair-removal scores, and you're in a position to choose which PAIR (rather than individual element) to remove from a filter. You may compare that, on the fly, against reduction 'one at a time' and make appropriate choices for how to systematically reduce the filter in search of a better score. There are many ways to go about this comparison. A relatively simple one would be: calculate a reduction from first a pair and then a single one (in total 3 elements), and compare against first removing a single one and then a pair. Pick the best of these two, but perform from that choice only the first step (either a single or a pair) and repeat the procedure.
Note that using these "derived" filters G, matrices N and the ensuing score-precalculations as explained, you effectively bring in correlations between TRIPLES of items: you determine what happens for all {j,k,p} to the frequency of p upon removing both j and k. This idea can, of course, be extended to incorporate quadruples and beyond, but (a) I don't believe it will practically help very much, and (b) the further you go on this road, the faster the computational effort will increase. I even doubt whether bringing in the "triples" as explained here is relevant, but I'm not a language expert and apart from a little additional programming effort there's no major drawback.
D) The backbone of the strategy is relying on the eigenvector with the largest eigenvalue to point to the relevant items to subsequently filter with. Of course, it may happen that two or more eigenvalues are "almost" the same and the corresponding eigenvectors may point to completely different sets of items when analyzing their largest components. In that case it will be worthwhile to "branch", that is: go with one of them, work to the end, and afterwards redo everything with their competitor(s) and see if they produce a better outcome. A problem arises here if you encounter a lot of branches (at various points on the road to a solution). This is not likely to happen, but the implementation should have some strategy to deal with it in a practical way if it does happen. I suggest that you first leave any branching out, but that you do monitor the occurrence of "competitive" largest eigenvalues so as to output a warning to the user. Alternatively, you may implement a handling of such branching, but be very strict about what the program will consider "almost equal", or put in some limit on the number of branches that are (in total) to be investigated, thereby reducing the chance that the computation time gets out of hand.
I have to leave it here (for lack of time...), hope I've explained sufficiently the idea, as well as some details and refinements to consider.
I've not had the time to organize for myself relevant 'real language' sentences as input and test against that. I've programmed the basic steps in C#, tested against random integer 'sentences', with some biases to force certain 'words' to occur more often than others, but without bothering about correlations among 'words' within 'sentences'. The results appeared quite reasonable to me, but I didn't have the time to analyze them extensively.
Given that the 'random' sequences I used as input lack structure, while real language is expected to exhibit the significant pair-wise correlations that the strategy exploits, I have good hope that it might be a useful thing for you to have a look at.
[EDITED] Note added: in the above I've occasionally been sloppy in talking about "j", "k" and so on. I'm sorry about that. Sometimes "j" refers to the j-th entry of something (a matrix row, an index in a filter list), sometimes it refers to the corresponding VALUE in (usually) a filter. For instance, a filter may contain 18 items, numbered (indexes) 0..17, but their VALUES always refer to the original Common_words list, so they may be {3, 15, 29, 30, 31, 40, ...}. Then, when it says "remove j from the filter", it will usually mean "remove the j-th entry from the filter" (and that entry may have any value from [0..NCommon-1]). When APPLYING a filter to a sentence, you compare the VALUES in the filter against the values in the sentence. I hope that the context - in combination with a fair understanding of the line of reasoning - always makes it clear what is really meant.
[EDITED: some test results] I've run my C# implementation, using the above algorithm (more or less: most, but not all, of the refinements/details described), to get some impression of what it would produce.
For input I took 4 books (plain text) from the Gutenberg project. In total (only) 27k sentences, 380k words (20k different), so a rather small sample.
The list of common words as determined from that input started with "the", "of", "and", "to"... (when sorted to frequency of occurrence in the total input).
The algorithm filtered 14 words ("i", "what", "do", "you", "it", "yes", ...) to produce an "optimal" "set-quality-value" of 8 (139 sentences were found with only those 14 words).
As I was suspicious about the assumption that just 100 "common words" were to be used, I had a priori allowed 500 common words, rather than 100, and sure enough among the words that made it to the final filter, 4 were frequency-numbered above 100 ("yes", "know", "say", "think"): "yes", for instance, was #224 in the overall list if you were to sort them by occurrence in all input, presumably the base for "common words".
I'm not sure if it's possible to come up with the best solution in less than exponential time; the problem may not have enough structure. But here is a heuristic to come up with a 'good' solution.
I think the way to do this is to start with wordset having size 0, and to add words to it one by one in a 'clever' way, with a max of 20. Consider that for a given wordset_n, the score for each individual word can only increase or stay the same when a new word is added. The only way that wordset_(n+1) can have a lower score than wordset_n is if the (n+1)th word brings down the minimum. So we restrict ourselves to only adding words which bring the minimum up or keep it the same (but see an elaboration below).
So to a first approximation (see the sketch below):
1. Sort common_words by frequency in sentences.
2. Add words to wordset (most frequent first) until the score is nonzero.
3. Build a tree of wordsets constructed by only adding words which increase or maintain the score of the previous node (max N = 20). A path down this tree will be wordset_n, wordset_(n+1), wordset_(n+2), with each node consisting of the previous node plus one new word.
On a second approximation, a) you may want to try multiple combinations for step 2 (100 choose 3 is about 162,000, which is not out of the question). b) Additionally, for step 3, you may want to look ahead two steps, not just one, i.e. only reject a word for spot n+1 in wordset if, after all possibilities for word n+2, the score is still lower than wordset_n's. This might help if there are anti-correlations, i.e. a short sentence with one verb is unlikely to contain another. Lastly, c) if this method is still computationally prohibitive, you can constrain tree construction further by closing off branches where the (n+1)th word doesn't raise the score by a given amount.
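A rough Python sketch of the first approximation (greedy, one step at a time, no look-ahead; sentences are assumed to be strings of space-separated words and the names are illustrative):

from collections import Counter

def score(sentences, wordset):
    kept = [s for s in sentences if set(s.split()) <= wordset]
    c = Counter(w for s in kept for w in s.split())
    return min((c[w] for w in wordset), default=0)

def greedy_wordset(sentences, common_words, max_size=20):
    # common_words is assumed sorted by frequency in `sentences`, most frequent first
    wordset = set()
    for w in common_words:                      # step 2: seed until the score is nonzero
        wordset.add(w)
        if score(sentences, wordset) > 0 or len(wordset) == max_size:
            break
    while len(wordset) < max_size:              # step 3: only score-maintaining additions
        current = score(sentences, wordset)
        best_word, best_score = None, current
        for w in common_words:
            if w in wordset:
                continue
            s = score(sentences, wordset | {w})
            if s >= best_score:                 # keep or raise the minimum
                best_word, best_score = w, s
        if best_word is None:
            break
        wordset.add(best_word)
    return wordset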
If the distributions of words in sentences are relatively independent of one another, or only exhibit positive correlations, this method may get you something pretty close to optimal.
I would look for a form of data clustering, to speed up the search across each 20-word trial.
The data that matters is the unique words in a sentence.
Two sentences can be considered close if the Jaccard distance is small. The Jaccard distance is 1 - (size of the intersection of the words in the sentences) / (size of the union of the words in the sentences).
Because the Jaccard distance is a metric (it meets the triangle inequality), an M-tree index can be built which allows sentences to be found faster.
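As a small illustration, the distance itself is easy to express in Python, treating each sentence as its set of unique words:

def jaccard_distance(s1, s2):
    # Jaccard distance between two sentences, based on their unique words
    a, b = set(s1.split()), set(s2.split())
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0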
From this basis, clustering can occur (the best results will be close to each other).
Finally, by iterating through each sentence and finding its K nearest neighbours, you can find a set of up to 20 words which could be used. This will give a different value for K (the sentences which share the best 20 words).
I would probably use a database to support this, with select union and select intersection allowing set testing.