This came up in another question, but I figured it is best to ask it separately. Given a large list of sentences (on the order of hundreds of thousands):
[
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4"
]
what is the best way to code the following function?
def GetSentences(word1, word2, position):
    return ""
where, given two words, word1 and word2, and a position position, the function should return the list of all sentences in which word2 occurs position words after word1. For example:
GetSentences("sentence", "another", 3)
should return 1 and 3, the indices of the matching sentences. My current approach was using a dictionary like this:
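from collections import defaultdict

# word1 -> word2 -> distance -> list of sentence indices
Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for sentenceIndex, sentence in enumerate(sentences):
    words = sentence.split()
    for index, word in enumerate(words):
        # Start at the next word so that i + 1 is the distance from word to word2
        for i, word2 in enumerate(words[index + 1:]):
            Index[word][word2][i + 1].append(sentenceIndex)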
But this quickly blows everything out of proportion: on a dataset of about 130 MB, my 48 GB of RAM is exhausted in less than 5 minutes. I have a feeling this is a common problem, but I can't find any references on how to solve it efficiently. Any suggestions on how to approach this?
Use a database for storing the values.
First, add all the sentences to one table (call it e.g. sentences). Second, create a table with the words contained in the sentences (call it e.g. words, and give each word an ID), saving the connection between the sentences table's records and the words table's records in a separate table (call it e.g. sentences_words; it should have two columns, preferably word_id and sentence_id).
When searching for sentences containing all the mentioned words, your job will be simplified:
You should first find the records in the words table whose words are exactly the ones you are searching for. The query could look like this:
SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3');
Second, you should find the sentence_id values in the sentences_words table that have the required word_id values (corresponding to the words from the words table). The initial query could look like this:
SELECT `sentence_id`, `word_id` FROM `sentences_words`
WHERE `word_id` IN ([here goes list of words' ids]);
which could be simplified to this:
SELECT `sentence_id`, `word_id` FROM `sentences_words`
WHERE `word_id` IN (
    SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3')
);
Filter the result within Python to return only the sentence_id values that have all the required word_id values.
This is basically a solution based on storing a large amount of data in the form best suited for it: a database.
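The steps above could be sketched roughly as follows. This is only a minimal illustration, assuming SQLite through Python's built-in sqlite3 module and an in-memory database; the helper name sentences_with_all is hypothetical, and the column names beyond word_id and sentence_id are assumptions rather than part of the answer.

```python
import sqlite3

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]

# In-memory database for illustration; a real dataset would live in a file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, sentence TEXT);
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE sentences_words (sentence_id INTEGER, word_id INTEGER);
""")

for sentence_id, sentence in enumerate(sentences):
    conn.execute(
        "INSERT INTO sentences (id, sentence) VALUES (?, ?)",
        (sentence_id, sentence),
    )
    # One link row per distinct word in the sentence.
    for word in set(sentence.split()):
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        word_id = conn.execute(
            "SELECT id FROM words WHERE word = ?", (word,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO sentences_words (sentence_id, word_id) VALUES (?, ?)",
            (sentence_id, word_id),
        )


def sentences_with_all(conn, wanted):
    """Hypothetical helper: sorted ids of sentences containing every word in `wanted`."""
    placeholders = ", ".join("?" for _ in wanted)
    rows = conn.execute(
        "SELECT sentence_id, word_id FROM sentences_words "
        "WHERE word_id IN (SELECT id FROM words WHERE word IN (%s))" % placeholders,
        list(wanted),
    ).fetchall()
    # Filter in Python: keep sentences that matched all the requested words.
    found = {}
    for sentence_id, word_id in rows:
        found.setdefault(sentence_id, set()).add(word_id)
    needed = len(set(wanted))
    return sorted(sid for sid, ids in found.items() if len(ids) == needed)


print(sentences_with_all(conn, ["sentence", "another"]))  # [1, 3]
```

Only the number of placeholders is formatted into the query string; the search words themselves are passed as bound parameters.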
EDIT:
To handle positions, add one more column to the sentences_words table (let's call it just position) that stores the word's position within the sentence, and when searching for the appropriate words, calculate the difference between the position values associated with the two words.
Here's how I did it in Python. Assuming this needs to be done more than once, a DBMS is the right tool for the job. However, this seems to work pretty well for me with a million rows.
sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4"
]
# replicate the sample to get a million rows
sentences = sentences * 200 * 1000
sentencesProcessed = []

def preprocess():
    global sentences
    global sentencesProcessed
    # may want to do a regex split on whitespace
    sentencesProcessed = [sentence.split(" ") for sentence in sentences]
    # can deallocate sentences now
    sentences = None

def GetSentences(word1, word2, position):
    # linear scan; assumes position >= 1
    results = []
    for sentenceIndex, sentence in enumerate(sentencesProcessed):
        # only words that still have a word `position` places after them
        for wordIndex, word in enumerate(sentence[:-position]):
            if word == word1 and sentence[wordIndex + position] == word2:
                results.append(sentenceIndex)
                break  # report each sentence at most once
    return results

def main():
    preprocess()
    results = GetSentences("sentence", "another", 3)
    print("Got", len(results), "results")

if __name__ == "__main__":
    main()
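If the lookup has to run many times, scanning every sentence per query may become the bottleneck. One possible alternative, which is not the code above but a different technique, is a plain inverted index that stores one (sentence index, word position) pair per word occurrence. That keeps the index size proportional to the total number of words, rather than the number of word pairs. The names build_index and get_sentences are hypothetical, and this is only a sketch under those assumptions:

```python
from collections import defaultdict


def build_index(sentences):
    """Hypothetical helper: map each word to a list of (sentence_index, position) pairs."""
    index = defaultdict(list)
    for sentence_index, sentence in enumerate(sentences):
        for position, word in enumerate(sentence.split()):
            index[word].append((sentence_index, position))
    return index


def get_sentences(index, word1, word2, position):
    """Sorted indices of sentences where word2 occurs `position` words after word1."""
    # All places where word2 occurs, as a set for constant-time membership tests.
    targets = set(index.get(word2, ()))
    hits = {
        sentence_index
        for sentence_index, word_position in index.get(word1, ())
        if (sentence_index, word_position + position) in targets
    }
    return sorted(hits)


sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]
index = build_index(sentences)
print(get_sentences(index, "sentence", "another", 3))  # [1, 3]
```

Each query touches only the occurrences of the two words involved, so very common words remain the expensive case.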