Most efficient way to index words in a document?

Tags: python, text, nlp

This came up in another question but I figured it is best to ask it as a separate question. Given a large list of sentences (on the order of 100,000):

[
"This is sentence 1 as an example",
"This is sentence 1 as another example",
"This is sentence 2",
"This is sentence 3 as another example ",
"This is sentence 4"
]

what is the best way to code the following function?

def GetSentences(word1, word2, position):
    return ""

where, given two words word1 and word2 and a position position, the function should return the indices of all sentences in which word2 appears exactly position words after word1. For example:

GetSentences("sentence", "another", 3)

should return the indices of sentences 1 and 3 (i.e. [1, 3]), since in both of them "another" appears exactly 3 words after "sentence". My current approach uses a dictionary like this:

from collections import defaultdict

Index = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: [])))

for sentenceIndex, sentence in enumerate(sentences):
    words = sentence.split()
    for index, word in enumerate(words):
        for i, word2 in enumerate(words[index:]):
            Index[word][word2][i+1].append(sentenceIndex)

But this quickly blows everything out of proportion on a dataset that is about 130 MB in size as my 48GB RAM is exhausted in less than 5 minutes. I somehow get a feeling this is a common problem but can't find any references on how to solve this efficiently. Any suggestions on how to approach this?
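
For a rough sense of why this index explodes, here is a back-of-the-envelope count; the ~20-word average sentence length is an assumption, since only the corpus size is given above:

# Rough estimate only: the average sentence length is assumed, not measured.
words_per_sentence = 20
num_sentences = 100000

# Each sentence of n words produces roughly n*(n+1)/2 (word, word2, offset) entries
# in the nested dictionaries above.
entries_per_sentence = words_per_sentence * (words_per_sentence + 1) // 2
total_entries = num_sentences * entries_per_sentence

# ~21 million list appends, plus a separate inner dict for every distinct
# (word, word2) pair; the per-dict overhead, not the 130 MB of text itself,
# is what dominates memory.
print(total_entries)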

asked Nov 05 '11 by Legend




2 Answers

Use a database for storing the values.

  1. First, add all the sentences to one table (they should have IDs). You may call it e.g. sentences.
  2. Second, create a table with the words contained in all the sentences (call it e.g. words and give each word an ID), and record the connections between sentences records and words records in a separate table (call it e.g. sentences_words; it should have two columns, preferably word_id and sentence_id).
  3. When searching for sentences containing all the mentioned words, your job will be simplified:

    1. You should first find the records from the words table where the words are exactly the ones you are searching for. The query could look like this:

      SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3');
      
    2. Second, you should find the sentence_id values from the sentences_words table that have the required word_id values (corresponding to the words from the words table). The initial query could look like this:

      SELECT `sentence_id`, `word_id` FROM `sentences_words`
      WHERE `word_id` IN ([here goes list of words' ids]);
      

      which could be simplified to this:

      SELECT `sentence_id`, `word_id` FROM `sentences_words`
      WHERE `word_id` IN (
          SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3')
      );
      
    3. Filter the result within Python, keeping only the sentence_id values that are associated with all of the required word_id values (a sketch of this filtering is given below).

This is basically a solution based on storing a big amount of data in the form that is best suited for it: a database.
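
A minimal sketch of the Python-side filtering from step 3, assuming rows holds the (sentence_id, word_id) tuples returned by the query above and required_word_ids holds the IDs fetched for the searched words (both names, and the helper itself, are illustrative, not part of the answer's schema):

from collections import defaultdict

def sentences_with_all_words(rows, required_word_ids):
    # Group the matched word_ids by sentence, then keep only the sentences
    # that cover every required word_id.
    required = set(required_word_ids)
    found = defaultdict(set)   # sentence_id -> word_ids seen in that sentence
    for sentence_id, word_id in rows:
        found[sentence_id].add(word_id)
    return [sentence_id for sentence_id, word_ids in found.items()
            if required <= word_ids]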

EDIT:

  1. If you will only ever search for two words, you can do even more (almost everything) on the DBMS side.
  2. Since you also need the position difference, you should store the position of each word in a third column of the sentences_words table (let's call it just position) and, when searching for the appropriate words, calculate the difference between this value for the two words (see the sketch below).
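
The answer does not name a particular DBMS, so purely as an illustration, here is a minimal end-to-end sketch using Python's built-in sqlite3 module with the table and column names suggested above; the get_sentences helper and the self-join are one possible way to keep the position arithmetic on the database side, not the answerer's own code:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, sentence TEXT);
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE sentences_words (sentence_id INTEGER, word_id INTEGER, position INTEGER);
""")

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example",
    "This is sentence 4",
]

for sentence_id, sentence in enumerate(sentences):
    cur.execute("INSERT INTO sentences VALUES (?, ?)", (sentence_id, sentence))
    for position, word in enumerate(sentence.split()):
        cur.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        word_id = cur.execute("SELECT id FROM words WHERE word = ?", (word,)).fetchone()[0]
        cur.execute("INSERT INTO sentences_words VALUES (?, ?, ?)",
                    (sentence_id, word_id, position))

def get_sentences(word1, word2, position):
    # Self-join sentences_words so word2 sits exactly `position` places after
    # word1 within the same sentence; the position arithmetic stays in SQL.
    rows = cur.execute("""
        SELECT DISTINCT sw1.sentence_id
        FROM sentences_words sw1
        JOIN sentences_words sw2 ON sw2.sentence_id = sw1.sentence_id
                                AND sw2.position = sw1.position + ?
        JOIN words w1 ON w1.id = sw1.word_id AND w1.word = ?
        JOIN words w2 ON w2.id = sw2.word_id AND w2.word = ?
        ORDER BY sw1.sentence_id
    """, (position, word1, word2)).fetchall()
    return [row[0] for row in rows]

print(get_sentences("sentence", "another", 3))  # expected: [1, 3]

With the position difference handled by the join, the two-word case needs no Python-side filtering at all, which matches point 1 of the edit.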
answered Sep 26 '22 by Tadeck


Here's how I did it in Python. That said, if this needs to be done more than once, a DBMS is the right tool for the job; however, this seems to work pretty well for me with a million rows.

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4"
    ]

sentences = sentences * 200 * 1000

sentencesProcessed = []

def preprocess():
    global sentences
    global sentencesProcessed
    # may want to do a regex split on whitespace
    sentencesProcessed = [sentence.split(" ") for sentence in sentences]

    # can deallocate sentences now
    sentences = None


def GetSentences(word1, word2, position):
    results = []
    for sentenceIndex, sentence in enumerate(sentencesProcessed):
        for wordIndex, word in enumerate(sentence[:-position]):
            if word == word1 and sentence[wordIndex + position] == word2:
                results.append(sentenceIndex)
    return results

def main():
    preprocess()
    results = GetSentences("sentence", "another", 3)
    print "Got", len(results), "results"

if __name__ == "__main__":
    main()
answered Sep 22 '22 by shookster