Best algorithm to index sentences

Question

Imagine I have a situation where I need to index sentences. Let me explain it a little bit deeper.

For example I have these sentences:

The beautiful sky.
Beautiful sky dream.
Beautiful dream.

As far as I can imagine the index should look something like this:

alt text http://img7.imageshack.us/img7/4029/indexarb.png

But also I would like to do search by any of these words.

For example, if I do search by "the" It should show give me connection to "beautiful". if I do search by "beautiful" it should give me connections to (previous)"The", (next)"sky" and "dream". If I search by "sky" it should give (previous) connection to "beautiful" and etc...

Any Ideas ? Maybe you know already existing algorithm for this kind of problem ?

Unknown · Accepted Answer

Short Answer

Create a struct with two vectors of previous/forward links. Then store the word structs in a hash table with the key as the word itself.

Long Answer

This is a linguistic parsing problem that is not easily solved unless you don't mind gibberish.

I went to the park basketball court.
Would you park the car.

Your linking algorithm will create sentences like:

I went to the park the car.
Would you park basketball court.

I'm not quite sure of the SEO applications of this, but I would not welcome another gibberish spam site taking up a search result.

Il-Bhima · Answer

I imagine you would want some sort of Inverted index structure. You would have a Hashmap with the words as keys pointing to lists of pairs of the form (sentence_id, position). You would then store your sentences as arrays or linked lists. Your example would look like this:

sentence[0] = ['the','beautiful', 'sky'];
sentence[1] = ['beautiful','sky', 'dream'];
sentence[2] = ['beautiful', 'dream'];

inverted_index = 
{
 'the': {(0,0)},
 'beautiful': {(0,1), (1,0), (2,0)},
 'sky' : {(0,2),(1,1)},
 'dream':{(1,2), (2,1)}
};

Using this structure lookups on words can be done in constant time. Having identified the word you want, finding the previous and subsequent word in a given sentence can also be done in constant time.

Hope this helps.

dragonfly · Answer

You can try and dig into Markov chains, formed from words of sentences. Also you'll require both-way chain (i.e. to find next and previous words), i.e. store probable words that appear just after the given or just before it.

Of course, Markov chain is a stochastic process to generate content, however similar approach may be used to store information you need.

Best algorithm to index sentences

Tags:

algorithm

indexing

search

seo

Lukas Šalkauskas

3 Answers

Unknown

Il-Bhima

dragonfly

Recent Activity

Donate For Us

Best algorithm to index sentences

Tags:

algorithm

indexing

search

seo

Lukas Šalkauskas

3 Answers

Unknown

Il-Bhima

dragonfly

Related questions

Recent Activity

Donate For Us