Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best algorithm to index sentences

Imagine I have a situation where I need to index sentences. Let me explain it a little bit deeper.

For example I have these sentences:

  1. The beautiful sky.
  2. Beautiful sky dream.
  3. Beautiful dream.

As far as I can imagine the index should look something like this:

alt text http://img7.imageshack.us/img7/4029/indexarb.png

But also I would like to do search by any of these words.

For example, if I do search by "the" It should show give me connection to "beautiful". if I do search by "beautiful" it should give me connections to (previous)"The", (next)"sky" and "dream". If I search by "sky" it should give (previous) connection to "beautiful" and etc...

Any Ideas ? Maybe you know already existing algorithm for this kind of problem ?

like image 748
Lukas Šalkauskas Avatar asked Apr 08 '09 06:04

Lukas Šalkauskas


3 Answers

Short Answer

Create a struct with two vectors of previous/forward links. Then store the word structs in a hash table with the key as the word itself.

Long Answer

This is a linguistic parsing problem that is not easily solved unless you don't mind gibberish.

  1. I went to the park basketball court.
  2. Would you park the car.

Your linking algorithm will create sentences like:

  1. I went to the park the car.
  2. Would you park basketball court.

I'm not quite sure of the SEO applications of this, but I would not welcome another gibberish spam site taking up a search result.

like image 58
Unknown Avatar answered Nov 16 '22 11:11

Unknown


I imagine you would want some sort of Inverted index structure. You would have a Hashmap with the words as keys pointing to lists of pairs of the form (sentence_id, position). You would then store your sentences as arrays or linked lists. Your example would look like this:

sentence[0] = ['the','beautiful', 'sky'];
sentence[1] = ['beautiful','sky', 'dream'];
sentence[2] = ['beautiful', 'dream'];

inverted_index = 
{
 'the': {(0,0)},
 'beautiful': {(0,1), (1,0), (2,0)},
 'sky' : {(0,2),(1,1)},
 'dream':{(1,2), (2,1)}
};

Using this structure lookups on words can be done in constant time. Having identified the word you want, finding the previous and subsequent word in a given sentence can also be done in constant time.

Hope this helps.

like image 2
Il-Bhima Avatar answered Nov 16 '22 13:11

Il-Bhima


You can try and dig into Markov chains, formed from words of sentences. Also you'll require both-way chain (i.e. to find next and previous words), i.e. store probable words that appear just after the given or just before it.

Of course, Markov chain is a stochastic process to generate content, however similar approach may be used to store information you need.

like image 1
dragonfly Avatar answered Nov 16 '22 12:11

dragonfly