Imagine I have a situation where I need to index sentences. Let me explain it a little bit deeper.
For example I have these sentences:
As far as I can imagine the index should look something like this:
alt text http://img7.imageshack.us/img7/4029/indexarb.png
But also I would like to do search by any of these words.
For example, if I do search by "the" It should show give me connection to "beautiful". if I do search by "beautiful" it should give me connections to (previous)"The", (next)"sky" and "dream". If I search by "sky" it should give (previous) connection to "beautiful" and etc...
Any Ideas ? Maybe you know already existing algorithm for this kind of problem ?
Short Answer
Create a struct with two vectors of previous/forward links. Then store the word structs in a hash table with the key as the word itself.
Long Answer
This is a linguistic parsing problem that is not easily solved unless you don't mind gibberish.
Your linking algorithm will create sentences like:
I'm not quite sure of the SEO applications of this, but I would not welcome another gibberish spam site taking up a search result.
I imagine you would want some sort of Inverted index structure. You would have a Hashmap with the words as keys pointing to lists of pairs of the form (sentence_id, position)
. You would then store your sentences as arrays or linked lists. Your example would look like this:
sentence[0] = ['the','beautiful', 'sky'];
sentence[1] = ['beautiful','sky', 'dream'];
sentence[2] = ['beautiful', 'dream'];
inverted_index =
{
'the': {(0,0)},
'beautiful': {(0,1), (1,0), (2,0)},
'sky' : {(0,2),(1,1)},
'dream':{(1,2), (2,1)}
};
Using this structure lookups on words can be done in constant time. Having identified the word you want, finding the previous and subsequent word in a given sentence can also be done in constant time.
Hope this helps.
You can try and dig into Markov chains, formed from words of sentences. Also you'll require both-way chain (i.e. to find next and previous words), i.e. store probable words that appear just after the given or just before it.
Of course, Markov chain is a stochastic process to generate content, however similar approach may be used to store information you need.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With