Do you guys know where I can find a search engine parser design diagram? I need to understand how it processes user input: which functions and algorithms are used, which conditions apply, and so on.
It doesn't have to be Google's.
Updated question to search engine parser
You need a better understanding of search engines first. There are normally four parts:
1) A web crawler, something that gets the documents you want to add to your search data space. This is usually totally outside the scope of what you call a "search engine".
2) A parser, which takes a document and splits it into indexable text fragments. It usually works with different file formats and human languages, and preprocesses the text into maybe some fixed records and flowing text. Linguistic algorithms (like stemmers; search for "Porter Stemmer" to get a simple one) are also applied here.
3) An indexer, which might be as simple as an inverted list (for each word, the documents that contain it) or as complex as you want if you try to be as clever as Google. Building an index is the really magic part of a successful search engine. Usually there are multiple ranking algorithms that are put together.
4) The frontend with an optional query language. This is where Google is really bad, but as you can see from Google's success it might not be so important for 98% of people. But I really miss this.
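To make steps (2) and (3) more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `STOP_WORDS` list, the `naive_stem` helper (a crude stand-in for a real stemmer such as Porter's, not an implementation of it), and the three sample documents. It only shows the basic shape: parse text into normalized terms, then map each term to the documents that contain it.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stop-word list chosen for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "and"}


def naive_stem(word):
    """Crude suffix stripping; a placeholder for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse(text):
    """Step 2: split raw text into lowercase, stemmed, indexable terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Step 3: build an inverted index mapping each term to a set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in parse(text):
            index[term].add(doc_id)
    return index


# Made-up sample documents.
docs = {
    1: "The crawler fetches documents",
    2: "The parser is splitting documents",
    3: "Indexing parsed text",
}
index = build_index(docs)
```

Looking up `index["document"]` returns the ids of documents 1 and 2. The stemmer here is deliberately naive (it turns "splitting" into "splitt"), which is exactly why real engines use proper linguistic algorithms at this stage.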
I think you are asking about (3), the indexer. Basically there are two different kinds of algorithms you find in classic information retrieval literature: the Vector Space model and Boolean Search. The latter is easy: just check whether the search words are inside the document and return a boolean value. Each search term can be given a relevance probability, and for different search terms you can use Bayesian probability to combine the relevance and return the highest ranked documents. The vector model treats a document as a vector of all its words; you can build a scalar (dot) product between documents to judge whether they are close together. This is a much more complex theory. The father of IR (information retrieval) was Gerard Salton; you will find a lot of literature under his name.
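As a rough sketch of the two classic models, the Python below shows a Boolean AND match and a cosine similarity (the normalized scalar product) between two texts. It assumes raw term counts with no weighting such as tf-idf and no probabilistic ranking; the function names and sample documents are made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Split on whitespace and lowercase; a real parser would do far more."""
    return text.lower().split()


def boolean_and(query, docs):
    """Boolean search: return ids of documents containing every query term."""
    terms = set(tokenize(query))
    return [doc_id for doc_id, text in docs.items() if terms <= set(tokenize(text))]


def cosine(a, b):
    """Vector space model: cosine of the angle between two term-count vectors."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)  # scalar product over shared terms
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Made-up sample documents.
docs = {
    "a": "search engine index",
    "b": "search engine parser",
    "c": "usenet news archive",
}

matches = boolean_and("search engine", docs)   # documents "a" and "b"
close = cosine(docs["a"], docs["b"])           # two of three terms shared
far = cosine(docs["a"], docs["c"])             # no shared terms, so 0.0
```

The Boolean result is a plain yes/no per document, while the cosine value gives a graded score that can be used to rank documents against each other or against a query treated as a short document.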
This was the state of the art in IR until 1999 (I wrote my diploma thesis about a Usenet news search engine in 1998). Then Google came and all the theory went into the trashcan of academic stupidity and practical irrelevance.
Google was not built on mainstream IR theory. Read about it in the link that Srirangan gave you. It's just an ad hoc relevance function built on many, many different sources. You will not find anything in this area besides white paper marketing blah-blah. These algorithms are the business secret and capital of the search engine companies.
For simple search engines, look at the Lucene library or at dtSearch, which was always my choice for an embeddable search engine library.
There is not really a lot of example code or available information in the open source world about IR technology. Most projects, like Lucene, just implement the most primitive operations. You have to buy books and go to a university library to get access to research literature.
As literature, I would recommend starting with this book (cover image: http://ecx.images-amazon.com/images/I/41HKJYHTQDL._BO2,204,203,200_PIsitb-sticker-arrow-click,TopRight,35,-76_AA240_SH20_OU01_.jpg).