
Building a web search engine [closed]

I've always been interested in developing a web search engine. What's a good place to start? I've heard of Lucene, but I'm not a big Java guy. Any other good resources or open source projects?

I understand it's a huge undertaking, but that's part of the appeal. I'm not looking to create the next Google, just something I can use to search a subset of sites that I might be interested in.

Asked Sep 21 '08 by Aseem


2 Answers

There are several parts to a search engine. Broadly speaking, in a hopelessly general manner (folks, feel free to edit if you feel you can add better descriptions, links, etc):

  1. The crawler. This is the part that goes through the web, grabs the pages, and stores information about them in some central data store. In addition to the text itself, you will want things like the time you accessed it, etc. The crawler needs to be smart enough to know how often to hit certain domains, to obey the robots.txt convention, etc. (A minimal crawler-plus-parser sketch follows this list.)

  2. The parser. This reads the data fetched by the crawler, parses it, saves whatever metadata it needs to, throws away junk, and possibly makes suggestions to the crawler on what to fetch next time around.

  3. The indexer. Reads the stuff the parser parsed and builds an inverted index over the terms found on the web pages. It can be as smart as you want it to be -- apply NLP techniques to make indexes of concepts, cross-link things, throw in synonyms, etc. (See the indexing and ranking sketch after this list.)

  4. The ranking engine. Given a few thousand URLs matching "apple", how do you decide which result is the best? Just the index doesn't give you that information. You need to analyze the text, the linking structure, and whatever other pieces you want to look at, and create some scores. This may be done completely on the fly (that's really hard), or based on some pre-computed notions of "experts" (see PageRank, etc).

  5. The front end. Something needs to receive user queries, hit the central engine, and respond; this something needs to be smart about caching results, possibly mixing in results from other sources, etc. It has its own set of problems. (A bare-bones front-end sketch follows the list as well.)
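
To make steps 1 and 2 concrete, here is a rough sketch in Python of a polite, single-threaded crawler plus a tiny parser: it checks robots.txt with the standard library before fetching, records when each page was grabbed, and pulls out the links and visible text for the later stages. The start URL, the one-second delay, and the document layout are illustrative choices, not anything prescribed by a particular project:

    import time
    import urllib.robotparser
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkAndTextExtractor(HTMLParser):
        """A very small 'parser' stage: collect href links and visible text."""
        def __init__(self):
            super().__init__()
            self.links, self.text = [], []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

        def handle_data(self, data):
            if data.strip():
                self.text.append(data.strip())

    def allowed_by_robots(url, user_agent="toy-crawler"):
        """Obey the robots.txt convention before fetching anything."""
        parts = urlparse(url)
        robots = urllib.robotparser.RobotFileParser()
        robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            robots.read()            # a real crawler would cache this per host
        except OSError:
            return False             # if robots.txt is unreachable, skip the URL
        return robots.can_fetch(user_agent, url)

    def crawl(start_url, max_pages=10, delay_seconds=1.0):
        """Fetch pages breadth-first; return 'documents' for the indexer."""
        queue, seen, documents = [start_url], set(), []
        while queue and len(documents) < max_pages:
            url = queue.pop(0)
            if url in seen or not allowed_by_robots(url):
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except OSError:
                continue
            extractor = LinkAndTextExtractor()
            extractor.feed(html)
            documents.append({
                "url": url,
                "fetched_at": time.time(),      # remember when we accessed it
                "text": " ".join(extractor.text),
            })
            for link in extractor.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).scheme in ("http", "https"):
                    queue.append(absolute)      # suggestions for what to fetch next
            time.sleep(delay_seconds)           # don't hammer the site
        return documents

A real crawler would also track per-domain request rates, handle redirects and non-HTML content, and persist the documents somewhere more durable than a Python list.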
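
For steps 3 and 4, the simplest possible indexer and ranker, again only a sketch: it builds an inverted index (term -> postings) from the documents produced by the crawler above and scores matches with plain TF-IDF. Real engines layer stemming, synonyms, phrase queries, and link analysis on top of the same basic structure:

    import math
    import re
    from collections import defaultdict

    def tokenize(text):
        """Naive tokenizer: lowercase runs of letters and digits."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_index(documents):
        """Inverted index: term -> {doc_id: term frequency}."""
        index = defaultdict(dict)
        for doc_id, doc in enumerate(documents):
            for term in tokenize(doc["text"]):
                index[term][doc_id] = index[term].get(doc_id, 0) + 1
        return index

    def search(query, index, documents, top_k=10):
        """Score documents with a basic TF-IDF sum and return the best URLs."""
        scores = defaultdict(float)
        for term in tokenize(query):
            postings = index.get(term, {})
            if not postings:
                continue
            idf = math.log(len(documents) / len(postings))   # rarer terms count more
            for doc_id, tf in postings.items():
                scores[doc_id] += tf * idf
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        return [(documents[doc_id]["url"], score) for doc_id, score in ranked[:top_k]]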
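
And for step 5, a deliberately tiny front end using only the standard library: an HTTP handler that reads a ?q= parameter, calls the search function from the sketch above, and returns ranked URLs as JSON. Result caching, query logging, and mixing in other sources would all live at this layer. It assumes the crawl, build_index, and search helpers defined in the earlier sketches:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    # Build the index once at startup, using the helpers sketched above.
    documents = crawl("https://example.com/", max_pages=25)
    index = build_index(documents)

    class SearchHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            query = parse_qs(urlparse(self.path).query).get("q", [""])[0]
            results = search(query, index, documents) if query else []
            body = json.dumps({"query": query, "results": results}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), SearchHandler).serve_forever()

Hitting http://localhost:8080/?q=apple then returns whatever the toy index knows about.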

My advice -- choose which of these interests you the most, download Lucene or Xapian or any other open source project out there, pull out the bit that does one of the above tasks, and try to replace it. Hopefully, with something better :-).

Some links that may prove useful:

  - "Agile web-crawler", a paper from Estonia (in English).
  - Sphinx Search engine: an indexing and search API. Designed for large DBs, but modular and open-ended.
  - "Information Retrieval", a textbook about IR from Manning et al. Good overview of how the indexes are built, various issues that come up, as well as some discussion of crawling, etc. Free online version (for now)!

Answered Oct 13 '22 by 3 revs

Xapian is another option for you. I've heard it scales better than some implementations of Lucene.
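
If you do try Xapian, it ships Python bindings, so you can experiment without writing C++. Here is a minimal sketch of indexing a couple of strings and querying them -- assuming the python-xapian bindings are installed; the calls follow Xapian's getting-started examples, so check the current docs for details:

    import xapian

    # Index two tiny "pages" into an on-disk database.
    db = xapian.WritableDatabase("toy-index", xapian.DB_CREATE_OR_OPEN)
    term_generator = xapian.TermGenerator()
    term_generator.set_stemmer(xapian.Stem("en"))

    for text in ("apple pie recipes", "apple hardware reviews"):
        doc = xapian.Document()
        doc.set_data(text)                  # payload to show in results
        term_generator.set_document(doc)
        term_generator.index_text(text)     # generate (stemmed) index terms
        db.add_document(doc)
    db.commit()

    # Query the same database.
    query_parser = xapian.QueryParser()
    query_parser.set_stemmer(xapian.Stem("en"))
    query_parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    query = query_parser.parse_query("apple recipe")

    enquire = xapian.Enquire(xapian.Database("toy-index"))
    enquire.set_query(query)
    for match in enquire.get_mset(0, 10):
        print(match.rank, match.document.get_data().decode("utf-8"))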

Answered Oct 13 '22 by Oli