 

DFS vs BFS in web crawler design [closed]

I came across an interview question that I would like your opinion on. Suppose you are designing a web crawler:

1) What kind of pages would you crawl with DFS versus BFS?

2) How would you avoid getting into infinite loops?

I would appreciate it if somebody could answer these.

Asked Dec 14 '13 by Nazgol


1 Answer

1) what kind of pages will you hit with a DFS versus BFS?

In most situations I would use BFS to implement a crawler, because the most valuable information on a site usually sits at a shallow link depth. If it doesn't, the site is probably badly designed and not worth crawling deeply anyway.

I might choose DFS instead if I want some specific data from one page plus related data a few hops away, and I want to see complete results soon after the spider starts. Say I want to scrape all the tags from Stack Overflow (the tag page is here), and for each tag I also want who answered which questions. I also want to check quickly that the spider is running properly. With DFS, I get complete tag→questions→answers chains soon after the spider starts, rather than waiting for a whole level to finish.

In short, it depends on the usage scenario.
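The difference between the two strategies comes down to the frontier data structure: a FIFO queue gives BFS, a LIFO stack gives DFS. Here is a minimal sketch using a hypothetical in-memory link graph in place of real HTTP fetches (the URLs and the `LINKS` map are made up for illustration):

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP fetching + link extraction.
LINKS = {
    "/": ["/tags", "/about"],
    "/tags": ["/tags/python", "/tags/java"],
    "/tags/python": ["/q/1"],
    "/tags/java": [],
    "/about": [],
    "/q/1": [],
}

def crawl(start, strategy="bfs"):
    """Return pages in visit order. BFS pops from the front of the
    frontier (queue); DFS pops from the back (stack)."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

With BFS the crawler finishes all depth-1 pages (`/tags`, `/about`) before touching depth-2 pages; with DFS it dives down one branch (e.g. `/tags` → `/tags/python` → `/q/1`) before returning to siblings, which is why DFS yields complete tag→question chains earlier.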

2) how would you avoid getting into infinite loops?

This one is simpler. Common solutions include:

  • Enforce a maximum link depth (MAX LINK DEPTH).
  • Record the URLs you have already crawled, and before emitting a new request, check whether the URL has been crawled.

If I remember correctly, Scrapy already handles the second point; you could read its source code to look for a better solution.
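Both safeguards can be combined in a few lines. This is a sketch, not production code: `get_links` is a hypothetical callback standing in for fetching a page and extracting its links, and the cyclic toy graph exists only to show that the `seen` set breaks the loop:

```python
from collections import deque

def crawl_bounded(start, get_links, max_depth=3):
    """BFS crawl that avoids infinite loops with two guards:
    a visited set (never re-request a URL) and a depth cap."""
    seen = {start}
    frontier = deque([(start, 0)])
    order = []
    while frontier:
        url, depth = frontier.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # MAX LINK DEPTH reached: do not expand further
        for link in get_links(url):
            if link not in seen:  # skip URLs we have already crawled
                seen.add(link)
                frontier.append((link, depth + 1))
    return order

# Toy graph with two cycles: /a <-> /b, and /c linking to itself.
# Without the seen set, this crawl would never terminate.
graph = {"/a": ["/b"], "/b": ["/a", "/c"], "/c": ["/c"]}
pages = crawl_bounded("/a", lambda url: graph.get(url, []))
```

Each page is visited exactly once despite the cycles, and the depth counter would stop the crawl even on an infinitely deep site (e.g. a calendar page that always links to "next month").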

Answered Sep 20 '22 by flyer