 

DFS vs BFS in web crawler design [closed]

I came across an interview question that I would like your opinion on. Suppose you are designing a web crawler:

1) What kind of pages would you crawl with DFS versus BFS?

2) How would you avoid getting into infinite loops?

I would appreciate it if somebody could answer these.

Asked Dec 14 '13 by Nazgol


1 Answer

1) what kind of pages will you hit with a DFS versus BFS?

In most situations I would use BFS to implement a crawler, because the most valuable information on a site usually sits at a shallow link depth. If it doesn't, the site is probably badly designed and not worth crawling deeply anyway.

I might choose DFS instead if I want some specific data from one page plus related data a few hops away, and I want to see complete results soon after the spider starts. Say I want to scrape all the tags from Stack Overflow (the tag page is here), and for each tag I also want who answered which questions. I also want to check quickly that the spider is running properly. With DFS, I get complete tag→questions→answers chains soon after the spider starts, rather than waiting for a whole level to finish.

In short, it depends on the usage scenario.
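The difference between the two strategies comes down to the frontier data structure: a FIFO queue gives BFS, a LIFO stack gives DFS. Here is a minimal sketch using a hypothetical in-memory link graph in place of real HTTP fetches (the URLs and the `LINKS` map are made up for illustration):

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP fetching + link extraction.
LINKS = {
    "/": ["/tags", "/about"],
    "/tags": ["/tags/python", "/tags/java"],
    "/tags/python": ["/q/1"],
    "/tags/java": [],
    "/about": [],
    "/q/1": [],
}

def crawl(start, strategy="bfs"):
    """Return pages in visit order. BFS pops from the front of the
    frontier (queue); DFS pops from the back (stack)."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

With BFS the crawler finishes all depth-1 pages (`/tags`, `/about`) before touching depth-2 pages; with DFS it dives down one branch (e.g. `/tags` → `/tags/python` → `/q/1`) before returning to siblings, which is why DFS yields complete tag→question chains earlier.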

2) how would you avoid getting into infinite loops?

This one is simpler. Common solutions include:

  • Enforce a maximum link depth (MAX LINK DEPTH).
  • Record the URLs you have already crawled, and before emitting a new request, check whether the URL has been crawled.

If I remember correctly, Scrapy already handles the second point; you could read its source code to look for a better solution.
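Both safeguards can be combined in a few lines. This is a sketch, not production code: `get_links` is a hypothetical callback standing in for fetching a page and extracting its links, and the cyclic toy graph exists only to show that the `seen` set breaks the loop:

```python
from collections import deque

def crawl_bounded(start, get_links, max_depth=3):
    """BFS crawl that avoids infinite loops with two guards:
    a visited set (never re-request a URL) and a depth cap."""
    seen = {start}
    frontier = deque([(start, 0)])
    order = []
    while frontier:
        url, depth = frontier.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # MAX LINK DEPTH reached: do not expand further
        for link in get_links(url):
            if link not in seen:  # skip URLs we have already crawled
                seen.add(link)
                frontier.append((link, depth + 1))
    return order

# Toy graph with two cycles: /a <-> /b, and /c linking to itself.
# Without the seen set, this crawl would never terminate.
graph = {"/a": ["/b"], "/b": ["/a", "/c"], "/c": ["/c"]}
pages = crawl_bounded("/a", lambda url: graph.get(url, []))
```

Each page is visited exactly once despite the cycles, and the depth counter would stop the crawl even on an infinitely deep site (e.g. a calendar page that always links to "next month").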

Answered Sep 20 '22 by flyer