I came across an interview question that I would like to hear your opinion on. The questions are, in designing a web crawler:
1) what kind of pages will you hit with a DFS versus BFS?
2) how would you avoid getting into infinite loops?
I would appreciate it if somebody could answer them.
First, some background on the two traversals. A BFS implementation starts from the source (here, a seed web page) and visits the links level by level, finishing one level completely before moving on to the next; this is the same strategy a broadcast packet follows to reach every node in a network. DFS instead goes deeper into the graph whenever possible, exploring the edges of the most recently discovered vertex one at a time until it can go no further, then backtracking.
Both traversals have well-known applications. BFS: Kahn's algorithm uses it to find a topological sort of a DAG, it finds shortest paths in an unweighted graph, and bipartite checking can be done with it (or with DFS). DFS: detecting a cycle in a graph (a back edge found during DFS means there is a cycle) and path finding (DFS can be specialized to search for a path between two vertices).
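To make the contrast concrete, here is a minimal Python sketch (mine, not from the question) that traverses a small made-up link graph both ways; the only difference between the two functions is a FIFO queue versus a LIFO stack.

```python
from collections import deque

# Tiny made-up link graph: page -> pages it links to.
links = {
    "home":   ["about", "blog"],
    "about":  ["team"],
    "blog":   ["post-1", "post-2"],
    "team":   [],
    "post-1": ["post-2"],
    "post-2": [],
}

def bfs(start):
    """Visit pages level by level: everything one link away, then two, ..."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        page = queue.popleft()          # FIFO -> breadth-first
        order.append(page)
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def dfs(start):
    """Follow each chain of links as deep as possible before backtracking."""
    seen, order, stack = set(), [], [start]
    while stack:
        page = stack.pop()              # LIFO -> depth-first
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        stack.extend(reversed(links[page]))
    return order

print(bfs("home"))  # ['home', 'about', 'blog', 'team', 'post-1', 'post-2']
print(dfs("home"))  # ['home', 'about', 'team', 'blog', 'post-1', 'post-2']
```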
1) what kind of pages will you hit with a DFS versus BFS?
In most situations I would use BFS to implement a spider, because most of the valuable information on a site sits only a few links deep; if it is buried much deeper than that, the site is probably badly designed and not worth crawling anyway.
If I want some specific data from one page, plus related data a few hops away, and I want to see results soon after the spider starts running, then I may choose DFS. Say I want to get all the tags from Stack Overflow (the tag page is here), and for each tag, who answered which questions, and I also want to check quickly that the spider is running properly. Then DFS gives me the tag-questions-answers chain soon after the spider starts.
In short, it depends on the usage scenario.
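To illustrate that choice, here is a rough sketch (my own function names and URLs, purely for illustration) of a crawler loop where BFS versus DFS is a one-line configuration; fetch_links() stands in for the real download-and-parse step.

```python
from collections import deque

# Tiny in-memory "site" used in place of real HTTP requests.
FAKE_SITE = {
    "https://example.com/":            ["https://example.com/about", "https://example.com/tags"],
    "https://example.com/tags":        ["https://example.com/tags/python"],
    "https://example.com/tags/python": ["https://example.com/q/1"],
    "https://example.com/q/1":         [],
    "https://example.com/about":       [],
}

def fetch_links(url):
    # Stand-in for downloading the page and parsing out its links.
    return FAKE_SITE.get(url, [])

def crawl(seed, strategy="bfs", max_pages=100):
    seen = {seed}
    frontier = deque([seed])
    order = []
    while frontier and len(order) < max_pages:
        #   popleft() -> FIFO -> breadth-first: one level at a time
        #   pop()     -> LIFO -> depth-first: follow a chain of links first
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# With "dfs" the spider walks the whole tags -> question chain before
# visiting /about, which is what the tag-questions-answers example above
# wants to see early; with "bfs" it finishes everything one link away first.
print(crawl("https://example.com/", "bfs"))
print(crawl("https://example.com/", "dfs"))
```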
2) how would you avoid getting into infinite loops?
This one is comparatively simple. Typical solutions include: keeping a set of already-visited URLs and skipping any link that is in it, normalizing (canonicalizing) URLs before that check so trivially different forms of the same address are not crawled twice, and limiting the crawl depth and the total number of pages fetched.
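A minimal sketch of those ideas (my own helper names, not any particular library's API): normalize each URL before checking it against the visited set, and refuse to go past a fixed depth so crawler traps such as calendar pages or session-ID links cannot keep the spider busy forever.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Reduce trivially different URLs to one canonical form."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        parts.query,
        "",                      # drop the #fragment; it never changes the page
    ))

MAX_DEPTH = 10
visited = set()

def should_crawl(url, depth):
    key = normalize(url)
    if depth > MAX_DEPTH or key in visited:
        return False             # already seen, or too deep: skip it
    visited.add(key)
    return True

print(should_crawl("https://Example.com/tags/", 1))        # True  (first visit)
print(should_crawl("https://example.com/tags#python", 2))  # False (same page after normalization)
```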
I remember that scrapy already seems to handle the second problem; you could read its source code to look for a better solution.