I have a simple web crawler that starts at a root (a given URL), downloads the HTML of the root page, then scans it for hyperlinks and crawls them. I currently store the HTML pages in an SQL database. I am facing two problems:
First, the crawling seems to reach a bottleneck and isn't able to go any faster. I've read somewhere that making multi-threaded HTTP requests for pages can make the crawler faster, but I am not sure how to do this.
Second, I need an efficient data structure to store the HTML pages and be able to run data-mining operations on them (I'm currently using an SQL database and would like to hear other recommendations).
I am using the .NET Framework, C#, and MS SQL.
So first and foremost, I wouldn't worry about getting into distributed crawling and storage, because, as the name suggests, it requires a decent number of machines for you to get good results. Unless you have a farm of computers, you won't really benefit from it. You can build a crawler that gets 300 pages per second and run it on a single computer with a 150 Mbps connection.
The next thing on the list is to determine where your bottleneck is.
Try to eliminate MS SQL from the equation first: take a list of, say, 1000 URLs that you want to crawl and measure how fast you can fetch them without writing anything to the database.
If 1000 URLs doesn't give you a large enough crawl, then get 10,000 URLs or 100k URLs (or, if you're feeling brave, the Alexa top 1 million). In any case, try to establish a baseline with as many variables excluded as possible.
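For example, a bare-bones baseline run might look something like the sketch below (this assumes .NET 4.5+ so HttpClient is available; the file name `urls.txt` is just a placeholder for whatever sample list you put together):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Net.Http;

class Baseline
{
    static void Main()
    {
        var urls = File.ReadAllLines("urls.txt");   // your sample of 1000+ URLs
        var http = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };
        int ok = 0;
        var watch = Stopwatch.StartNew();

        foreach (var url in urls)
        {
            try
            {
                // Fetch only; deliberately no parsing and no database writes,
                // so the number below reflects raw download speed.
                var html = http.GetStringAsync(url).Result;
                ok++;
            }
            catch (Exception) { /* ignore failures for the baseline */ }
        }

        watch.Stop();
        Console.WriteLine("{0} pages in {1:F1}s = {2:F2} pages/sec",
            ok, watch.Elapsed.TotalSeconds, ok / watch.Elapsed.TotalSeconds);
    }
}
```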
After you have your baseline for the crawl speed, try to determine what's causing your slowdown. Furthermore, you will need to start using multithreading, because you're I/O bound and have a lot of spare time between page fetches that you can spend extracting links and doing other things, like working with the database.
How many pages per second are you getting now? You should try to get more than 10 pages per second.
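One way to overlap those fetches (a sketch, not the only approach) is to issue a bounded number of asynchronous requests at once with HttpClient, Task.WhenAll and a SemaphoreSlim acting as a throttle. This again assumes .NET 4.5+, and the limit of 50 concurrent requests is just a starting point to tune:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ConcurrentFetcher
{
    private static readonly HttpClient Http = new HttpClient();
    private static readonly SemaphoreSlim Throttle = new SemaphoreSlim(50); // max in-flight requests

    static async Task<string> FetchAsync(string url)
    {
        await Throttle.WaitAsync();
        try
        {
            return await Http.GetStringAsync(url);
        }
        catch (Exception)
        {
            return null;  // swallow failures for brevity; a real crawler would log/retry
        }
        finally
        {
            Throttle.Release();
        }
    }

    // Fetch a batch of URLs concurrently and hand the HTML back to the caller,
    // which can then extract links and write to storage while more fetches run.
    static async Task<IDictionary<string, string>> FetchAllAsync(IEnumerable<string> urls)
    {
        var tasks = urls.Select(async url => new { url, html = await FetchAsync(url) });
        var results = await Task.WhenAll(tasks);
        return results.Where(r => r.html != null)
                      .ToDictionary(r => r.url, r => r.html);
    }

    static void Main(string[] args)
    {
        var pages = FetchAllAsync(args).Result;
        Console.WriteLine("Fetched {0} pages", pages.Count);
    }
}
```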
Obviously, the next step is to tweak your crawler as much as possible.
If you've mastered all of the above, then I would suggest you try to go pro! It's important that you have a good selection algorithm that mimics PageRank in order to balance freshness and coverage: OPIC (Adaptive On-line Page Importance Computation) is pretty much the latest and greatest in that respect. If you have the above tools, then you should be able to implement OPIC and run a fairly fast crawler.
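As a rough sketch of the bookkeeping behind OPIC: every known page holds some "cash"; when a page is crawled, its cash is credited to its history and redistributed equally to its out-links, and the pages with the most accumulated history are treated as the most important. The class below is only that bookkeeping (the full algorithm also uses a virtual page for dangling links and time windows for adaptivity), so treat it as an illustration, not a complete implementation:

```csharp
using System.Collections.Generic;
using System.Linq;

class OpicState
{
    private readonly Dictionary<string, double> _cash = new Dictionary<string, double>();
    private readonly Dictionary<string, double> _history = new Dictionary<string, double>();

    public void AddPage(string url, double initialCash)
    {
        if (!_cash.ContainsKey(url))
        {
            _cash[url] = initialCash;
            _history[url] = 0.0;
        }
    }

    // Call after `url` has been fetched and its out-links extracted.
    public void OnCrawled(string url, IList<string> outLinks)
    {
        double cash = _cash.ContainsKey(url) ? _cash[url] : 0.0;
        _history[url] = (_history.ContainsKey(url) ? _history[url] : 0.0) + cash;
        _cash[url] = 0.0;

        if (outLinks.Count == 0) return;          // dangling page: cash dropped in this sketch
        double share = cash / outLinks.Count;     // the real algorithm routes it via a virtual page
        foreach (var link in outLinks)
        {
            AddPage(link, 0.0);
            _cash[link] += share;
        }
    }

    // Simple greedy selection policy: crawl the page holding the most cash next.
    public string NextToCrawl()
    {
        return _cash.OrderByDescending(kv => kv.Value)
                    .Select(kv => kv.Key)
                    .FirstOrDefault();
    }
}
```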
If you're flexible on the programming language and don't mind straying from C#, then you can try Java-based enterprise-level crawlers such as Nutch. Nutch integrates with Hadoop and all kinds of other highly scalable solutions.
This is what Google's BigTable was designed for. HBase is a popular open-source clone, but you'll need to deal with Java and (probably) Linux. Cassandra is also written in Java, but runs on Windows. Both have .NET clients available.
Because they are designed to be distributed across many machines (deployments of thousands of nodes exist), they can sustain extremely heavy read/write loads, far more than even the fastest SQL Server or Oracle hardware could.
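For illustration, here is a minimal sketch of writing a crawled page to Cassandra from C#, assuming the DataStax C# driver (NuGet package "CassandraCSharpDriver") and a keyspace and table you have created yourself; the names used here are made up:

```csharp
using System;
using Cassandra;

class PageStore
{
    private readonly ISession _session;
    private readonly PreparedStatement _insert;

    public PageStore(string contactPoint)
    {
        var cluster = Cluster.Builder().AddContactPoint(contactPoint).Build();
        _session = cluster.Connect("crawler");  // keyspace created beforehand, e.g. with cqlsh
        // Table assumed to exist:
        //   CREATE TABLE pages (url text PRIMARY KEY, fetched_at timestamp, html text);
        _insert = _session.Prepare(
            "INSERT INTO pages (url, fetched_at, html) VALUES (?, ?, ?)");
    }

    public void Save(string url, string html)
    {
        _session.Execute(_insert.Bind(url, DateTimeOffset.UtcNow, html));
    }
}
```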
If you are not comfortable with Java infrastructure, you might want to look into Microsoft's Azure Table Storage, which has similar characteristics. It's a hosted/cloud solution though; you can't run it on your own hardware.
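A comparable sketch using the Azure.Data.Tables client follows; the table name and key scheme are made up, and note that string properties are capped at roughly 64 KB, so full HTML pages may be better off in Blob Storage with only metadata in the table:

```csharp
using System;
using Azure.Data.Tables;  // NuGet package "Azure.Data.Tables"

class AzurePageStore
{
    private readonly TableClient _table;

    public AzurePageStore(string connectionString)
    {
        _table = new TableClient(connectionString, "CrawledPages");
        _table.CreateIfNotExists();
    }

    public void Save(string url, string html)
    {
        var uri = new Uri(url);
        // Partition by host so pages from one site stay together; the row key is the escaped path.
        var entity = new TableEntity(uri.Host, Uri.EscapeDataString(uri.PathAndQuery))
        {
            { "Html", html }   // caution: string properties are limited to ~64 KB
        };
        _table.UpsertEntity(entity);
    }
}
```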
As for processing the data: if you go for HBase or Cassandra, you can use Hadoop MapReduce. MapReduce was popularized by Google for exactly the task you are describing: processing huge amounts of web data. In a nutshell, the idea is that rather than running your algorithm in one place and piping all of the data through it, MapReduce sends your program out to run on the machines where the data is stored. It allows you to run algorithms on basically unlimited amounts of data, assuming you have the hardware for it.
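To make the map/reduce shape concrete, here is a toy example written for Hadoop Streaming, which runs any executable that reads key/value lines on stdin and writes them to stdout. The job counts crawled pages per host; the input format (one URL per line) and the idea of running the same executable as both mapper and reducer are assumptions of this sketch, not part of the answer above:

```csharp
using System;

// Usage with Hadoop Streaming (reducer input arrives sorted by key):
//   mapper:  CrawlStats.exe map     (stdin: one URL per line  -> "host \t 1")
//   reducer: CrawlStats.exe reduce  (stdin: sorted "host \t 1" -> "host \t total")
class CrawlStats
{
    static void Main(string[] args)
    {
        if (args.Length > 0 && args[0] == "reduce") Reduce(); else Map();
    }

    static void Map()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            Uri uri;
            if (Uri.TryCreate(line.Trim(), UriKind.Absolute, out uri))
                Console.WriteLine(uri.Host + "\t1");
        }
    }

    static void Reduce()
    {
        string currentKey = null;
        long count = 0;
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var parts = line.Split('\t');
            if (parts[0] != currentKey)
            {
                if (currentKey != null) Console.WriteLine(currentKey + "\t" + count);
                currentKey = parts[0];
                count = 0;
            }
            count += long.Parse(parts[1]);
        }
        if (currentKey != null) Console.WriteLine(currentKey + "\t" + count);
    }
}
```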