
c# .net 4.5 async / multithread?

I'm writing a C# console application that scrapes data from web pages.

This application will go to about 8000 web pages and scrape data (the same format of data on each page).

I have it working right now with no async methods and no multithreading.

However, I need it to be faster. It only uses about 3%-6% of the CPU; I think this is because it spends most of its time waiting for the HTML to download (WebClient.DownloadString(url)).

This is the basic flow of my program:

```csharp
DataSet allData;

foreach (var url in the8000urls)
{
    // ScrapeData downloads the html from the url with WebClient.DownloadString
    // and scrapes the data into several datatables which it returns as a dataset.
    DataSet dataForOnePage = ScrapeData(url);

    // merge each table in dataForOnePage into allData
}

// PushAllDataToSql(allData);
```

I've been trying to multithread this but am not sure how to get started properly. I'm using .NET 4.5, and my understanding is that async and await in 4.5 are designed to make this much easier to program, but I'm still a little lost.

My idea was to just keep spawning new asynchronous operations for this line:

```csharp
DataSet dataForOnePage = ScrapeData(url);
```

and then, as each one finishes, run:

```csharp
// merge each table in dataForOnePage into allData
```

Can anyone point me in the right direction on how to make that line async in .NET 4.5 C#, and then have my merge method run on completion?

Thank you.

Edit: Here is my ScrapeData method:

```csharp
public static DataSet GetProperyData(CookieAwareWebClient webClient, string pageid)
{
    var dsPageData = new DataSet();

    // DOWNLOAD HTML FOR THE REO PAGE AND LOAD IT INTO AN HTMLDOCUMENT
    string url = @"https://domain.com?&id=" + pageid + @"restofurl";
    string html = webClient.DownloadString(url);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // A BUNCH OF PARSING WITH HTMLAGILITY AND STORING IN dsPageData

    return dsPageData;
}
```
asked Jul 24 '12 by Kyle


2 Answers

If you want to use the async and await keywords (you don't have to, but they do make things easier in .NET 4.5), you would first want to change your ScrapeData method to return a Task&lt;DataSet&gt; instance, using the async keyword, like so:

```csharp
async Task<DataSet> ScrapeDataAsync(Uri url)
{
    // Create the HttpClientHandler which will handle cookies.
    var handler = new HttpClientHandler();

    // Set cookies on handler.

    // Await on an async call to fetch here, convert to a data
    // set and return.
    var client = new HttpClient(handler);

    // Wait for the HttpResponseMessage.
    HttpResponseMessage response = await client.GetAsync(url);

    // Get the content, await on the string content.
    string content = await response.Content.ReadAsStringAsync();

    // Process content variable here into a data set and return.
    DataSet ds = ...;

    // Return the DataSet; the method wraps it in a Task<DataSet>.
    return ds;
}
```

Note that you'll probably want to move away from the WebClient class, as it doesn't inherently expose Task&lt;T&gt;-based async operations. A better choice in .NET 4.5 is the HttpClient class, which I've used above. Also, take a look at the HttpClientHandler class, specifically the CookieContainer property, which you'll use to send cookies with each request.
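As a rough sketch of that cookie setup (the cookie name, value, and domain here are placeholders, not details from the question):

```csharp
using System;
using System.Net;
using System.Net.Http;

var handler = new HttpClientHandler();
handler.CookieContainer = new CookieContainer();

// Hypothetical cookie; substitute whatever your site's login flow requires.
handler.CookieContainer.Add(
    new Uri("https://domain.com"),
    new Cookie("sessionid", "your-session-value"));

// The handler (and its cookies) is now attached to every request this client sends.
var client = new HttpClient(handler);
```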

However, this means you will likely need to use the await keyword on another async operation inside the method, which in this case would be the download of the page. You'll have to tailor the calls that download data to use the asynchronous versions and await those.

Once that is complete, you would normally await the call, but you can't store the result in a single variable in this scenario: you are running a loop, so the variable would be overwritten on each iteration. Instead, it's better to store each Task&lt;DataSet&gt; in a collection, like so:

```csharp
DataSet allData = ...;

var tasks = new List<Task<DataSet>>();

foreach (var url in the8000urls)
{
    // ScrapeDataAsync downloads the html from the url and scrapes
    // the data into several datatables which it returns as a dataset.
    tasks.Add(ScrapeDataAsync(url));
}
```

There is still the matter of merging the data into allData. To that end, you want to call the ContinueWith method on the Task&lt;T&gt; instance returned and perform the work of adding the data to allData there:

```csharp
DataSet allData = ...;

var tasks = new List<Task>();

foreach (var url in the8000urls)
{
    tasks.Add(ScrapeDataAsync(url).ContinueWith(t =>
    {
        // Lock access to the data set, since this is
        // async now.
        lock (allData)
        {
            // Add the data.
        }
    }));
}
```

Then, you can wait on all the tasks using the WhenAll method on the Task class and await on that:

```csharp
// After your loop.
await Task.WhenAll(tasks);

// Process allData
```

However, note that you have a foreach, and WhenAll takes an IEnumerable&lt;T&gt; implementation. This is a good indicator that this is suitable for LINQ, which it is:

```csharp
DataSet allData = ...;

var tasks =
    from url in the8000Urls
    select ScrapeDataAsync(url).ContinueWith(t =>
    {
        // Lock access to the data set, since this is
        // async now.
        lock (allData)
        {
            // Add the data.
        }
    });

await Task.WhenAll(tasks);

// Process allData
```

You can also choose not to use query syntax if you wish; it doesn't matter in this case.
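For reference, a minimal sketch of the equivalent method syntax (same placeholders as above; the merge body is elided):

```csharp
// Select projects each url to a continuation task; WhenAll
// materializes the sequence and waits for every task in it.
var tasks = the8000Urls.Select(url =>
    ScrapeDataAsync(url).ContinueWith(t =>
    {
        lock (allData)
        {
            // Add the data.
        }
    }));

await Task.WhenAll(tasks);
```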

Note that if the containing method is not marked as async (because you are in a console application and have to wait for the results before the app terminates), you can simply call the Wait method on the Task returned by WhenAll:

```csharp
// This will block, waiting for all tasks to complete. All
// tasks will run asynchronously, and when all are done, the
// code will continue to execute.
Task.WhenAll(tasks).Wait();

// Process allData.
```

The point is: you want to collect your Task instances into a sequence and then wait on the entire sequence before you process allData.

However, I'd suggest trying to process the data before merging it into allData if you can; unless the processing requires the entire DataSet, you'll get even more performance gains by processing each piece of data as it comes back, rather than waiting for all of it to come back.
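A minimal sketch of that idea (ProcessOnePage is a hypothetical per-page step, not part of the original answer): do the per-page work inside the continuation, outside the lock, and take the lock only for the cheap merge at the end:

```csharp
tasks.Add(ScrapeDataAsync(url).ContinueWith(t =>
{
    // Hypothetical: reduce/transform this page's data first,
    // in parallel with the other pages, without holding the lock.
    DataSet processed = ProcessOnePage(t.Result);

    // Then take the lock only for the (cheaper) merge.
    lock (allData)
    {
        allData.Merge(processed);
    }
}));
```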

answered Sep 21 '22 by casperOne


You could also use TPL Dataflow, which is a good fit for this kind of problem.

In this case, you build a "dataflow mesh" and then your data flows through it.

This one is actually more like a pipeline than a "mesh". I'm putting in three steps: Download the (string) data from the URL; Parse the (string) data into HTML and then into a DataSet; and Merge the DataSet into the master DataSet.

First, we create the blocks that will go in the mesh:

```csharp
DataSet allData;

var downloadData = new TransformBlock<string, string>(
    async pageid =>
    {
        // Note: create the client rather than leaving it null.
        var webClient = new System.Net.WebClient();
        var url = "https://domain.com?&id=" + pageid + "restofurl";
        return await webClient.DownloadStringTaskAsync(url);
    },
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
    });

var parseHtml = new TransformBlock<string, DataSet>(
    html =>
    {
        var dsPageData = new DataSet();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // HTML Agility parsing

        return dsPageData;
    },
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
    });

var merge = new ActionBlock<DataSet>(
    dataForOnePage =>
    {
        // merge dataForOnePage into allData
    });
```

Then we link the three blocks together to create the mesh:

```csharp
downloadData.LinkTo(parseHtml);
parseHtml.LinkTo(merge);
```

Next, we start pumping data into the mesh:

```csharp
foreach (var pageid in the8000urls)
    downloadData.Post(pageid);
```

And finally, we wait for each step in the mesh to complete (this will also cleanly propagate any errors):

```csharp
downloadData.Complete();
await downloadData.Completion;
parseHtml.Complete();
await parseHtml.Completion;
merge.Complete();
await merge.Completion;
```

The nice thing about TPL Dataflow is that you can easily control how parallel each part is. For now, I've set both the download and parsing blocks to be Unbounded, but you may want to restrict them. The merge block uses the default maximum parallelism of 1, so no locks are necessary when merging.
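For example, restricting the download block could look like this (a sketch; the cap of 10 is an arbitrary illustrative value, not a recommendation from the answer):

```csharp
var downloadData = new TransformBlock<string, string>(
    async pageid =>
    {
        // ... download as in the block above ...
        return html;
    },
    new ExecutionDataflowBlockOptions
    {
        // Cap concurrent downloads so we don't flood the
        // server (or exhaust local connections).
        MaxDegreeOfParallelism = 10,
    });
```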

answered Sep 25 '22 by Stephen Cleary