When there is no web service API available, your only option might be to screen scrape, but how do you do that in C#? How would you go about it?
Screen scraping is the act of copying information that shows on a digital display so it can be used for another purpose. Visual data can be collected as raw text from on-screen elements such as text or images that appear on the desktop, in an application, or on a website.
Fintech apps that use screen scraping require you to provide your online banking username and password so they can access your financial data. They use these credentials to log into your bank account automatically, as if they were you, and then transfer your data to an external database that supports their products and services.
Screen scraping, the practice of collecting display data from one application and translating it so that another application can present it, is most often used to capture output from a legacy application so it can be shown through a more modern user interface.
Matt and Paul's answers are correct. "Screen scraping" by parsing the HTML from a website is usually a bad idea because:
Parsing HTML can be difficult, especially if it's malformed. If you're scraping a very, very simple page then regular expressions might work (a minimal regex sketch follows this list). Otherwise, use a parsing framework like the HTML Agility Pack.
Websites are a moving target. You'll need to update your code each time the source website changes its markup structure.
Screen scraping doesn't play well with JavaScript. If the target website uses any sort of dynamic script to manipulate the webpage, you're going to have a very hard time scraping it. It's easy to grab the HTTP response; it's a lot harder to scrape what the browser displays in response to client-side script contained in that response.
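For the "very, very simple page" case, a regex can work, though it's brittle. Here's a minimal sketch; the snippet, element id, and pattern are all hypothetical:

    using System;
    using System.Text.RegularExpressions;

    class RegexScrapeSketch
    {
        static void Main()
        {
            // Hypothetical, trivially simple markup -- the only case where regex is defensible.
            string html = "<span id=\"price\">19.99</span>";

            // Breaks as soon as the site adds an attribute, whitespace, or reorders anything.
            Match m = Regex.Match(html, "<span id=\"price\">([^<]+)</span>");
            if (m.Success)
                Console.WriteLine(m.Groups[1].Value); // prints: 19.99
        }
    }

The moment the markup shifts, the pattern silently stops matching, which is exactly why a real parser is the better default.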
If screen scraping is the only option, here are some keys to success:
Make it as easy as possible to change the patterns you look for. If possible, store the patterns as text files or in a resource file somewhere. Make it very easy for other developers (or yourself in three months) to understand what markup you expect to find (the first sketch after this list shows one way to externalize patterns).
Validate input and throw meaningful exceptions. In your parsing code, take care to make your exceptions very helpful. The target site will change on you, and when that happens you want your error messages to tell you not only what part of the code failed, but why it failed. Mention both the pattern you're looking for AND the text you're comparing it against (the same sketch below demonstrates this).
Write lots of automated tests. You want it to be very easy to run your scraper in a non-destructive fashion, because you will be doing a lot of iterative development to get the patterns right. Automate as much testing as you can; it will pay off in the long run (a test sketch follows this list).
Consider a browser automation tool like WatiN. If you require complex interactions with the target website, it might be easier to write your scraper from the point of view of the browser itself, rather than mucking with the HTTP requests and responses by hand (a minimal WatiN sketch appears below).
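To illustrate the first two points, here is a rough sketch of a scraper that loads its XPath pattern from an external text file and, on failure, reports both the pattern and a sample of the text it ran against. The file name, the pattern, and the use of the HTML Agility Pack are assumptions for illustration, not a prescribed layout:

    using System;
    using System.IO;
    using HtmlAgilityPack;

    class PatternDrivenScraper
    {
        // Hypothetical file holding one XPath expression per line,
        // e.g. //table[@id='results']//tr
        const string PatternFile = "patterns.txt";

        public static HtmlNodeCollection ExtractRows(string html)
        {
            string xpath = File.ReadAllLines(PatternFile)[0];

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // SelectNodes returns null when nothing matches.
            var rows = doc.DocumentNode.SelectNodes(xpath);
            if (rows == null)
            {
                // Name the pattern AND the text it was compared against.
                string sample = html.Substring(0, Math.Min(200, html.Length));
                throw new InvalidOperationException(
                    "Scrape failed: nothing matched pattern '" + xpath +
                    "'. Page began with: " + sample);
            }
            return rows;
        }
    }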
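For the testing point, a scraper shaped like the sketch above is easy to exercise non-destructively against a saved copy of the target page. A hypothetical NUnit test, assuming a fixture file checked in next to the tests:

    using System.IO;
    using NUnit.Framework;

    [TestFixture]
    public class ScraperTests
    {
        [Test]
        public void ExtractRows_FindsRows_InSavedCopyOfTargetPage()
        {
            // A snapshot of the target page saved to disk (hypothetical path),
            // so the test never hits the live site.
            string html = File.ReadAllText(Path.Combine("fixtures", "results-page.html"));

            var rows = PatternDrivenScraper.ExtractRows(html);

            Assert.That(rows, Is.Not.Empty);
        }
    }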
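And if you go the browser-automation route, a WatiN session looks roughly like this. The URL, field names, and button caption are placeholders; note that WatiN drives Internet Explorer and requires an STA thread:

    using WatiN.Core;

    class WatinSketch
    {
        [System.STAThread] // WatiN's IE automation requires a single-threaded apartment
        static void Main()
        {
            // Hypothetical login form; all of the names here are assumptions.
            using (var browser = new IE("http://example.com/login"))
            {
                browser.TextField(Find.ByName("username")).TypeText("me");
                browser.TextField(Find.ByName("password")).TypeText("secret");
                browser.Button(Find.ByValue("Log in")).Click();

                // Scrape what the browser actually rendered, client-side script included.
                string renderedHtml = browser.Html;
            }
        }
    }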
As for how to screen scrape in C#, you can either use WatiN (see above) and scrape the resulting document using its DOM, or you can use the WebClient class (see MSDN) to get at the raw HTTP response, including the HTML content, and then use some sort of text-based analysis to extract the data you want.
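A minimal WebClient fetch might look like this (the URL is a placeholder); everything after DownloadString is ordinary string handling or parser work:

    using System;
    using System.Net;

    class WebClientSketch
    {
        static void Main()
        {
            using (var client = new WebClient())
            {
                // Downloads the raw HTML of the HTTP response as a string.
                string html = client.DownloadString("http://example.com/");
                Console.WriteLine(html.Length);
            }
        }
    }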
Use the Html Agility Pack. It handles poorly formed and malformed HTML, and it lets you query with XPath, making it very easy to find the data you're looking for. DON'T write a parser by hand and DON'T use regular expressions; they're just too clumsy.
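As a quick sketch of what that looks like (the markup here is deliberately sloppy, with unquoted attributes and unclosed tags, to show that it still parses):

    using System;
    using HtmlAgilityPack;

    class AgilityPackSketch
    {
        static void Main()
        {
            // Malformed on purpose: unquoted attributes, unclosed <html>/<body>.
            string html = "<html><body><div class=item>first</div><div class=item>second</div>";

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // The XPath expression is the only thing to touch when the site changes.
            foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div[@class='item']"))
                Console.WriteLine(div.InnerText);
        }
    }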