Screen Scraping HTML with C# [closed]

I have been given the task at work of screen scraping one of our legacy web apps to extract certain data from its markup. The data is formatted and "should" be displayed exactly the same every time. I am just not sure how to go about doing this. It's a full HTML file with header and footer navigation, but the data I need is in the middle of all this.

I need to extract the Company Name value, Contact Name, Telephone, email address, etc.

Here is an example of what the code looks like:

...html above here

<br /><br />
<table cellpadding="0" cellspacing="12" border="0">
    <tr>
        <td valign="top" align="center">
            <!-- Company Info -->

            <table cellpadding="0" cellspacing="0" border="0">
                <tr>
                    <td class="black">
                        <table cellspacing="1" cellpadding="0" border="0" width="370">
                            <tr>
                                <th>ABC INDUSTRIES</th>
                            </tr>
                            <tr>
                                <td class="search">

                                    <table cellpadding="5" cellspacing="0" border="0" width="100%">
                                        <tr>
                                            <td>
                                                <table cellpadding="1" cellspacing="0" border="0" width="100%">
                                                    <tr>
                                                        <td align="center" colspan="2"><hr></td>
                                                    </tr>
                                                    <tr>
                                                        <td align="right" nowrap><b><font color="FF0000">Contact Person&nbsp;<img src="/images/icon_contact.gif" align="absmiddle">&nbsp;:</font></b></td>
                                                        <td align="left" width="100%">&nbsp;Joe Smith</td>
                                                    </tr>
                                                    <tr>
                                                        <td align="right" nowrap><b><font color="FF0000">Phone Number&nbsp;<img src="/images/icon_phone.gif" align="absmiddle">&nbsp;:</font></b></td>
                                                        <td align="left" width="100%">&nbsp;555-555-5555</td>
                                                    </tr>
                                                    <tr>
                                                        <td align="right" nowrap><b><font color="FF0000">E-mail Address&nbsp;<img src="/images/icon_email.gif" align="absmiddle">&nbsp;:</font></b></td>
                                                        <td align="left" width="100%">&nbsp;<a HREF="mailto:[email protected]">[email protected]</a></td>
                                                    </tr>
                                                    more...

There is more code on the screen in a different table structure that I also need to pull.

asked Jan 03 '11 19:01

WildBill

2 Answers

Are you just looking for suggestions on how to accomplish this? The HTML Agility Pack is probably going to be your best bet for DOM parsing in general. There may be a good bit of tinkering and trial and error to maintain your screen scrape (there usually is for that sort of thing), but that library is pretty good for parsing HTML.
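As a rough sketch of what that could look like against the markup in the question: the class and method names below (`CompanyInfoScraper`, `CompanyName`, `ValueFor`) are made up for illustration, and the XPath expressions assume the page keeps the structure shown (company name in the `<th>` under `td class="black"`, each label cell followed by its value cell). It needs the HtmlAgilityPack NuGet package.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack"

// Hypothetical helper, not part of the library.
public static class CompanyInfoScraper
{
    // Assumes the company name is the <th> inside the <td class="black"> block.
    public static string CompanyName(HtmlDocument doc)
    {
        var th = doc.DocumentNode.SelectSingleNode("//td[@class='black']//th");
        return th == null ? null : Clean(th.InnerText);
    }

    // Finds the innermost <td> containing the label text and returns
    // the text of the <td> right after it. Returns null when nothing matches.
    public static string ValueFor(HtmlDocument doc, string label)
    {
        var cell = doc.DocumentNode.SelectSingleNode(
            "//td[not(.//td)][contains(., '" + label + "')]/following-sibling::td[1]");
        return cell == null ? null : Clean(cell.InnerText);
    }

    // Decodes entities such as &nbsp; and trims the leftover whitespace.
    private static string Clean(string raw)
    {
        return HtmlEntity.DeEntitize(raw).Replace('\u00A0', ' ').Trim();
    }

    public static void Main()
    {
        // Cut-down version of the markup from the question.
        const string html = @"
<table><tr><td class=""black""><table>
  <tr><th>ABC INDUSTRIES</th></tr>
  <tr><td class=""search""><table>
    <tr><td align=""right"" nowrap><b><font color=""FF0000"">Contact Person&nbsp;:</font></b></td>
        <td align=""left"">&nbsp;Joe Smith</td></tr>
    <tr><td align=""right"" nowrap><b><font color=""FF0000"">Phone Number&nbsp;:</font></b></td>
        <td align=""left"">&nbsp;555-555-5555</td></tr>
  </table></td></tr>
</table></td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        Console.WriteLine(CompanyName(doc));
        Console.WriteLine(ValueFor(doc, "Contact Person"));
        Console.WriteLine(ValueFor(doc, "Phone Number"));
    }
}
```

Matching on the label text rather than on table position means the lookup should survive rows being added or reordered, but it will still break if the labels themselves change.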

Technically, any XML parsing (even native LINQ to XML) should do the trick, but websites have a nasty habit of not being well-formed so you may run into small headaches here and there.
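To make the well-formedness problem concrete, here is a small sketch that feeds fragments in the style of the question's page to LINQ to XML. The `WellFormedCheck` class and `TryParse` method are illustrative names only; the calls used are `XDocument.Parse` and `XmlException` from the framework.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration.
public static class WellFormedCheck
{
    // Returns null if the markup parses as XML, otherwise the parser's error message.
    public static string TryParse(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }

    public static void Main()
    {
        // A well-formed fragment parses without complaint.
        Console.WriteLine(TryParse("<table><tr><td>Joe Smith</td></tr></table>") ?? "ok");

        // Each of these patterns appears in the question's HTML and is rejected by an XML parser.
        string[] broken =
        {
            "<td><hr></td>",                     // <hr> is never closed
            "<td align=\"right\" nowrap>x</td>", // attribute without a value
            "<td>&nbsp;Joe Smith</td>"           // &nbsp; is not a predefined XML entity
        };
        foreach (var fragment in broken)
        {
            Console.WriteLine(TryParse(fragment) ?? "ok");
        }
    }
}
```

All three broken patterns show up in the sample markup, so a plain XML parser would likely need a cleanup pass first, which is the job an HTML-aware parser does for you.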

answered Sep 28 '22 04:09

David

In recent projects, I successfully used the WebRequest and related classes to download the HTML from a URL, and then the SgmlReader parser to actually get access to the structured content.
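A minimal sketch of that combination, assuming the SgmlReader package (namespace `Sgml`) is referenced. The `SgmlScraper` class and its method names are invented for this example, and the reader settings are one plausible configuration rather than the only one. Note that `WebRequest` is marked obsolete in .NET 6 and later in favor of `HttpClient`, so this fits older framework versions best.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Xml;
using System.Xml.Linq;
using Sgml; // third-party SgmlReader package

// Hypothetical helper for illustration.
public static class SgmlScraper
{
    // Runs HTML text through SgmlReader so it can be loaded as XML.
    public static XDocument ParseHtml(TextReader html)
    {
        using (var sgml = new SgmlReader())
        {
            sgml.DocType = "HTML";                  // use the built-in HTML DTD
            sgml.CaseFolding = CaseFolding.ToLower; // normalize tag names to lower case
            sgml.WhitespaceHandling = WhitespaceHandling.None;
            sgml.InputStream = html;
            return XDocument.Load(sgml);
        }
    }

    // Downloads a page and parses it. The URL is supplied by the caller.
    public static XDocument Download(string url)
    {
        var request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return ParseHtml(reader);
        }
    }

    public static void Main()
    {
        // Parse a local string here so the sketch runs without network access.
        const string html =
            "<html><body><table><tr><th>ABC INDUSTRIES</th></tr></table></body></html>";
        XDocument doc = ParseHtml(new StringReader(html));
        Console.WriteLine(doc.Descendants("th").First().Value);
    }
}
```

Once the page is an `XDocument`, the usual LINQ to XML queries apply, so the malformed-markup handling stays isolated in `ParseHtml`.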

answered Sep 28 '22 05:09

Uwe Keim
