
Screen Scraping HTML with C# [closed]

I have been given the task at work of screen scraping one of our legacy web apps to extract certain data from its HTML. The data is formatted and "should" be displayed exactly the same way every time, but I am not sure how to go about this. Each page is a full HTML file with header and footer navigation, and the data I need sits in the middle of it all.

I need to extract the company name, contact name, telephone number, email address, etc.

Here is an example of what the HTML looks like:

...html above here

<br /><br />
<table cellpadding="0" cellspacing="12" border="0">
    <tr>
        <td valign="top" align="center">
            <!-- Company Info -->

            <table cellpadding="0" cellspacing="0" border="0">
                <tr>
                    <td class="black">
                        <table cellspacing="1" cellpadding="0" border="0" width="370">
                            <tr>
                                <th>ABC INDUSTRIES</th>
                            </tr>
                            <tr>
                                <td class="search">

                                    <table cellpadding="5" cellspacing="0" border="0" width="100%">
                                        <tr>
                                            <td>
                                                <table cellpadding="1" cellspacing="0" border="0" width="100%">
                                                    <tr>
                                                        <td align="center" colspan="2"><hr></td>
                                                    </tr>
                                                    <tr>
                                                        <td align="right" nowrap><b><font color="FF0000">Contact Person&nbsp;<img src="/images/icon_contact.gif" align="absmiddle">&nbsp;:</font></b></td>
                                                        <td align="left" width="100%">&nbsp;Joe Smith</td>
                                                    </tr>
                                                    <tr>
                                                        <td align="right" nowrap><b><font color="FF0000">Phone Number&nbsp;<img src="/images/icon_phone.gif" align="absmiddle">&nbsp;:</font></b></td>
                                                        <td align="left" width="100%">&nbsp;555-555-5555</td>
                                                    </tr>
                                                    <tr>
                                                        <td align="right" nowrap><b><font color="FF0000">E-mail Address&nbsp;<img src="/images/icon_email.gif" align="absmiddle">&nbsp;:</font></b></td>
                                                        <td align="left" width="100%">&nbsp;<a HREF="mailto:[email protected]">[email protected]</a></td>
                                                    </tr>
                                                    more...

There is more HTML on the page, in a different table structure, that I also need to pull.

WildBill asked Jan 03 '11 19:01


2 Answers

Are you just looking for suggestions on how to accomplish this? The HTML Agility Pack is probably your best bet for DOM parsing in general. Maintaining a screen scraper usually takes a good bit of tinkering and trial and error, but that library is pretty good at parsing HTML.

Technically, any XML parser (even native LINQ to XML) should do the trick, but websites have a nasty habit of not being well-formed, so you may run into small headaches here and there.
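
As a rough illustration of the HTML Agility Pack approach, here is a minimal sketch written against the sample markup in the question. The class name `CompanyInfoScraper`, the `Parse` helper, and the `"Company Name"` key are hypothetical, and the XPath expressions assume the real pages always use the `td class="black"` / `td class="search"` layout with right-aligned label cells and left-aligned value cells. Treat it as a starting point rather than a finished scraper.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET

static class CompanyInfoScraper
{
    // Hypothetical helper: returns label -> value pairs read from the page.
    public static Dictionary<string, string> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of markup that is not well-formed XML

        var fields = new Dictionary<string, string>();

        // Assumption: the company name is the <th> inside td.black.
        HtmlNode header = doc.DocumentNode.SelectSingleNode("//td[@class='black']//th");
        if (header != null)
            fields["Company Name"] = Clean(header.InnerText);

        // Assumption: each data row pairs a right-aligned label cell
        // with a left-aligned value cell.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(
            "//td[@class='search']//tr[td[@align='right'] and td[@align='left']]");
        if (rows == null)
            return fields; // SelectNodes returns null when nothing matches

        foreach (HtmlNode row in rows)
        {
            string label = Clean(row.SelectSingleNode("td[@align='right']").InnerText);
            string value = Clean(row.SelectSingleNode("td[@align='left']").InnerText);
            fields[label.TrimEnd(':', ' ')] = value; // drop the trailing " :"
        }
        return fields;
    }

    // Decodes entities such as &nbsp; and trims surrounding whitespace.
    static string Clean(string text)
    {
        return HtmlEntity.DeEntitize(text).Replace('\u00A0', ' ').Trim();
    }
}
```

Against the sample above, this should produce entries keyed by "Contact Person", "Phone Number", and "E-mail Address". The other table structure mentioned in the question would need its own XPath expressions.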

David answered Sep 28 '22 04:09


In recent projects, I successfully used WebRequest and related classes to download the HTML from a URL, and then the SgmlReader parser to get access to the structured content.
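
A minimal sketch of that combination might look like the following. The class name `LegacyPageLoader` and the `Load` helper are hypothetical, the code assumes the SgmlReader library (namespace `Sgml`) is referenced, and it assumes the page is encoded in a way `StreamReader` detects correctly. Error handling is omitted.

```csharp
using System.IO;
using System.Net;
using System.Xml;
using Sgml; // SgmlReader, a third-party library (not part of .NET)

static class LegacyPageLoader
{
    // Hypothetical helper: downloads a page and returns it as an XmlDocument.
    public static XmlDocument Load(string url)
    {
        WebRequest request = WebRequest.Create(url);

        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        using (var sgmlReader = new SgmlReader())
        {
            // Have SgmlReader treat the input as HTML and present it as XML.
            sgmlReader.DocType = "HTML";
            sgmlReader.CaseFolding = CaseFolding.ToLower; // normalise tag names
            sgmlReader.InputStream = reader;

            var doc = new XmlDocument();
            doc.XmlResolver = null; // do not fetch external DTDs
            doc.Load(sgmlReader);
            return doc;
        }
    }
}
```

Once loaded, the document can be queried with ordinary XPath, for example `doc.SelectSingleNode("//th")` for the company name header in the sample markup.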

Uwe Keim answered Sep 28 '22 05:09