I have been given the task at work of screen scraping one of our legacy web apps to extract certain data from the code. The data is formatted and "should" be displayed exactly the same every time. I am just not sure how to go about doing this. It's a full html file with header and footer navigations but in the middle of all this is the data I need.
I need to extract the Company Name value, Contact Name, Telephone, email address, etc.
Here is an example of what the code looks like:
...html above here
<br /><br />
<table cellpadding="0" cellspacing="12" border="0">
<tr>
<td valign="top" align="center">
<!-- Company Info -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>ABC INDUSTRIES</th>
</tr>
<tr>
<td class="search">
<table cellpadding="5" cellspacing="0" border="0" width="100%">
<tr>
<td>
<table cellpadding="1" cellspacing="0" border="0" width="100%">
<tr>
<td align="center" colspan="2"><hr></td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Contact Person <img src="/images/icon_contact.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> Joe Smith</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Phone Number <img src="/images/icon_phone.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 555-555-5555</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">E-mail Address <img src="/images/icon_email.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> <a HREF="mailto:[email protected]">[email protected]</a></td>
</tr>
more...
There is more code on the screen in a different table structure that I also need to pull.
Are you just looking for suggestions on how to accomplish this? The HTML Agility Pack is probably going to be your best bet for DOM parsing in general. There may be a good bit of tinkering and trial and error to maintain your screen scrape (there usually is for that sort of thing), but that library is pretty good for parsing HTML.
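For a concrete picture of the DOM-parsing approach, here is a minimal sketch. It is written in Python with the standard-library `html.parser` rather than the HTML Agility Pack (which is a .NET library), purely so it runs without extra packages; the idea should carry over. The names `ContactTableParser` and `parse_contact` and the trimmed `SAMPLE` markup are my own illustrations. The sketch assumes each label/value pair sits in an innermost two-cell row whose first cell ends with a colon, as in the question's markup.

```python
from html.parser import HTMLParser

# Trimmed-down stand-in for the question's markup (illustrative only).
SAMPLE = """
<table><tr><td class="black">
<table>
<tr><th>ABC INDUSTRIES</th></tr>
<tr><td align="center" colspan="2"><hr></td></tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Contact Person <img src="/images/icon_contact.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> Joe Smith</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Phone Number <img src="/images/icon_phone.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 555-555-5555</td>
</tr>
</table>
</td></tr></table>
"""


class ContactTableParser(HTMLParser):
    """Collects the first <th> text and label/value pairs from two-cell rows."""

    def __init__(self):
        super().__init__()
        self.company = None
        self.fields = {}
        self._cells = None  # cell texts of the row currently being read
        self._buf = None    # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []   # a nested <tr> simply restarts the row
        elif tag in ("td", "th"):
            self._buf = []     # a nested cell simply restarts the buffer

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._buf is not None:
            # Collapse runs of whitespace left behind by <img>, <font>, etc.
            text = " ".join("".join(self._buf).split())
            self._buf = None
            if tag == "th" and self.company is None and text:
                self.company = text
            elif self._cells is not None:
                self._cells.append(text)
        elif tag == "tr" and self._cells is not None:
            # Assumption: a data row is exactly "Label :" followed by a value.
            if len(self._cells) == 2 and self._cells[0].endswith(":"):
                label = self._cells[0].rstrip(": ")
                self.fields[label] = self._cells[1]
            self._cells = None


def parse_contact(html):
    """Return (company_name, {label: value}) for markup shaped like SAMPLE."""
    parser = ContactTableParser()
    parser.feed(html)
    parser.close()
    return parser.company, parser.fields


company, fields = parse_contact(SAMPLE)
print(company)  # ABC INDUSTRIES
print(fields)   # {'Contact Person': 'Joe Smith', 'Phone Number': '555-555-5555'}
```

Keying on the label text rather than on row position is a deliberate choice here: if the legacy app adds or reorders rows, lookups by label keep working, which cuts down on the tinkering mentioned above.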
Technically, any XML parsing (even native LINQ to XML) should do the trick, but websites have a nasty habit of not being well-formed so you may run into small headaches here and there.
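To illustrate the well-formedness problem, here is a small sketch in Python, where `xml.etree.ElementTree` stands in for a strict XML parser such as LINQ to XML (I am assuming the failure mode is comparable) and `html.parser` stands in for a tolerant HTML parser. The row is modeled on the question's markup: the bare `nowrap` attribute and the unclosed `<img>` are both legal in old HTML but not in XML. The helper names are mine.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# One row in the style of the question's markup: bare attribute, unclosed <img>.
ROW = ('<tr><td align="right" nowrap><b>Phone Number '
       '<img src="/images/icon_phone.gif"> :</b></td>'
       '<td> 555-555-5555</td></tr>')

# The same row rewritten by hand so that it is well-formed XML.
WELL_FORMED_ROW = ('<tr><td align="right" nowrap="nowrap"><b>Phone Number '
                   '<img src="/images/icon_phone.gif" /> :</b></td>'
                   '<td> 555-555-5555</td></tr>')


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Tolerant parser that just gathers the non-blank text fragments."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def text_chunks(markup):
    """Return the text fragments found by the tolerant HTML parser."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


print(parses_as_xml(ROW))              # False
print(parses_as_xml(WELL_FORMED_ROW))  # True
print(text_chunks(ROW))                # ['Phone Number', ':', '555-555-5555']
```

If you do want to use an XML API on a page like this, the usual workaround is to run the markup through an HTML-aware parser or tidy step first and query the cleaned-up result.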
In recent projects, I successfully used WebRequest and related classes to download the HTML from a URL, and then the SgmlReader parser to actually get access to the structured content.
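The download-then-parse pipeline can be sketched roughly like this. WebRequest and SgmlReader are .NET components, so this is a Python stand-in using only the standard library (`urllib.request` for the download, `html.parser` for the parsing), not the code from those projects. The function names, the `FirstHeaderCell` class, and the User-Agent string are made up for illustration, and I am assuming the company name is the first non-empty `<th>` on the page, as in the question's markup.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10.0):
    """Download a page and return its body decoded to text."""
    # The User-Agent value is a hypothetical placeholder.
    request = Request(url, headers={"User-Agent": "legacy-scrape-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class FirstHeaderCell(HTMLParser):
    """Remembers the text of the first non-empty <th> cell."""

    def __init__(self):
        super().__init__()
        self.text = None
        self._buf = None  # text fragments of the <th> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "th" and self.text is None:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "th" and self._buf is not None:
            # Collapse whitespace; an empty cell leaves self.text as None.
            self.text = " ".join("".join(self._buf).split()) or None
            self._buf = None


def scrape_company_name(url):
    """Stage 1: download the page. Stage 2: parse the downloaded text."""
    parser = FirstHeaderCell()
    parser.feed(fetch_html(url))
    parser.close()
    return parser.text
```

You would call `scrape_company_name(...)` with the address of the legacy page. Keeping the download and the parsing in separate functions means you can save a page to disk once and test the parsing against that copy instead of hitting the app on every run.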