Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Retrieve data from a website via Visual Basic

There is this website that we purchase widgets from that provides details for each of their parts on its own webpage. Example: http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND. I have to find all of their parts that are in our database, and add Manufacturer and Manufacturer Part Number values to their fields.

I was told that there is a way for Visual Basic to access a webpage and extract information. If someone could point me in the right direction on where to start, I'm sure I can figure this out.

Thanks.

like image 253
Jackery Xu Avatar asked Feb 13 '13 17:02

Jackery Xu


1 Answers

How to scrape a website using HTMLAgilityPack (VB.Net)

I agree that htmlagilitypack is the easiest way to accomplish this. It is less error prone than just using Regex. The following will be how I deal with scraping.

After downloading htmlagilitypack*dll, create a new application, add htmlagilitypack via nuget, and reference to it. If you can use Chrome, it will allow you to inspect the page to get information about where your information is located. Right-click on a value you wish to capture and look for the table that it is found in (follow the HTML up a bit).

The following example will extract all the values from that page within the "pricing" table. We need to know the XPath value for the table (this value is used to instruct htmlagilitypack on what to look for) so that the document we create looks for our specific values. This can be achieved by finding whatever structure your values are in and right click copy XPath. From this we get...

//*[@id="pricing"]

Please note that sometimes the XPath you get from Chrome may be rather large. You can often simplify it by finding something unique about the table your values are in. In this example it is "id", but in other situations, it could easily be headings or class or whatever.

This XPath value looks for something with the id equal to pricing, that is our table. When we look further in, we see that our values are within tbody,tr and td tags. HtmlAgilitypack doesn't work well with the tbody so ignore it. Our new XPath is...

//*[@id='pricing']/tr/td

This XPath says look for the pricing id within the page, then look for text within its tr and td tags. Now we add the code...

Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As New HtmlAgilityPack.HtmlDocument
Doc = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")

Next

To extract the values we simply reference our table value that was created in our loop and it's innertext member.

Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As New HtmlAgilityPack.HtmlDocument
Doc = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
    MsgBox(table.InnerText)
Next

Now we have message boxes that pop up the values...you can switch the message box for an arraylist to fill or whatever way you wish to store the values. Now simply do the same for whatever other tables you wish to get.

Please note that the Doc variable that was created is reusable, so if you wanted to cycle through a different table in the same page, you do not have to reload the page. This is a good idea especially if you are making many requests, you don't want to slam the website, and if you are automating a large number of scrapes, it puts some time between requests.

Scraping is really that easy. That's is the basic idea. Have fun!

like image 136
MonkeyDoug Avatar answered Nov 15 '22 12:11

MonkeyDoug