Aim
I am looking to scrape 20/20 cricket scorecard data from the Cricinfo website, ideally into CSV form for data analysis in Excel
As an example the current Australian Big Bash 2011/12 scorecards are available from
Background
I am proficient in using VBA (either automating IE
or using XMLHTTP
and then using regular expressions) to scrape data from websites, ie Extract values from HTML TD and Tr
In that same question a comment was posted suggesting html parsing - which I hadn't come accross before - so I have taken a look at questions such as RegEx match open tags except XHTML self-contained tags
Query
While I could write a regex to parse the cricket data below I would like advice as to how I could efficiently retrieve these results with html parsing.
Please bear in mind that my preference is a repeatable CSV format containing:
Nirvana for me would be a solution that I could deploy using VBA or VBscript so I could fully automate my analysis, but I presume I will have to use a separate tool for the html parse.
Sample Site links and Data to be Extracted
There are 2 techniques that I use for "VBA". I will describe them 1 by one.
1) Using FireFox / Firebug Addon / Fiddler
2) Using Excel's inbuilt facility to get data from the web
Since this post will be read by many so I will even cover the obvious. Please feel free to skip whatever part you know
1) Using FireFox / Firebug Addon / Fiddler
FireFox : http://en.wikipedia.org/wiki/Firefox Free download (http://www.mozilla.org/en-US/firefox/new/)
Firebug Addon: http://en.wikipedia.org/wiki/Firebug_%28software%29 Free download (https://addons.mozilla.org/en-US/firefox/addon/firebug/)
Fiddler : http://en.wikipedia.org/wiki/Fiddler_%28software%29 Free download (http://www.fiddler2.com/fiddler2/)
Once you have installed Firefox, install the Firebug Addon. The Firebug Addon lets you inspect the different elements in a webpage. For example if you want to know the name of a button, simply right click on it and click on "Inspect Element with Firebug" and it will give you all the details that you will need for that button.
Another example would be finding the name of a table on a website which has the data that you need scrapped.
I use Fiddler only when I am using XMLHTTP. It helps me to see the exact info being passed when you click on a button. Because of the increase in the number of BOTS which scrape the sites, most sites now, to prevent automatic scrapping, capture your mouse coordinates and pass that information and fiddler actually helps you in debugging that info that is being passed. I will not get into much details here about it as this info can be used maliciously.
Now let's take a simple example on how to scrape the URL posted in your question
http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
First let's find the name of the table which has that info. Simply right click on the table and click on "Inspect Element with Firebug" and it will give you the below snapshot.
So now we know that our data is stored in a table called "inningsBat1" If we can extract the contents of that table to an Excel file then we can definitely work with the data to do our analysis. Here is sample code which will dump that table in Sheet1
Before we proceed, I would recommend, closing all Excel and starting a fresh instance.
Launch VBA and insert a Userform. Place a command button and a webcrowser control. Your Userform might look like this
Paste this code in the Userform code area
Option Explicit '~~> Set Reference to Microsoft HTML Object Library Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long) Private Sub CommandButton1_Click() Dim URL As String Dim oSheet As Worksheet Set oSheet = Sheets("Sheet1") URL = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html" PopulateDataSheets oSheet, URL MsgBox "Data Scrapped. Please check " & oSheet.Name End Sub Public Sub PopulateDataSheets(wsk As Worksheet, URL As String) Dim tbl As HTMLTable Dim tr As HTMLTableRow Dim insertRow As Long, Row As Long, col As Long On Error GoTo whoa WebBrowser1.navigate URL WaitForWBReady Set tbl = WebBrowser1.Document.getElementById("inningsBat1") With wsk .Cells.Clear insertRow = 0 For Row = 0 To tbl.Rows.Length - 1 Set tr = tbl.Rows(Row) If Trim(tr.innerText) <> "" Then If tr.Cells.Length > 2 Then If tr.Cells(1).innerText <> "Total" Then insertRow = insertRow + 1 For col = 0 To tr.Cells.Length - 1 .Cells(insertRow, col + 1) = tr.Cells(col).innerText Next End If End If End If Next End With whoa: Unload Me End Sub Private Sub Wait(ByVal nSec As Long) nSec = nSec + Timer While Timer < nSec DoEvents Sleep 100 Wend End Sub Private Sub WaitForWBReady() Wait 1 While WebBrowser1.ReadyState <> 4 Wait 3 Wend End Sub
Now run your Userform and click on the Command button. You will notice that the data is dumped in Sheet1. See snapshot
Similarly you can scrape other info as well.
2) Using Excel's inbuilt facility to get data from the web
I believe you are using Excel 2007 so I will take that as an example to scrape the above mentioned link.
Navigate to Sheet2. Now navigate to Data Tab and click on the button "From Web" on the extreme right. See snapshot.
Enter the url in the "New Web Query Window" and click on "Go"
Once the page is uploaded, select the relevant table that you want to import by clicking on the small arrow as shown in the snapshot. Once done, click on "Import"
Excel will then ask you where you want the data to be imported. Select the relevant cell and click on OK. And you are done! The data will be imported to the cell which you specified.
If you wish you can record a macro and automate this as well :)
Here is the macro that I recorded.
Sub Macro1() With ActiveSheet.QueryTables.Add(Connection:= _ "URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html" _ , Destination:=Range("$A$1")) .Name = "524915" .FieldNames = True .RowNumbers = False .FillAdjacentFormulas = False .PreserveFormatting = True .RefreshOnFileOpen = False .BackgroundQuery = True .RefreshStyle = xlInsertDeleteCells .SavePassword = False .SaveData = True .AdjustColumnWidth = True .RefreshPeriod = 0 .WebSelectionType = xlSpecifiedTables .WebFormatting = xlWebFormattingNone .WebTables = """inningsBat1""" .WebPreFormattedTextToColumns = True .WebConsecutiveDelimitersAsOne = True .WebSingleBlockTextImport = False .WebDisableDateRecognition = False .WebDisableRedirections = False .Refresh BackgroundQuery:=False End With End Sub
Hope this helps. Let me know if you still have some queries.
Sid
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With