What type of HTML table is this and what type of webscraping techniques can you use? [closed]

I am trying to extract data from this link, http://www.rchsd.org/doctors/index.htm?strt=0&ln=&fn=&sp=&grp=&loc=&lng=&gen=, with R, but it is rather difficult.

I notice that the URL does not change whenever I click on a page number. Is this table created with JavaScript? Is the table created by some external source, and how can I get access to it? Also, is there a technical name for this type of table?

Also, for anyone who knows web scraping with R or any other program, how would you extract all the data from this table? I tried using the following code in R to extract the data, but I get NULL. How would you address this issue?

library(XML)  # provides htmlParse() and readHTMLTable()

mps <- "http://www.va.gov/providerinfo/SANDIEGO/index.asp?servicesearch=&specialtysearch=&gendersearch=&sort=&currentPage=1"
mps.doc <- htmlParse(mps)
mps.tabs <- readHTMLTable(mps.doc)

Also, if you cannot answer the second half of my question, that is okay. I mainly want to know the answer to the first half of my question.

mtber75 asked Dec 24 '12 23:12



1 Answer

Answer Revised with 3 different techniques, all .ajax() and YQL based.

Technique 1

Reference HTML: http://doctors.ucsd.edu/?index=1

For the first part of your question, the type of table in the URL you provided is a standard HTML Table Model variety. In creating that table, the website uses an XML File to populate its rows and columns with data, including the photo of the doctor.

To keep the servers happy, not all of the data from the XML File is loaded into the browser; only limited results are shown, with the option to proceed to the next page.

This is also true for the URL in the comments section you wrote about (i.e. http://doctors.ucsd.edu/?index=1), where the visitor can select 10, 25 or 50 results from the webpage's Results Per Page dropdown menu. The browser's address bar will then show the number requested, for example &setsize=25.

Although you may want to data scrape that reference URL, it's best not to since you already have the XML file with all the data you need. It's less work to access it directly!

Reference XML: http://www.rchsd.org/api/physdir/

The second part of your question is easy enough since the XML File is readily available. This time around, when you data scrape that reference XML File, it will show the information you're looking for quickly and in a very readable form too.

I've limited the request to 5 results for testing purposes in both data scraping queries above, but you can increase that to a larger sampling value. The amount of extra webpage data in the 1st example would require the use of XPATH to map out nodes and require extra processing to use that data.

I've prepared a detailed jsFiddle which should explain a lot of your questions about this process. In it, I explain how to use YQL, .ajax(), and the link for the XML File.


Reference Example:

$.ajax({
    type: 'GET',
    url: 'http://query.yahooapis.com/v1/public/yql?q=SELECT%20phys%20FROM%20xml%20WHERE%20url%3D%22http%3A%2F%2Fwww.rchsd.org%2Fapi%2Fphysdir%2F%22%20LIMIT%205',
    dataType: 'xml',
    success: function(data) {
        var dataResults = $(data).find('results');
        console.log(dataResults);
    }
});
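
The long url value in the request above is just the readable YQL statement, percent-encoded and appended to the public endpoint. A minimal sketch of that step, assuming the same endpoint as the request above; the helper name buildYqlUrl is hypothetical and is not part of jQuery or YQL:

```javascript
// Hypothetical helper: turns a readable YQL statement into the request URL.
// The endpoint is the one used in the $.ajax() call above.
const YQL_ENDPOINT = 'http://query.yahooapis.com/v1/public/yql';

function buildYqlUrl(statement) {
    // encodeURIComponent escapes spaces, quotes, '=', ':' and '/' so the
    // whole statement travels as a single query-string value.
    return YQL_ENDPOINT + '?q=' + encodeURIComponent(statement);
}

const xmlStatement = 'SELECT phys FROM xml WHERE url="http://www.rchsd.org/api/physdir/" LIMIT 5';
const xmlRequestUrl = buildYqlUrl(xmlStatement);
console.log(xmlRequestUrl);
```

Writing the statement in readable form first makes it easier to change the LIMIT value or the tag name without hand-editing escape sequences.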

Reference Tutorial:
jsFiddle Data Scraping XML Demo (See below for jsFiddle HTML Demo)


Technique 2

EDIT: Returning to original reference HTML: http://doctors.ucsd.edu/?index=1

The last thing I wrote in the first section is actually not true, as you don't necessarily have all the data you need. While you can create your own Google Map Location Data from the doctor's physical address in the XML File, that information is already available to use.

That URL also turns out to contain a uniquely formatted Thumbnail Image and, when available, a Doctors Information section.

So then, what follows is a re-written jsFiddle that shows how to data scrape that HTML webpage. You'll note in this new jsFiddle that the YQL Statement is no longer SELECT phys FROM xml since we are now dealing with an HTML document. Also, we are going to use the wild card * and not the tagname phys in that YQL Statement. It will then be SELECT * FROM html

As you'll remember from the first data scraping method above, too much data was returned from that request. I'll explain how to add an XPATH to that YQL Statement so you only get the desired data back.

Where to start, you ask? At that website in your browser! I'll use Firefox to continue.

First, let's force 5 results to be returned in our tests. To do this, change Results Per Page to 25, then at the browser bar change the 25 to 5 for &setsize= query. Hit enter on your keyboard to apply changes.

Using the webpage's Additional Search Criteria, Show More Specialties, Location, and Sort results by: options will also modify the browser's address bar and further create a customized URL to use.

For our Demo, we need just 1 additional customization of Sort results by: Last Name A-Z. Reload the webpage if need be, and to be sure... our customized URL should look like:

http://doctors.ucsd.edu/?sortby=familyName&sortDirection=asc&setsize=5
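
The same customized URL can also be assembled in code rather than by editing the address bar. This is only a sketch: the function name buildDirectoryUrl is hypothetical, and the parameter names are simply the ones visible in the URL above.

```javascript
// Hypothetical helper: builds the customized directory URL from named options.
function buildDirectoryUrl(options) {
    const url = new URL('http://doctors.ucsd.edu/');
    // Query parameters are appended in the order the options are listed.
    for (const [name, value] of Object.entries(options)) {
        url.searchParams.set(name, String(value));
    }
    return url.toString();
}

const directoryUrl = buildDirectoryUrl({
    sortby: 'familyName',
    sortDirection: 'asc',
    setsize: 5
});
console.log(directoryUrl);
```

Changing setsize here has the same effect as changing the number in the address bar by hand.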

Now that the webpage is populated with our requested 5 results, we need to see how the layout is supporting those items.

Use Firefox Inspect Element tool via right-click on your mouse to view and learn the table layout structure. Soon, you will see that all the results returned are enclosed in a unique class name.

[Screenshot: Firefox Inspect Element tool showing the results layout]

When popping up the HTML Panel via the icon at the bottom of the Inspect Element tools (to the right side of the Inspect Element Icon), you can see how the layout is built for a single doctor's box:

[Screenshot: HTML Panel showing the markup for a single doctor's box]

In the photo above, you can visually traverse up the DOM to see that the main classname resultsList belongs to the div holding the requested 5 results. That classname can be used, but a more refined classname to use is resultsListProvider, which each returned item carries.

You now have the required information to construct a YQL Statement to use. First, here's the minimum we will use to get started:

SELECT * FROM html WHERE url="http://doctors.ucsd.edu/?sortby=familyName&sortDirection=asc&setsize=5"

The above will not do since it returns too much nonessential webpage data; that's why we used Inspect Element to discover what's really important. That being said, we will use XPATH to access the part of the webpage we need via the classname resultsListProvider.

xpath="//div[@class='resultsListProvider']"
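
One caveat with the expression above: in XPath, @class='resultsListProvider' only matches when the class attribute is exactly that string. If the div ever carried additional class names, the standard XPath 1.0 idiom below would still match it. This is a sketch only; the helper name classXpath is hypothetical, and it is an assumption that the service evaluating the XPATH accepts these standard functions.

```javascript
// Hypothetical helper: builds an XPath that matches an element by one of its
// class names, even when the class attribute lists several names.
function classXpath(tagName, className) {
    // Padding both sides with spaces prevents partial matches such as
    // 'resultsListProviderExtra'.
    return '//' + tagName +
        "[contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')]";
}

const tolerantXpath = classXpath('div', 'resultsListProvider');
console.log(tolerantXpath);
```

For a page where each result div has only the one class, the simpler exact-match form is enough.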

Now we can combine both parts using AND to create the Final YQL Statement that we can data scrape:

SELECT * FROM html WHERE url="http://doctors.ucsd.edu/?sortby=familyName&sortDirection=asc&setsize=5" AND xpath="//div[@class='resultsListProvider']"
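
If the page URL or the XPATH needs to change often, the two parts can be joined in code. A minimal sketch, assuming both values are free of double quotes; the function name buildHtmlStatement is hypothetical:

```javascript
// Hypothetical helper: joins the page URL and the XPath into one YQL statement.
function buildHtmlStatement(pageUrl, xpath) {
    // Both values are wrapped in double quotes, so a double quote inside
    // either one would break the statement.
    if (pageUrl.includes('"') || xpath.includes('"')) {
        throw new Error('pageUrl and xpath must not contain double quotes');
    }
    return 'SELECT * FROM html WHERE url="' + pageUrl + '" AND xpath="' + xpath + '"';
}

const pageUrl = 'http://doctors.ucsd.edu/?sortby=familyName&sortDirection=asc&setsize=5';
const providerXpath = "//div[@class='resultsListProvider']";
const htmlStatement = buildHtmlStatement(pageUrl, providerXpath);
console.log(htmlStatement);
```

The XPATH keeps its single quotes so that it can sit inside the double-quoted value without escaping.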

The Final YQL Statement above will now provide usable results to work with in a new jsFiddle I've created, which has updated comments to reflect these changes. If you need to, you can combine both the XML File and HTML URL methods to satisfy your data scraping requirements, as each method provides content that the other method may lack.

Reminder: Some data might be directly rendered when the webpage loads or when using YQL Rest State query. That means your dynamic data might be based on their dynamic data. Oh my!

Reference Tutorial:

jsFiddle Data Scraping HTML Demo (See above for jsFiddle XML Demo)


Technique 3

EDIT 2: Using HTML Directly

jsFiddle Data Scraping HTML Demo: Clone That Webpage

The latest edit shows how to use the original webpage's style sheets (which is optional; you can create your own), but requests the Ajax data differently using the dataType setting. This approach places the exact markup on the local webpage, including any classnames or ids that come with it.

[jsFiddle Screenshot]

arttronics answered Oct 01 '22 10:10
