Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python / web scrape / aspx -- is it humanly possible when there are no forms?

Total noob, obviously. Teaching self Python for web scraping in the interest of open records/government transparency/reporting/etc.

There's an .aspx page I want to scrape, a week-by-week calendar for January - March 2012

But it has no forms ...

Perhaps you fine people can tell me if a solution is even possible before I spend days fighting with it.

http://webmail.legis.ga.gov/Calendar/default.aspx?chamber=house

The only way to see the appointments on the calendar is by choosing a day on a picture of a calendar. But, at least, if you click on Monday, it shows all the week's appointments. (I would like to gather all those appointments in order to count how often each committee meets, a bit of a proxy for counting what kind of legislation gets attention and what kind is ignored.)

But so, what strategy to use? It appears that each month at least down in its bowels is assigned to a sequential four-digit number prepended with a "V", like V4414, and days with a non-prepended number.

I'm only hunting Jan - Mar 2012; other months are non-germane and mostly empty.

a clue?

    ...<a href="javascript:__doPostBack('calMain','V4414')" style="color:#333333" title="Go to the previous month">February</a></td><td align="center" style="width:70%;">March 2012</td><td align="right" valign="bottom" style="color:#333333;font-size:8pt;font-weight:bold;width:15%;"><a href="javascript:__doPostBack('calMain','V4474')" style="color:#333333" title="Go to the next month">April</a></td></tr> 

a pattern?

    ...<td align="center" style="color:#999999;width:14%;"><a      href="javascript:__doPostBack('calMain','4439')" style="color:#999999" title="February 26">26</a></td><td align="center" style="color:#999999;width:14%;"><a href="javascript:__doPostBack('calMain','4440')" style="color:#999999" title="February 27">27</a></td><td align="center" style="color:#999999;width:14%;"><a href="javascript:__doPostBack('calMain','4441')" style="color:#999999" title="February 28">28</a></td>...

Cheers and thanks!!

like image 694
greencracker Avatar asked May 04 '12 03:05

greencracker


People also ask

How do you make Webscraping faster?

Minimize the number of requests sent If you can reduce the number of requests sent, your scraper will be much faster. For example, if you are scraping prices and titles from an e-commerce site, then you don't need to visit each item's page. You can get all the data you need from the results page.

What is Web scraping in Python?

Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, engineer, or anybody who analyzes large amounts of datasets, the ability to scrape data from the web is a useful skill to have.


2 Answers

The form contains four inputs with names of:

  • __EVENTTARGET
  • __EVENTARGUMENT
  • __VIEWSTATE
  • __EVENTVALIDATION

The latter two have initial values. You need to scrape those. The former two are set by those links. For example, you have a link:

<a href="javascript:__doPostBack('calMain','4504')" style="color:Black" title="May 01">1</a>

Look at the href:

javascript:__doPostBack('calMain','4504')

Somehow, parse those two strings out of it. The former is __EVENTTARGET. The latter is __EVENTARGUMENT.

Once you have all four pieces of data, you can issue a POST request to get the next page.

like image 96
icktoofay Avatar answered Sep 24 '22 00:09

icktoofay


You could replicate the POST request in Python using something like urllib.parse.urlencode to build the query strings.

For this you will have to find out what the query string looks like, obviously. Or, you could use another tool like Selenium RC.

like image 31
Russell Dias Avatar answered Sep 25 '22 00:09

Russell Dias