Looking at the HTML source code of
http://www.google.com/finance/historical?cid=983582&startdate=Nov+28,+2000&enddate=Nov+27,+2010&num=200
I see that Google never closes td and tr tags. There is no </tr> and no </td> in the source. Why?
<tr class=bb>
<th class="bb lm">Date
<th class="rgt bb">Open
<th class="rgt bb">High
<th class="rgt bb">Low
<th class="rgt bb">Close
<th class="rgt bb rm">Volume
<tr>
<td class="lm">Nov 26, 2010
<td class="rgt">11,183.50
<td class="rgt">11,183.50
<td class="rgt">11,067.17
<td class="rgt">11,092.00
<td class="rgt rm">68,396,121
<tr>
Is this meant to make the page harder to parse, since an XML parser won't be able to read it? I have noticed that &output=csv is not available for indices (this URL won't work: http://www.google.com/finance?q=INDEXDJX:.DJI&output=csv), whereas it is available for stocks (http://www.google.com/finance/historical?q=NASDAQ:GOOG&output=csv will work), so to get historical data in CSV for indices you have to do the parsing yourself!
This is HTML 4, not XML. As the W3C specification points out:
11.2.6 Table cells: The TH and TD elements
…
Start tag: required, End tag: optional
Ditto for tr:
11.2.5 Table rows: The TR element
…
Start tag: required, End tag: optional
I believe the intent is to minimize page size by omitting the optional end tags. Google also applies various additional optimizations that may actually result in invalid HTML, but browsers still handle them in tag-soup mode.
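Since the end tags are optional, a lenient HTML parser (rather than an XML parser) can still read the table. Below is a minimal sketch, assuming Python's standard-library html.parser module; the names OpenTagTableParser and parse_rows are made up for illustration, and SAMPLE is an abbreviated copy of the snippet from the question. The idea is to treat each new tr, td, or th start tag as implicitly closing the previous row or cell.

```python
from html.parser import HTMLParser


class OpenTagTableParser(HTMLParser):
    """Collect table rows from HTML that omits </tr>, </td> and </th>.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being filled, or None
        self._cell = None  # text fragments of the current cell, or None

    def _close_cell(self):
        # Finish the open cell (if any) and add it to the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        # Finish the open row (if any); empty rows are discarded.
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new <tr> implicitly ends the previous row.
            self._close_row()
            self._row = []
        elif tag in ("td", "th"):
            # A new cell implicitly ends the previous cell.
            self._close_cell()
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        # Explicit end tags, when present, are honoured as well.
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_row()  # flush whatever is still open at end of input


def parse_rows(html):
    parser = OpenTagTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = """<tr class=bb>
<th class="bb lm">Date
<th class="rgt bb">Open
<tr>
<td class="lm">Nov 26, 2010
<td class="rgt">11,183.50
<tr>
"""

rows = parse_rows(SAMPLE)
```

Here rows holds one list per table row, with the header row first. Numeric cells still carry thousands separators, so something like float(cell.replace(",", "")) would be needed before doing arithmetic on them.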