I seek a tool that can be run on the command line like so:
tablescrape 'http://someURL.foo.com' [n]
If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list.
If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV.
Potential additional features:
What would you use to cobble something like this together? The Perl module HTML::TableExtract might be a good place to start and can even handle the case of nested tables. This might also be a pretty short Python script with BeautifulSoup. Would YQL be a good starting point? Or, ideally, have you written something similar and have a pointer to it? (I'm surely not the first person to need this.)
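For what it's worth, the table-extraction step can be sketched with nothing but Python's standard library. This is only a rough illustration, not the BeautifulSoup or HTML::TableExtract approach mentioned above: the names TableParser and extract_tables are made up for this example, fetching the URL is left out, and nested tables are handled by keeping a stack of open tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []  # one list of rows per table, in order of opening tag
        self._open = []   # stack of tables currently open (supports nesting)

    @staticmethod
    def _flush(cur):
        # Finish the cell being read (if any) and append it to the current row.
        if cur["cell"] is not None:
            cur["rows"][-1].append(" ".join("".join(cur["cell"]).split()))
            cur["cell"] = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            rows = []
            self.tables.append(rows)
            self._open.append({"rows": rows, "cell": None})
            return
        if not self._open:
            return  # ignore markup outside any table
        cur = self._open[-1]
        if tag == "tr":
            self._flush(cur)  # tolerate a missing </td>
            cur["rows"].append([])
        elif tag in ("td", "th"):
            self._flush(cur)
            if not cur["rows"]:
                cur["rows"].append([])  # tolerate a missing <tr>
            cur["cell"] = []

    def handle_endtag(self, tag):
        if not self._open:
            return
        cur = self._open[-1]
        if tag in ("td", "th", "tr"):
            self._flush(cur)
        elif tag == "table":
            self._flush(cur)
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to the innermost open table's current cell.
        if self._open and self._open[-1]["cell"] is not None:
            self._open[-1]["cell"].append(data)


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Calling extract_tables(page_source) gives back a plain list of tables, so picking the nth one is just indexing. Real-world pages with rowspan/colspan or badly broken markup would need more care than this sketch takes.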
This is my first attempt:
http://yootles.com/outbox/tablescrape.py
It needs a bit more work, like better asciifying, but it's usable. For example, if you point it at this list of Olympic records:
./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics
it tells you that there are 8 tables available and it's clear that the 2nd and 3rd ones (men's and women's records) are the ones you want:
1: [ 1 cols, 1 rows] Contents 1 Men's rec
2: [ 7 cols, 25 rows] Event | Record | Name | Nation | Games | Date | Ref
3: [ 7 cols, 24 rows] Event | Record | Name | Nation | Games | Date | Ref
[...]
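A summary listing like the one above could be produced along these lines. This is a hedged sketch and not the code in tablescrape.py: the function name summarize and the max_cell truncation width are assumptions, as is the choice to count the header row in the row total.

```python
def summarize(tables, max_cell=20):
    """Return one summary line per table: index, column/row counts, first row.

    `tables` is a list of tables, each a list of rows, each a list of strings.
    """
    lines = []
    for i, rows in enumerate(tables, start=1):
        # Widest row decides the column count; an empty table has 0 columns.
        ncols = max((len(row) for row in rows), default=0)
        # Treat the first row as the header and truncate overly long cells.
        header = " | ".join(cell[:max_cell] for cell in rows[0]) if rows else ""
        lines.append(f"{i}: [{ncols:2d} cols, {len(rows):2d} rows] {header}")
    return lines
```

Printing "\n".join(summarize(tables)) gives the numbered list; the reader then reruns the tool with the index of the table they want.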
Then if you run it again, asking for the 2nd table,
./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics 2
you get a reasonable plaintext data table:
100 metres | 9.69 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 16, 2008 | [ 8 ]
200 metres | 19.30 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 20, 2008 | [ 8 ]
400 metres | 43.49 | Michael Johnson | United States (USA) | 1996 Atlanta | July 29, 1996 | [ 9 ]
800 metres | 1:42.58 | Vebjørn Rodal | Norway (NOR) | 1996 Atlanta | July 31, 1996 | [ 10 ]
1,500 metres | 3:32.07 | Noah Ngeny | Kenya (KEN) | 2000 Sydney | September 29, 2000 | [ 11 ]
5,000 metres | 12:57.82 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 23, 2008 | [ 12 ]
10,000 metres | 27:01.17 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 17, 2008 | [ 13 ]
Marathon | 2:06:32 | Samuel Wanjiru | Kenya (KEN) | 2008 Beijing | August 24, 2008 | [ 14 ]
[...]
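The output above is pipe-separated, whereas the question asked for CSV or TSV. Here is a minimal sketch of that last step using Python's csv module, which takes care of quoting cells that contain the delimiter (such as "August 16, 2008"). The function name write_table is hypothetical, and rows are assumed to already be lists of strings.

```python
import csv
import sys


def write_table(rows, out=None, delimiter=","):
    """Write rows (lists of strings) as CSV, or as TSV with delimiter='\\t'."""
    if out is None:
        out = sys.stdout
    # lineterminator="\n" avoids the csv module's default "\r\n" line endings.
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
```

With the default comma delimiter, a cell such as "1,500 metres" gets wrapped in double quotes, so the result should still load cleanly in a spreadsheet.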