right-click anywhere in the table and select 'copy whole table' start up a spreadsheet application such as LibreOffice Calc. paste into the spreadsheet (select appropriate separator character as needed) save/export the spreadsheet as CSV.
You can convert an Excel worksheet to a text file by using the Save As command. Go to File > Save As. Click Browse. In the Save As dialog box, under Save as type box, choose the text file format for the worksheet; for example, click Text (Tab delimited) or CSV (Comma delimited).
Any HTML table that you have created can be converted into an Excel Spreadsheet by using jQuery and it is compatible with all browsers.
This method is not really a library OR a program, but for ad hoc conversions you can
I know this works with Excel, and I believe I've done it with the OpenOffice spreadsheet.
But you probably would prefer a Perl or Ruby script...
Sorry for resurrecting an ancient thread, but I recently wanted to do this, but I wanted a 100% portable bash script to do it. So here's my solution using only grep and sed.
The below was bashed out very quickly, and so could be made much more elegant, but I'm just getting started really with sed/awk etc...
curl "http://www.webpagewithtableinit.com/" 2>/dev/null | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' | sed 's/^[\ \t]*//g' | tr -d '\n' | sed 's/<\/TR[^>]*>/\n/Ig' | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'
As you can see I've got the page source using curl, but you could just as easily feed in the table source from elsewhere.
Here's the explanation:
Get the Contents of the URL using cURL, dump stderr to null (no progress meter)
curl "http://www.webpagewithtableinit.com/" 2>/dev/null
.
I only want Table elements (return only lines with TABLE,TR,TH,TD tags)
| grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH'
.
Remove any Whitespace at the beginning of the line.
| sed 's/^[\ \t]*//g'
.
Remove newlines
| tr -d '\n\r'
.
Replace </TR>
with newline
| sed 's/<\/TR[^>]*>/\n/Ig'
.
Remove TABLE and TR tags
| sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig'
.
Remove ^<TD>
, ^<TH>
, </TD>$
, </TH>$
| sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig'
.
Replace </TD><TD>
with comma
| sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'
.
Note that if any of the table cells contain commas, you may need to escape them first, or use a different delimiter.
Hope this helps someone!
Here's a ruby script that uses nokogiri -- http://nokogiri.rubyforge.org/nokogiri/
require 'nokogiri'
doc = Nokogiri::HTML(table_string)
doc.xpath('//table//tr').each do |row|
row.xpath('td').each do |cell|
print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
end
print "\n"
end
Worked for my basic test case.
Here's a short Python program I wrote to complete this task. It was written in a couple of minutes, so it can probably be made better. Not sure how it'll handle nested tables (probably it'll do bad stuff) or multiple tables (probably they'll just appear one after another). It doesn't handle colspan
or rowspan
.
Enjoy.
from HTMLParser import HTMLParser
import sys
import re
class HTMLTableParser(HTMLParser):
def __init__(self, row_delim="\n", cell_delim="\t"):
HTMLParser.__init__(self)
self.despace_re = re.compile(r'\s+')
self.data_interrupt = False
self.first_row = True
self.first_cell = True
self.in_cell = False
self.row_delim = row_delim
self.cell_delim = cell_delim
def handle_starttag(self, tag, attrs):
self.data_interrupt = True
if tag == "table":
self.first_row = True
self.first_cell = True
elif tag == "tr":
if not self.first_row:
sys.stdout.write(self.row_delim)
self.first_row = False
self.first_cell = True
self.data_interrupt = False
elif tag == "td" or tag == "th":
if not self.first_cell:
sys.stdout.write(self.cell_delim)
self.first_cell = False
self.data_interrupt = False
self.in_cell = True
def handle_endtag(self, tag):
self.data_interrupt = True
if tag == "td" or tag == "th":
self.in_cell = False
def handle_data(self, data):
if self.in_cell:
#if self.data_interrupt:
# sys.stdout.write(" ")
sys.stdout.write(self.despace_re.sub(' ', data).strip())
self.data_interrupt = False
parser = HTMLTableParser()
parser.feed(sys.stdin.read())
Just to add to these answers (as i've recently been attempting a similar thing) - if Google spreadsheets is your spreadsheeting program of choice. Simply do these two things.
1. Strip everything out of your html file around the Table opening/closing tags and resave it as another html file.
2. Import that html file directly into google spreadsheets and you'll have your information beautifully imported (Top tip: if you used inline styles in your table, they will be imported as well!)
Saved me loads of time and figuring out different conversions.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With