How can I convert an HTML table to CSV?

People also ask

How do I export HTML table data as .CSV file?

Right-click anywhere in the table and select 'copy whole table'.
Start up a spreadsheet application such as LibreOffice Calc.
Paste into the spreadsheet (select the appropriate separator character as needed).
Save/export the spreadsheet as CSV.

How do I convert a table to CSV?

You can convert an Excel worksheet to a text file by using the Save As command. Go to File > Save As. Click Browse. In the Save As dialog box, under Save as type box, choose the text file format for the worksheet; for example, click Text (Tab delimited) or CSV (Comma delimited).

Can I convert HTML table to Excel?

Any HTML table that you have created can be converted into an Excel Spreadsheet by using jQuery and it is compatible with all browsers.

This method is not really a library or a program, but for ad hoc conversions you can:

put the HTML for a table in a text file called something.xls
open it with a spreadsheet
save it as CSV.

I know this works with Excel, and I believe I've done it with the OpenOffice spreadsheet.

But you probably would prefer a Perl or Ruby script...

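Sorry for resurrecting an ancient thread, but I recently needed to do this and wanted a portable bash script for it. So here's my solution using only curl, grep, tr and sed. (The patterns rely on GNU extensions such as \?, \| and the I flag, so they need GNU grep and GNU sed.)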

The below was bashed out very quickly, and so could be made much more elegant, but I'm just getting started really with sed/awk etc...

curl "http://www.webpagewithtableinit.com/" 2>/dev/null | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' | sed 's/^[\ \t]*//g' | tr -d '\n\r' | sed 's/<\/TR[^>]*>/\n/Ig' | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

As you can see I've got the page source using curl, but you could just as easily feed in the table source from elsewhere.

Here's the explanation:

Get the Contents of the URL using cURL, dump stderr to null (no progress meter)

curl "http://www.webpagewithtableinit.com/" 2>/dev/null

I only want Table elements (return only lines with TABLE,TR,TH,TD tags)

| grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH'

Remove any Whitespace at the beginning of the line.

| sed 's/^[\ \t]*//g'

Remove newlines

| tr -d '\n\r'

Replace </TR> with newline

| sed 's/<\/TR[^>]*>/\n/Ig'

Remove TABLE and TR tags

| sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig'

Remove ^<TD>, ^<TH>, </TD>$, </TH>$

| sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig'

Replace </TD><TD> with comma

| sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

Note that if any of the table cells contain commas, you may need to escape them first, or use a different delimiter.

Hope this helps someone!
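On the comma caveat in the pipeline above: rather than escaping by hand, one option is to let a CSV library do the quoting. Below is a minimal Python sketch using the standard csv module. The function name rows_to_csv and the sample rows are made up for illustration; it assumes the cell text has already been extracted into lists of strings.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (each a list of cell strings) to CSV text."""
    buf = io.StringIO()
    # lineterminator="\n" gives Unix line endings; the csv default is "\r\n".
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical cell values containing a comma and double quotes.
rows = [
    ["Name", "Note"],
    ["Smith, John", 'said "hi"'],
]
print(rows_to_csv(rows), end="")
# Name,Note
# "Smith, John","said ""hi"""
```

Fields containing the delimiter or a double quote are wrapped in quotes, and embedded quotes are doubled, which is what most spreadsheet programs expect when they read the file back.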

Here's a ruby script that uses nokogiri -- http://nokogiri.rubyforge.org/nokogiri/

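require 'nokogiri'
require 'csv'

# table_string holds the HTML that contains the table; here it is read from stdin.
table_string = $stdin.read
doc = Nokogiri::HTML(table_string)

doc.xpath('//table//tr').each do |row|
  # Take header and data cells, collapsing runs of whitespace inside each one.
  cells = row.xpath('th|td').map { |cell| cell.text.gsub(/\s+/, ' ').strip }
  # Array#to_csv quotes fields, doubles embedded quotes and appends the newline.
  print cells.to_csv
end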

Worked for my basic test case.

Here's a short Python program I wrote to complete this task. It was written in a couple of minutes, so it can probably be made better. Not sure how it'll handle nested tables (probably it'll do bad stuff) or multiple tables (probably they'll just appear one after another). It doesn't handle colspan or rowspan. Enjoy.

from html.parser import HTMLParser  # Python 3; on Python 2 use "from HTMLParser import HTMLParser"
import sys
import re


class HTMLTableParser(HTMLParser):
    def __init__(self, row_delim="\n", cell_delim="\t"):
        HTMLParser.__init__(self)
        self.despace_re = re.compile(r'\s+')
        self.data_interrupt = False
        self.first_row = True
        self.first_cell = True
        self.in_cell = False
        self.row_delim = row_delim
        self.cell_delim = cell_delim

    def handle_starttag(self, tag, attrs):
        self.data_interrupt = True
        if tag == "table":
            self.first_row = True
            self.first_cell = True
        elif tag == "tr":
            if not self.first_row:
                sys.stdout.write(self.row_delim)
            self.first_row = False
            self.first_cell = True
            self.data_interrupt = False
        elif tag == "td" or tag == "th":
            if not self.first_cell:
                sys.stdout.write(self.cell_delim)
            self.first_cell = False
            self.data_interrupt = False
            self.in_cell = True

    def handle_endtag(self, tag):
        self.data_interrupt = True
        if tag == "td" or tag == "th":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            #if self.data_interrupt:
            #   sys.stdout.write(" ")
            sys.stdout.write(self.despace_re.sub(' ', data).strip())
            self.data_interrupt = False


parser = HTMLTableParser() 
parser.feed(sys.stdin.read())
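The parser above writes tab-delimited text straight to stdout. If real CSV with quoting is wanted, a variation is to collect the cells into rows first and hand them to the csv module. This is only a sketch under the same limitations (no nested tables, no colspan or rowspan), and it additionally assumes every cell has an explicit closing tag. The names TableRowCollector and html_to_rows and the sample markup are hypothetical.

```python
import csv
import sys
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text fragments of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_to_rows(html):
    """Parse html and return its table rows as lists of cell strings."""
    parser = TableRowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample input; in practice the HTML could come from sys.stdin.read().
sample = (
    "<table><tr><th>Name</th><th>Note</th></tr>"
    "<tr><td>Smith, <b>John</b></td><td>ok</td></tr></table>"
)
csv.writer(sys.stdout, lineterminator="\n").writerows(html_to_rows(sample))
# Name,Note
# "Smith, John",ok
```

Keeping the rows in memory also makes it easy to post-process them (for example, dropping empty rows) before writing.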

Just to add to these answers (as I've recently been attempting a similar thing): if Google Spreadsheets is your spreadsheet program of choice, simply do these two things.

1. Strip everything outside the table's opening and closing tags from your HTML file and save the result as another HTML file.

2. Import that HTML file directly into Google Spreadsheets and your data will come through cleanly. (Top tip: if you used inline styles in your table, they will be imported as well!)

Saved me loads of time and figuring out different conversions.
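Step 1 can be done by hand in a text editor, but it can also be scripted. Here is a rough Python sketch that pulls the first table out of a page with a regular expression. The function name extract_first_table and the sample page are invented for illustration, and the non-greedy match means a nested table would cut the result short.

```python
import re

# Matches the first <table ...> ... </table> span, case-insensitively and
# across newlines. Non-greedy, so it stops at the first closing tag it finds.
TABLE_RE = re.compile(r"<table\b.*?</table\s*>", re.IGNORECASE | re.DOTALL)


def extract_first_table(html):
    """Return the first <table>...</table> block in html, or None if absent."""
    match = TABLE_RE.search(html)
    return match.group(0) if match else None


# Made-up page; real input would be the contents of your HTML file.
page = (
    "<html><body><h1>Report</h1>"
    "<table><tr><td>1</td></tr></table>"
    "<p>footer</p></body></html>"
)
print(extract_first_table(page))
# <table><tr><td>1</td></tr></table>
```

The returned string can then be written to a new .html file and imported as described in step 2.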
