
Parse HTML table columns with bash

I'm trying to extract 3 columns from an HTML table. I need hostname, Product + Region, and date added, which are columns 1, 3, and 4.

<div class="table sectionedit2">
  <table class="inline">
    <tr class="row0">
      <th class="col0 centeralign">hostname</th>
      <th class="col1 centeralign">AKA (Client hostname)</th>
      <th class="col2 leftalign">Product + Region</th>
      <th class="col3 centeralign">date added</th>
      <th class="col4 centeralign">  decom. date  </th>
      <th class="col5 centeralign">           builder           </th>
      <th class="col6 centeralign">  build cross-checker  </th>
      <th class="col7 leftalign"> <strong>decommissioner</strong></th>
      <th class="col8 centeralign">customer managed filesystems</th>
      <th class="col9 centeralign">  only company has root?  </th>
    </tr>
    <tr class="row1">
      <th class="col0 centeralign">HostName01</th>
      <td class="col1 leftalign">Host01</td>
      <td class="col2 leftalign">EU</td>
      <td class="col3 centeralign">2007-01-01</td>
      <td class="col4 leftalign"></td>
      <td class="col5 centeralign">Me</td>
      <td class="col6 centeralign">You</td>
      <td class="col7 leftalign">Builder01</td>
      <td class="col8 leftalign">xChecker01</td>
      <td class="col9 centeralign">yes</td>
    </tr>
   <tr class="row2">
     <th class="col0 centeralign">HostName02</th>
     <td class="col1 leftalign">Host02</td>
     <td class="col2 leftalign">U.S</td>
     <td class="col3 centeralign">2008-09-29</td>
     <td class="col4 leftalign"></td>
     <td class="col5 leftalign">Me01</td>
     <td class="col6 leftalign">You01</td>
     <td class="col7 leftalign">Builder02</td>
     <td class="col8 leftalign">xChecker02</td>
     <td class="col9 centeralign">yes</td>

I want to get:

Hostname     Product + Region   Date added

HostName01   EU                 2007-01-01

HostName02   U.S                2008-09-29

Previously I tried stripping the HTML tags and using awk, but some of the cells in the table are empty, so I didn't get columns 1, 3 and 4 for all the rows.
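
One way to sidestep the shifting columns without an XML parser is to select cells by their colN class instead of by field position. A minimal sketch, assuming every cell sits on its own line as in the table above; extract_cols is a hypothetical helper name, not part of any existing tool:

```shell
#!/usr/bin/env bash
# Sketch: select cells by their colN class rather than by field position,
# so an empty cell cannot shift the columns.
# Assumption: each <th>/<td> element sits on its own line.

extract_cols() {  # hypothetical helper; reads HTML on stdin
  awk '
    # Strip the opening tag, the closing tag, nested tags, and padding.
    function cell(line) {
      sub(/^[^>]*>/, "", line)
      sub(/<\/t[dh]>.*$/, "", line)
      gsub(/<[^>]*>/, "", line)
      gsub(/^[ \t]+|[ \t]+$/, "", line)
      return line
    }
    /<tr[ >]/         { host = region = "" }   # new row: forget old values
    /class="col0[ "]/ { host   = cell($0) }
    /class="col2[ "]/ { region = cell($0) }
    /class="col3[ "]/ { print host "\t" region "\t" cell($0) }
  '
}

# Demo on a row whose second cell is empty.
extract_cols <<'EOF'
<tr class="row1">
  <th class="col0 centeralign">HostName01</th>
  <td class="col1 leftalign"></td>
  <td class="col2 leftalign">EU</td>
  <td class="col3 centeralign">2007-01-01</td>
</tr>
EOF
```

The demo prints the three wanted fields separated by tabs. Because the row is emitted when the col3 cell is seen, a table that is cut off before its closing tags is still handled.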

I am trying to use:

xmllint --html --shell --format table.log <<< "cat //table/tr/th/td[1]/text()"

This gives me the second column. I tried "[0]", which doesn't work, and I'm not sure how to get multiple columns at once.

noooob asked Oct 12 '25 10:10



2 Answers

You can do the following:

  • run xmllint --xpath with an XPath expression that uses position()= to grab just columns 1, 3, and 4: //table/tr/*[position()=1 or position()=3 or position()=4]
  • pipe through perl -pe "s/<th class=\"col0/\n<th class=\"col0/g", etc., to strip out the markup and break it up into separate lines
  • pipe through grep -v '^\s*$' to strip out blank lines
  • pipe through column -t at the end to pretty-print it

Like this:

xmllint --html \
  --xpath "//table/tr/*[position()=1 or position()=3 or position()=4]" \
  table.log \
  | perl -pe "s/<th class=\"col0/\n<th class=\"col0/g" \
  | perl -pe 's/<tr[^>]+>//' \
  | perl -pe 's/<\/tr>//' \
  | perl -pe 's/<t[dh][^>]*>//' \
  | perl -pe 's/<\/t[dh]><t[dh][^>]*>/|/g' \
  | perl -pe 's/<\/t[dh]>//' \
  | grep -v '^\s*$' \
  | column -t -s '|'

The above assumes the HTML document is in the file table.log, which is the name used in the question. If the document is in some other file, substitute that filename.

That will give you output like this:

hostname    Product + Region  date added
HostName01  EU                2007-01-01
HostName02  U.S               2008-09-29
sideshowbarker (6 revs) answered Oct 16 '25 09:10



Assuming your HTML is well-formed XML, xmlstarlet can do it:

xmlstarlet sel -t -m '//table/tr' -v '*[contains(@class,"col0")]' -o $'\t' \
                                  -v '*[contains(@class,"col2")]' -o $'\t' \
                                  -v '*[contains(@class,"col3")]' -n       \
    file.html

That gives tab-separated output like this:

hostname    Product + Region    date added
HostName01  EU  2007-01-01
HostName02  U.S 2008-09-29
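
The tab-separated lines above do not line up because the field widths differ. Piping through column -t -s $'\t' would normally fix that. Where column is not installed, a two-pass awk can do the padding. A minimal sketch; align_tsv is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: left-align tab-separated fields, padding each column to its
# widest entry plus two spaces. The last field on a line is not padded.

align_tsv() {  # hypothetical helper; reads tab-separated lines on stdin
  awk -F'\t' '
    {
      rows[NR] = $0                       # remember every line
      for (i = 1; i <= NF; i++)           # track the widest entry per column
        if (length($i) > width[i]) width[i] = length($i)
    }
    END {
      for (r = 1; r <= NR; r++) {
        n = split(rows[r], f, "\t")
        line = ""
        for (i = 1; i < n; i++)
          line = line sprintf("%-" (width[i] + 2) "s", f[i])
        print line f[n]
      }
    }
  '
}

# Demo with the three lines shown above.
printf 'hostname\tProduct + Region\tdate added\n' >  /tmp/align_demo.tsv
printf 'HostName01\tEU\t2007-01-01\n'              >> /tmp/align_demo.tsv
printf 'HostName02\tU.S\t2008-09-29\n'             >> /tmp/align_demo.tsv
align_tsv < /tmp/align_demo.tsv
```

In practice the xmlstarlet command would be piped straight into align_tsv; the temporary file is only there to keep the demo independent of xmlstarlet.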
glenn jackman answered Oct 16 '25 09:10



