Extract part of the code and parse HTML in bash

Question

I have an external HTML site and I need to extract data from a table on it. However, the site's HTML is malformed everywhere except in the table itself, so I cannot use

xmllint --html --xpath <xpath> <file>

because it does not work properly when the HTML on the site is broken.

My idea was to use curl and delete the code above and below the table. Once the table is extracted, the code is clean and suitable for xmllint (so I can use XPath on it). However, deleting everything above the match is challenging in the shell, as noted here: sed doesn't backtrack; once it has processed a line, it is done. Is there a way to extract only the table's code from the HTML site in bash? Suppose the code has this structure:

<html>
<head>
</head>
<body>
<p>Lorem ipsum ....</p>
  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
<p>... dolor.</p>
</body>
</html>

I need output like this so that I can parse the data properly:

  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>

Please do not downvote me for trying to do this in bash.

Inian · Accepted Answer

I will break the answer down into steps. I used xmllint, which supports an --html flag for parsing HTML files.

First, you can check the sanity of your HTML file by parsing it as shown below. The command prints the parsed document and reports any parse errors it finds:

$ xmllint --html YourHTML.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
</head>
<body>
<p>Lorem ipsum ....</p>
  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
<p>... dolor.</p>
</body>
</html>

Here YourHTML.html is simply the input HTML from your question.

Now for the extraction part:

Select the table node with the XPath //html/body/table, which walks from the root node down to the table, and run xmllint with the HTML parser in interactive shell mode (xmllint --html --shell).

Running the command as is produces this result:

$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html
/ >  -------
<table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
/ >

The lines starting with "/ >" are the prompt of the xmllint shell. Removing them with sed, i.e. sed '/^\/ >/d', produces

$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d'
<table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>

which is the output structure you expected. Tested with xmllint using libxml version 20900.
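
As a sketch, the pipeline above could be wrapped in a small function so that the file and the XPath become parameters. The function name extract_table and the class-based default XPath are assumptions made for illustration; matching on the class attribute only avoids depending on the table being a direct child of <body>.

```shell
# extract_table FILE [XPATH]
# Print the node matched by XPATH without the "/ >" prompt lines that the
# xmllint shell writes. The name and the default XPath are hypothetical.
extract_table() {
  local file=$1
  local xpath=${2:-'//table[@class="my-table"]'}
  # 2>/dev/null discards parser warnings about markup outside the table.
  echo "cat $xpath" | xmllint --html --shell "$file" 2>/dev/null | sed '/^\/ >/d'
}
```

Called as extract_table YourHTML.html, this should print the same table markup as the pipeline above. If the XPath matches more than one node, the xmllint shell puts separator lines between them, which this sketch does not remove.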

Going one step further, if you want to fetch the values inside the table tags, you can add another sed command that strips the markup:

$ echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*>//g' | xargs
Company Contact
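
If the cell values may contain spaces, it can help to keep one value per line instead of joining everything with xargs. The following is a minimal sketch, assuming each <th> or <td> element opens and closes on a single line with one cell per line, as in the sample table. The function name cells is hypothetical.

```shell
# cells: read table markup on stdin and print the text of each <th>/<td>
# element on its own line. Lines without a complete cell are skipped, so
# the xmllint prompt lines disappear as well.
# Assumption: one cell per line, opening and closing on that same line.
cells() {
  sed -n 's:.*<t[hd][^>]*>\(.*\)</t[hd]>.*:\1:p'
}
```

It could be appended to the extraction pipeline, for example: echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | cells. For rows that hold several cells on one line, or cells that span lines, a real HTML parser is the safer choice.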

experiment.pl · Answer

For your purposes, a quick solution would be a one-liner:

sed -n '/<table class="my-table">/,/<\/table>/p'  <file>

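Explanation: -n suppresses sed's default output, and the range address prints every line from the one matching the opening <table class="my-table"> tag through the next one matching </table>.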

You could also easily put the tag name in a variable, e.g. for <body> or <p>, and change the output on the fly. The solution above gives you what you asked for using only sed, without an HTML parser.
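
The tag variable mentioned above could look like the following sketch. extract_block is a hypothetical helper name. It assumes the opening and closing tags sit on different lines, because sed only starts looking for the end of a range on the line after the one that opened it.

```shell
# extract_block TAG [FILE...]
# Print the lines from one containing "<TAG>" or "<TAG ...>" through the
# next one containing "</TAG>". Reads stdin when no file is given.
# Assumption: the element does not open and close on the same line.
extract_block() {
  local tag=$1
  shift
  # [ >] after the tag name keeps "p" from also matching "<pre".
  sed -n "/<${tag}[ >]/,/<\/${tag}>/p" "$@"
}
```

With the sample page, extract_block table <file> prints the table, and curl output can be piped in directly since the function falls back to stdin. An element such as <p>Lorem ipsum ....</p>, which closes on its opening line, breaks the assumption: the range would run on to the next line containing </p>.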

Tags: bash, html-parsing, sed

Pavol Travnik
