How to extract data from html table in shell script?

Tags: html, regex, shell, html-parsing, sed

I am trying to create a Bash script that would extract data from an HTML table. Below is an example of the table I need to extract data from:

<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.406 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.332 s</td></tr>
<tr><td>DVK_SEND</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>DVK_RECEIVE</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>GET_USER_INFO</td><td>OK</td><td>0.143 s</td></tr>
<tr><td>NOTIFICATIONS</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>ERROR_LOG</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>SUMMARY_STATUS</td><td>OK</td><td>0.888 s</td></tr>
</table>

And I want the BASH script to output it like so:

SAVE_DOCUMENT OK 0.475 s
GET_DOCUMENT OK 0.345 s
DVK_SEND OK 0.002 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 4.465 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.002 s
SUMMARY_STATUS OK 5.294 s

How can I do this?

So far I have tried using sed, but I don't know how to use it very well. I excluded the table header (Component, Status, Time / Error) with grep "<tr><td>", so only lines starting with <tr><td> are passed on to the next parsing step (sed). This is what I used: sed 's@<\([^<>][^<>]*\)>\([^<>]*\)</\1>@\2@g' But the <tr> tags still remain, and the strings are not separated. In other words, the result of this script is:

<tr>SAVE_DOCUMENTOK0.406 s</tr>

The full command of the script I'm working on is:

cat $FILENAME | grep "<tr><td>" | sed 's@<\([^<>][^<>]*\)>\([^<>]*\)</\1>@\2@g'

asked Jul 28 '11 05:07

Marko

3 Answers

You can use the html2text command and format the columns with column, e.g.:

$ html2text table.html | column -ts'|'

Component                                      Status  Time / Error
SAVE_DOCUMENT                                           OK            0.406 s     
GET_DOCUMENT                                            OK            0.332 s     
DVK_SEND                                                OK            0.001 s     
DVK_RECEIVE                                             OK            0.001 s     
GET_USER_INFO                                           OK            0.143 s     
NOTIFICATIONS                                           OK            0.001 s     
ERROR_LOG                                               OK            0.001 s     
SUMMARY_STATUS                                          OK            0.888 s

then parse it further from there (e.g. cut, awk, ex).
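
As a rough illustration of that further parsing, here is a minimal sketch with awk. It assumes the formatted text looks like the output shown above (one header line, then whitespace-separated columns); the sample text and the squeeze_columns helper are made up for this example, and real html2text output may differ between versions.

```shell
#!/usr/bin/env bash
# Sample modelled on the output shown above (column widths shortened).
formatted=$(cat <<'EOF'
Component       Status  Time / Error
SAVE_DOCUMENT   OK      0.406 s
GET_DOCUMENT    OK      0.332 s
EOF
)

# squeeze_columns is a hypothetical helper: it skips the header line
# (NR > 1) and reprints the four whitespace-separated words of each row
# (name, status, number, unit) joined by single spaces.
squeeze_columns() {
    awk 'NR > 1 { print $1, $2, $3, $4 }'
}

printf '%s\n' "$formatted" | squeeze_columns
```

This relies on awk's default splitting on runs of whitespace, so it only holds while no cell other than the time contains a space.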

In case you'd like to sort it first, you can use ex, see the example here or here.

answered Sep 24 '22 22:09

kenorb

Go with (g)awk, it's capable :-). Here is a solution, but please note: it only works with the exact HTML table format you posted.

 awk -F "</*td>|</*tr>" '/<\/*t[rd]>.*[A-Z][A-Z]/ {print $3, $5, $7 }' FILE

Here you can see it in action: https://ideone.com/zGfLe

Some explanation:

1. -F sets the input field separator to a regexp (any opening or closing tr or td tag),
2. then it works only on lines that match those tags AND contain at least two consecutive uppercase letters,
3. then it prints the needed fields.
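
To see the three steps at work, the one-liner can be tried on a small sample. This is only a sketch: the temporary file and the extract_rows and show_fields helpers are made up for the example, and LC_ALL=C is an added assumption so that [A-Z] matches uppercase letters only in every awk implementation.

```shell
#!/usr/bin/env bash
# Hypothetical sample file; the rows are copied from the question.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.406 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.332 s</td></tr>
</table>
EOF

# extract_rows wraps the awk one-liner from above. LC_ALL=C is an
# addition, not part of the original command.
extract_rows() {
    LC_ALL=C awk -F '</*td>|</*tr>' \
        '/<\/*t[rd]>.*[A-Z][A-Z]/ { print $3, $5, $7 }' "$1"
}

# show_fields prints every field of one data row. Each tag acts as a
# separator, so the cell texts land in fields 3, 5 and 7 and the
# fields between adjacent tags are empty.
show_fields() {
    LC_ALL=C awk -F '</*td>|</*tr>' \
        '/SAVE_DOCUMENT/ { for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }' "$1"
}

extract_rows "$sample"
show_fields "$sample"
```

The header cells are skipped because none of them contains two uppercase letters in a row, which is also why the approach breaks as soon as the table layout or the naming of the components changes.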

HTH

answered Oct 18 '22 00:10

Zsolt Botykai

You can use the xpath command (from the XML::XPath Perl module) to accomplish that task very easily:

xpath -e '//tr[position()>1]' test_input1.xml 2> /dev/null | sed -e 's/<\/*tr>//g' -e 's/<td>//g' -e 's/<\/td>/ /g'
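
If the xpath tool is not at hand, the sed stage of the command above can be tried on its own. The following is a minimal sketch, assuming every data row sits on a single line starting with <tr><td> as in the question; the sample file and the strip_tags helper are made up for this example, and the final expression (trimming the trailing space) is an addition to the original sed call.

```shell
#!/usr/bin/env bash
# Hypothetical sample file; the rows are copied from the question.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.406 s</td></tr>
<tr><td>DVK_SEND</td><td>OK</td><td>0.001 s</td></tr>
</table>
EOF

# strip_tags is a hypothetical helper: grep selects the data rows, then
# sed drops the <tr> tags and opening <td> tags, turns each closing
# </td> into a space, and trims the space left at the end of the line.
strip_tags() {
    grep '^<tr><td>' "$1" |
        sed -e 's/<\/*tr>//g' -e 's/<td>//g' -e 's/<\/td>/ /g' -e 's/ *$//'
}

strip_tags "$sample"
```

Unlike the xpath version, this does not parse the markup at all, so it depends on the one-row-per-line layout of the posted table.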

answered Oct 17 '22 23:10

Emiliano Poggi
