Bash: format list elements in HTML

I have no bash experience and just want to know how to get started.

I have to write a bash script that properly formats an XHTML document. For example, it should turn this:

   <p>Test</p><ol><li>Test
    </li><li>
    Test</li></ol>

into this:

<p>Test</p>
<ol>
  <li>Test</li>
  <li>Test</li>
</ol>

Now I believe I have to do something like:

cat > format1 #create file
#!/bin/bash
if tail of a line ends with "</A-a>": (like </li> or </ol> or </p> or </ul>)
    add \n 
    fi

if head of a line = <ol> or <ul>
    add \n
    fi

Please help me understand it. This is all I can think of and I really would like to know how to solve it.

Jp Morgan asked May 06 '15 03:05


2 Answers

Given the constraints that the problem must be solved with a bash script and that you cannot use htmltidy, I'd get started by creating a file xhtmltidy.sh which contains:

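#!/bin/bash
# Requires GNU sed: \s, \+ and \n in the replacement are GNU extensions.

set -f                              # keep the unquoted $( cat ) from expanding * or ?

echo $( cat )                       |\
    sed 's/\s*\(<[^>]\+>\)\s*/\1/g' |\
    sed 's/></>\n</g'               |\
    awk '{
        # line is only a closing tag: drop one indent level before printing
        if ( $0 ~ /^<\/[^>]+>$/ ) indent=substr(indent,3);
        print indent$0;
        # line is only an opening tag (including one-letter names): indent what follows
        if ( $0 ~ /^<[^\/>][^>]*>$/ ) indent=indent"  ";
    }'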

To use this program, make it executable with chmod +x xhtmltidy.sh and pipe the content into it like this:

cat sexist.html | ./xhtmltidy.sh

This will at least do the trick given the sample input that you provided.

Some explanation:

  • echo $( cat ) reads all of stdin and, because the substitution is unquoted, collapses it into a single line of text
  • the first sed strips the whitespace on either side of each XHTML tag
  • the second sed puts a newline between adjacent XHTML tags
  • awk reduces the indent if a line is only a closing XHTML tag (such as </ol>)
  • awk prints the line with the current indent
  • awk increases the indent if a line is only an opening XHTML tag (such as <ol>)
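
The steps above can be watched one at a time. What follows is a rough sketch rather than a replacement for the script: it assumes GNU sed, the variable names are made up for illustration, it indents by two spaces to match the output requested in the question, and it accepts one-letter tag names such as <p> as opening tags.

```shell
#!/bin/bash
# Sketch only: each pipeline stage run separately on the question's sample.
# Assumes GNU sed; variable names are illustrative.

input='   <p>Test</p><ol><li>Test
    </li><li>
    Test</li></ol>'

# Stage 1: the unquoted expansion is word-split and echo rejoins the words
# with single spaces, so the text ends up on one line.
set -f                      # no globbing while $input is unquoted
joined=$(echo $input)
set +f

# Stage 2: drop whitespace on both sides of every tag.
stripped=$(printf '%s\n' "$joined" | sed 's/\s*\(<[^>]\+>\)\s*/\1/g')

# Stage 3: start a new line wherever two tags touch.
split=$(printf '%s\n' "$stripped" | sed 's/></>\n</g')

# Stage 4: indent by nesting depth, two spaces per level.
formatted=$(printf '%s\n' "$split" | awk '{
    if ( $0 ~ /^<\/[^>]+>$/ ) indent = substr(indent, 3)   # closing tag only
    print indent $0
    if ( $0 ~ /^<[^\/>][^>]*>$/ ) indent = indent "  "     # opening tag only
}')

printf 'after stage 1: %s\n' "$joined"
printf 'after stage 2: %s\n' "$stripped"
printf 'after stage 4:\n%s\n' "$formatted"
```

With the sample input, the last stage prints the five-line layout the question asks for.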

This toy program will break as soon as the input gets more complex. For example, a self-closing tag such as <br/> is treated as an opening tag, so the indent never comes back down. But that should give you some idea why it's better to use an off-the-shelf utility than to write your own.

ddoxey answered Sep 22 '22 14:09


Use html-tidy. If you plan to use tidy regularly, it is a good idea to add this to your .bashrc:

alias tidy="tidy -xml --indent auto --indent-spaces 1 --quiet yes -im"

The above command creates an alias for tidy that treats the input as well-formed XML (-xml), indents with a single space (--indent auto --indent-spaces 1, plus -i), suppresses non-essential output (--quiet yes), and modifies the file in place (-m).
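
Because the alias edits files in place, and bash does not expand aliases in non-interactive scripts by default, a small function can be handier inside a script. This is only a sketch: tidy_xhtml is a hypothetical helper name, and it assumes the tidy binary is on your PATH. It passes the same options as the alias, minus -m, so the result goes to stdout and the file on disk is left alone.

```shell
#!/bin/bash
# Sketch only: tidy_xhtml is a made-up helper name, not part of HTML Tidy.

tidy_xhtml() {
    # Bail out early if the tidy binary is not installed.
    if ! command -v tidy >/dev/null 2>&1; then
        echo "tidy is not installed" >&2
        return 127
    fi
    # Same options as the alias, but without -m: print the formatted
    # document to stdout instead of rewriting the input file.
    tidy -xml --indent auto --indent-spaces 1 --quiet yes "$1"
}
```

After sourcing it, tidy_xhtml page.xhtml > page.formatted.xhtml writes the formatted copy to a new file (both file names are placeholders).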

rjv answered Sep 23 '22 14:09