Get content between a pair of HTML tags using Bash

Question

I need to get the HTML content between a given pair of tags using a bash script. For example, given the HTML below:

<html>
<head>
</head>
<body>
 text
  <div>
  text2
    <div>
        text3
    </div>
  </div>
</body>
</html>

the bash command or script, given the body tag, should return:

 text
  <div>
  text2
    <div>
        text3
    </div>
  </div>

Thanks in advance.

BMW · Accepted Answer

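You can do this with sed in the shell, so nothing extra needs to be installed. Note that the range also prints the lines holding the opening and closing tags, and that it only matches tags written without attributes.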

tag=body
sed -n "/<$tag>/,/<\/$tag>/p" file
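If the tag lines themselves should be dropped, as in the expected output in the question, a variation along these lines might work. This is only a sketch: the function name between_tags and the temporary sample file are made up for illustration, and it assumes the opening and closing tags each sit on their own line, carry no attributes, and are not nested inside a tag of the same name.

```shell
# Hypothetical helper: print the lines between <tag> and </tag>,
# leaving out the two tag lines. $1 is the tag name, $2 is the file.
between_tags() {
  sed -n "/<$1>/,/<\/$1>/{
    /<$1>/d
    /<\/$1>/d
    p
  }" "$2"
}

# Sample input taken from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
<head>
</head>
<body>
 text
  <div>
  text2
    <div>
        text3
    </div>
  </div>
</body>
</html>
EOF

result=$(between_tags body "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

With tag set to div this would stop at the first closing div, which is one reason the parser-based answers are safer for nested markup.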

Kent · Answer

Plain text processing is not a good fit for parsing HTML/XML. I hope this gives you some ideas:

kent$  xmllint --xpath "//body" f.html 
<body>
 text
  <div>
  text2
    <div>
        text3
    </div>
  </div>
</body>
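The output above still includes the body tags. To get only what is inside them, selecting the child nodes with node() may be enough. A rough sketch, assuming xmllint is installed and the input is well-formed XML like the sample (the temporary file is only for illustration; for messier real-world pages, xmllint's --html option switches to its HTML parser):

```shell
# Sample input taken from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
<head>
</head>
<body>
 text
  <div>
  text2
    <div>
        text3
    </div>
  </div>
</body>
</html>
EOF

result=
if command -v xmllint >/dev/null 2>&1; then
  # //body/node() selects the children of body instead of body itself.
  result=$(xmllint --xpath '//body/node()' "$sample")
  printf '%s\n' "$result"
else
  echo "xmllint not found; it usually ships with the libxml2 utilities" >&2
fi
rm -f "$sample"
```

The exact whitespace between the printed nodes can differ between libxml2 versions, so treat the formatting as approximate.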

Cromax · Answer

Personally, I find the hxselect command (often together with hxclean) from the html-xml-utils package very useful. hxclean repairs a (sometimes broken) HTML file into well-formed XML, and hxselect lets you use CSS selectors to pick out the node(s) you need. With the -c option, it strips the surrounding tags. Both commands read from stdin and write to stdout. So in your case you would run:

$ hxselect -c body <<HTML
  <html>
  <head>
  </head>
  <body>
    text
    <div>
      text2
      <div>
        text3
      </div>
    </div>
  </body>
  </html>
  HTML

to get what you need. Plain and simple.

Paulo Fidalgo · Answer

Setting Bash aside because of its limitations, you can use nokogiri as a command-line utility, as explained here.

Example:

curl -s http://example.com/ | nokogiri -e 'puts $_.search('\''a'\'')'

mklement0 · Answer

Another option is to use the multi-platform xidel utility (home page on SourceForge, GitHub repository), which can handle both XML and HTML:

xidel -s in.html -e '/html/body/node()' --printed-node-format=html

The above prints the resulting HTML with syntax highlighting (colored), and seemingly with an empty line after the text node.

If you want the text only, Reino points out that you can simplify to:

xidel -s in.html -e '/html/body/inner-html()'

Tags: html, bash

Joao
