I need to get the HTML contents between a given pair of tags using a bash script. For example, given the HTML below:
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
Given the body tag, the bash command/script should output:
text
<div>
text2
<div>
text3
</div>
</div>
Thanks in advance.
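You can use sed in shell/bash, so you don't need to install anything else. Note that this prints the lines holding the opening and closing tags too, and it assumes each of those tags sits on its own line: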
tag=body
sed -n "/<$tag>/,/<\/$tag>/p" file
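The sed range above also prints the lines that hold the opening and closing tags, while the question asks for the contents only. A minimal sketch of one way to drop those two lines follows; inner_html is a hypothetical helper name, and it assumes the tags have no attributes, sit on their own lines, and are not nested inside themselves (with tag=div the range would stop at the first closing div).

```shell
# Hypothetical helper: print the lines strictly between <tag> and </tag>.
# $1 = tag name (no attributes assumed), $2 = input file.
inner_html() {
  # Select the range from <tag> to </tag>, delete the two tag lines,
  # and print everything else inside the range.
  sed -n "/<$1>/,/<\/$1>/{/<$1>/d;/<\/$1>/d;p;}" "$2"
}
```

Calling inner_html body file on the sample document should print the seven lines from "text" down to the outer closing div.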
Plain text processing is not a good fit for parsing HTML/XML. I hope this gives you some ideas:
kent$ xmllint --xpath "//body" f.html
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
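The xmllint output above still includes the body tags themselves. As a rough sketch, selecting the child nodes with node() instead of the element should leave only the contents; body_nodes is a hypothetical wrapper name, and the --html flag (HTML parser) and the stderr redirect are assumptions added here, not part of the original answer. Exact whitespace between the printed nodes may vary across libxml2 versions.

```shell
# Hypothetical wrapper: print the child nodes of the first-argument tag.
# $1 = tag name, $2 = input file. Requires xmllint (libxml2).
body_nodes() {
  # --html parses the input as HTML; node() selects the children of the
  # element rather than the element itself. Parser warnings are discarded.
  xmllint --html --xpath "//$1/node()" "$2" 2>/dev/null
}
```

For the sample document, body_nodes body f.html would be the equivalent call.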
Personally, I find it very useful to use the hxselect command (often with the help of hxclean) from the html-xml-utils package. The latter turns a (sometimes broken) HTML file into well-formed XML, and the former lets you use CSS selectors to pick the node(s) you need. With the -c option, it strips the surrounding tags. All of these commands read from stdin and write to stdout. So in your case you would run:
$ hxselect -c body <<HTML
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
HTML
to get what you need. Plain and simple.
Setting Bash aside because of its limitations, you can use nokogiri as a command-line utility, as explained here.
Example:
curl -s http://example.com/ | nokogiri -e 'puts $_.search('\''a'\'')'
Another option is to use the multi-platform xidel utility (home page on SourceForge, GitHub repository), which can handle both XML and HTML:
xidel -s in.html -e '/html/body/node()' --printed-node-format=html
The above prints the resulting HTML with syntax highlighting (colored), and seemingly with an empty line after the text node.
If you want the inner HTML as plain text only, Reino points out that you can simplify to:
xidel -s in.html -e '/html/body/inner-html()'