Get content between a pair of HTML tags using Bash

Tags: html, bash

I need to get the HTML content between a given pair of tags using a Bash script. For example, given the HTML below:

<html>
<head>
</head>
<body>
 text
  <div>
  text2
    <div>
        text3
    </div>
  </div>
</body>
</html>

Given the body tag, the Bash command/script should output:

 text
  <div>
  text2
    <div>
    text3
    </div>
  </div>

Thanks in advance.

Joao asked Jan 09 '14 08:01


5 Answers

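You can do this with sed, which is already available in a shell/Bash environment, so nothing else needs to be installed. Note that sed works line by line: the command below also prints the lines containing the tags themselves, and it only matches an opening tag written without attributes.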

tag=body
sed -n "/<$tag>/,/<\/$tag>/p" file
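
The range above prints the <body> and </body> lines as well. Here is a minimal sketch of one way to drop them, assuming (as in the question) that each tag sits on its own line and the opening tag has no attributes. The function name inner_lines is made up for illustration and is not part of the original answer:

```shell
# inner_lines TAG FILE  (hypothetical helper)
# Prints the lines strictly between the first <TAG> line and the next </TAG> line.
inner_lines() {
  # $1 = tag name, $2 = input file
  # The first sed selects the range including both tag lines,
  # the second sed deletes the first and the last line of that range.
  sed -n "/<$1>/,/<\/$1>/p" "$2" | sed '1d;$d'
}
```

It would be called as `inner_lines body file`. Because the range ends at the first closing tag it meets, nested tags of the same name (for example div inside div) are not handled correctly by this approach.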

BMW answered Nov 14 '22 10:11


Plain-text processing is a poor fit for parsing HTML/XML. I hope this gives you some idea:

kent$  xmllint --xpath "//body" f.html 
<body>
 text
  <div>
  text2
    <div>
        text3
    </div>
  </div>
</body>
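
The output above still includes the <body> tags themselves. As a sketch, selecting the child nodes with node() leaves them out, and the --html option makes xmllint use its HTML parser for pages that are not well-formed XML. The wrapper name body_content is an assumption for illustration, and the exact whitespace between the printed nodes may differ across libxml2 versions:

```shell
# body_content FILE  (hypothetical wrapper around xmllint)
body_content() {
  # --html          parse with libxml2's HTML parser instead of the XML parser
  # //body/node()   every child node of <body> (text and elements), not <body> itself
  # 2>/dev/null     hide parser warnings about sloppy markup
  xmllint --html --xpath '//body/node()' "$1" 2>/dev/null
}
```

It would be called as `body_content f.html`.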

Kent answered Nov 14 '22 09:11


Personally, I find the hxselect command (often together with hxclean) from the html-xml-utils package very useful. hxclean turns a (sometimes broken) HTML file into correct XML, and hxselect lets you use CSS selectors to pick the node(s) you need. With the -c option, hxselect strips the surrounding tags and prints only the content. All these commands read from stdin and write to stdout. So in your case you would run:

$ hxselect -c body <<HTML
<html>
<head>
</head>
<body>
  text
  <div>
    text2
    <div>
      text3
    </div>
  </div>
</body>
</html>
HTML

to get what you need. Plain and simple.

Cromax answered Nov 14 '22 09:11


Setting Bash aside because of its limitations, you can use nokogiri as a command-line utility, as explained here.

Example:

curl -s http://example.com/ | nokogiri -e 'puts $_.search('\''a'\'')'

Paulo Fidalgo answered Nov 14 '22 10:11


Another option is to use the multi-platform xidel utility (home page on SourceForge, GitHub repository), which can handle both XML and HTML:

xidel -s in.html -e '/html/body/node()' --printed-node-format=html

The above prints the resulting HTML with syntax highlighting (colored), and seemingly with an empty line after the text node.

If you want the text only, Reino points out that you can simplify to:

xidel -s in.html -e '/html/body/inner-html()'

mklement0 answered Nov 14 '22 10:11