Extract HTML tag data with sed

Question

I wish to extract data between known HTML tags. For example:

Hello, <i>I<i> am <i>very</i> glad to meet you.

Should become:

'I

very'

I have found something that nearly does this. Unfortunately, it only extracts the last entry.

sed -n -e 's/.*<i>\(.*\)<\/i>.*/\1/p'

I can append a newline character after every </i> end tag, and then this works fine. But is there a way to do it with just one sed command?

Dennis Williamson · Accepted Answer

Give this a try:

sed -n 's|[^<]*<i>\([^<]*\)</i>[^<]*|\1\n|gp'

And your example is missing a "/":

Hello, <i>I</i> am <i>very</i> glad to meet you.
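
As a quick check, here is a minimal sketch that runs the command on the corrected line. It assumes GNU sed, since not every sed treats \n in the replacement as a newline; the variable names input and result are only for illustration. The version in the question grabs only the last entry because its leading .* is greedy and swallows everything up to the final <i>.

```shell
# Corrected sample line from the question.
input='Hello, <i>I</i> am <i>very</i> glad to meet you.'

# [^<]* cannot run past the next "<", so each match covers exactly one
# <i>...</i> pair plus the plain text around it. The g flag repeats the
# substitution for every pair on the line.
# Assumes GNU sed: \n in the replacement becomes a newline.
result=$(printf '%s\n' "$input" |
  sed -n 's|[^<]*<i>\([^<]*\)</i>[^<]*|\1\n|gp')

# Command substitution drops the trailing blank line that sed emits.
printf '%s\n' "$result"
```

With GNU sed this should print I and very on separate lines.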

lattimore · Answer

Try this:

$ sed 's/<[^>]*>//g' file.html
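
Note that on the sample line this removes the tags but keeps all of the surrounding text. One possible way to get only the tagged words, sketched below, is to isolate each pair with grep -o first. The -o option is an extension found in GNU and BSD grep rather than POSIX, so treat it as an assumption about your system; the variable names are illustrative.

```shell
input='Hello, <i>I</i> am <i>very</i> glad to meet you.'

# The tag-stripping command alone keeps every word of the line.
stripped=$(printf '%s\n' "$input" | sed 's/<[^>]*>//g')
printf '%s\n' "$stripped"

# grep -o prints each <i>...</i> match on its own line; the same sed
# command then removes the tags from those matches.
extracted=$(printf '%s\n' "$input" |
  grep -o '<i>[^<]*</i>' |
  sed 's/<[^>]*>//g')
printf '%s\n' "$extracted"
```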
