
How to delete string between two HTML tags in one row using bash script

Tags:

regex

bash

I have been working on a simple bash script that parses specific data from web pages. I used tr '\r\n' ' ' <file1.txt >file2.txt so that all the data extracted from the page (stored in file1.txt) ends up on a single line in file2.txt. Now I need to match every string between <th>...</th> tags on that line and either delete it or replace it with a ' ' character. Here is some example input:

    <td>Abaktal hm</td> </tr> <tr> <th>Package</th> <td>flm 10x400 mg</td> <th>Indesit</th>

I have used sed and tried something like

    sed -i 's/\<th\>.*?\<\/th\>/ /g' output.txt

But it didn't work. I think the problem is the ? sign. It works as a lazy quantifier in other regular expression flavors, but probably not in bash.
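A minimal sketch of what goes wrong, assuming GNU sed and using shell variables (line, attempt, greedy are names made up for this illustration) instead of a real file. sed uses POSIX-style regular expressions, which have no lazy .*? quantifier; in GNU sed's basic syntax an unescaped ? is a literal question mark, and \< and \> are word-boundary anchors rather than escaped angle brackets. Simply dropping the ? does not help either, because a greedy .* runs from the first <th> to the last </th>.

```shell
# Sample line from the question, kept in a variable so no file is modified.
line='<td>Abaktal hm</td> </tr> <tr> <th>Package</th> <td>flm 10x400 mg</td> <th>Indesit</th>'

# The attempted expression: there is no literal '?' in the input,
# so nothing matches and the line comes back unchanged (GNU sed assumed).
attempt=$(printf '%s\n' "$line" | sed 's/\<th\>.*?\<\/th\>/ /g')
printf 'attempt: %s\n' "$attempt"

# Greedy variant without the '?': .* extends to the LAST </th>,
# so the <td>flm 10x400 mg</td> cell in between is removed as well.
greedy=$(printf '%s\n' "$line" | sed 's/<th>.*<\/th>/ /g')
printf 'greedy:  %s\n' "$greedy"
```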

UncleSam asked May 15 '26 07:05


1 Answer

While I agree with sputnick and others, the answer to your immediate question would be:

    sed -r -i 's/<th>[^<]+<\/th>//g' output.txt

This works on your sample data just fine.
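The negated character class can be tried against the sample line without touching any file. This is a minimal sketch, assuming a sed that accepts -E for extended regular expressions (current GNU and BSD sed both do); the variable names line and cleaned are made up for the illustration, and the replacement here is a single space, which the question also allows.

```shell
# Sample line from the question.
line='<td>Abaktal hm</td> </tr> <tr> <th>Package</th> <td>flm 10x400 mg</td> <th>Indesit</th>'

# -E turns on extended regex so that + means "one or more".
# [^<]+ cannot cross a '<', so each match ends at its own </th>
# instead of running on to the last one in the line.
cleaned=$(printf '%s\n' "$line" | sed -E 's/<th>[^<]+<\/th>/ /g')
printf '%s\n' "$cleaned"
```

Two caveats: [^<]+ will not match a <th> cell that is empty or that contains nested tags. Also, when editing a file in place with GNU sed, keep the options separate (-E -i or -r -i), because a combined -ir is read as -i with the backup suffix r, which leaves extended regex switched off.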

weldabar answered May 17 '26 01:05


