I'm trying to extract the contents of an HTML list using awk. Some list entries are multi-line.
Example input list:
<ul>
<li>
<b>2021-07-21:</b> Lorem ipsum
</li>
<li>
<b>2021-07-19:</b> Lorem ipsum
</li>
<li><b>2021-07-10:</b> Lorem ipsum</li>
</ul>
Command I'm using:
awk -v RS="" '{match($0, /<li>(.+)<\/li>/, entry); print entry[1]}' file.html
Current output:
<b>2021-07-21:</b> Lorem ipsum
</li>
<li>
<b>2021-07-19:</b> Lorem ipsum
</li>
<li><b>2021-07-10:</b> Lorem ipsum
Desired output:
<b>2021-07-21:</b> Lorem ipsum
<b>2021-07-19:</b> Lorem ipsum
<b>2021-07-10:</b> Lorem ipsum
I know this happens because the list entries are not separated by empty lines, so RS="" does not split them into separate records. I thought of using non-greedy matching, but apparently awk doesn't support it. Is there a possible workaround?
With GNU awk for multi-char RS and the \s shorthand for [[:space:]]:
$ awk -v RS='\\s*</?li>\\s*' '!(NR%2)' file
<b>2021-07-21:</b> Lorem ipsum
<b>2021-07-19:</b> Lorem ipsum
<b>2021-07-10:</b> Lorem ipsum
I assume you either don't really want the leading white space shown in the desired output in your question, or you don't care whether it's present.
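To see why `!(NR%2)` selects the right records, it can help to print every record together with its number. The following is only a minimal sketch, assuming GNU awk is installed as `gawk`; `show_records` is a hypothetical helper name, not something from the answer above.

```shell
# Hypothetical helper: print each record produced by the RS regex
# from the answer, prefixed with its record number (NR).
# Assumes GNU awk is available as `gawk` (needed for \s in RS).
show_records() {
    gawk -v RS='\\s*</?li>\\s*' '{ printf "%d: [%s]\n", NR, $0 }' "$@"
}

# Usage: show_records file.html
```

Every `<li>` or `</li>` tag, together with the whitespace around it, acts as a record separator. The text before the first `<li>` is record 1, each item body lands in an even-numbered record, and the (possibly empty) text between a `</li>` and the next `<li>` lands in an odd-numbered one. That is why keeping only even values of NR prints the items.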
With your shown samples, please try the following awk code. Written and tested in GNU awk.
awk -v RS='</li>' '
match($0,/<li>.*/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/<li>\n*[[:space:]]*|\n*[[:space:]]*$/,"",val)
  print val
}
' Input_file
Explanation: a detailed, line-by-line explanation of the above.
awk -v RS='</li>' ' ##Start the awk program and set RS to </li>, so each record ends at a closing tag.
match($0,/<li>.*/){ ##Match from <li> to the end of the record (in awk, .* also matches newlines).
  val=substr($0,RSTART,RLENGTH) ##Store the matched text in val.
  gsub(/<li>\n*[[:space:]]*|\n*[[:space:]]*$/,"",val) ##In val, replace <li> plus any following newlines and white space, OR any trailing newlines and white space, with an empty string.
  print val ##Print val.
}
' Input_file ##Name of the input file.
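If a regex or multi-character RS is not available in your awk, the shortest-match behaviour of a non-greedy pattern can be imitated with index(), which always finds the first occurrence of a string. The following is only a sketch under that assumption, not a tested replacement for the commands above; `extract_li` is a hypothetical function name, and it does not handle nested lists or `<li>` tags with attributes.

```shell
# Hypothetical helper: print the body of every <li>...</li> pair,
# using only POSIX awk features (no regex RS, no gawk extensions).
extract_li() {
    awk '
    { buf = buf $0 "\n" }                        # slurp the whole input into one string
    END {
        while ((start = index(buf, "<li>")) > 0) {
            buf = substr(buf, start + 4)         # drop everything up to and including <li>
            stop = index(buf, "</li>")           # first closing tag, i.e. the shortest match
            if (stop == 0) break                 # unterminated item: stop
            item = substr(buf, 1, stop - 1)
            gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "", item)   # trim surrounding white space
            print item
            buf = substr(buf, stop + 5)          # continue after </li>
        }
    }' "$@"
}

# Usage: extract_li file.html
```

Because the whole file is held in memory and re-sliced for each item, this is meant for small inputs such as the list in the question.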