Awk multiline non-greedy matching workaround

Question

I'm trying to extract the contents of an HTML list using awk. Some list entries are multi-line.

Example input list:

<ul>
    <li>
        <b>2021-07-21:</b> Lorem ipsum 
    </li>
    <li>
        <b>2021-07-19:</b> Lorem ipsum 
    </li>
    <li><b>2021-07-10:</b> Lorem ipsum</li>
</ul>

Command I'm using:

awk -v RS="" '{match($0, /<li>(.+)<\/li>/, entry); print entry[1]}' file.html

Current output:

        <b>2021-07-21:</b> Lorem ipsum 
    </li>
    <li>
        <b>2021-07-19:</b> Lorem ipsum 
    </li>
    <li><b>2021-07-10:</b> Lorem ipsum

Desired output:

        <b>2021-07-21:</b> Lorem ipsum 
        <b>2021-07-19:</b> Lorem ipsum 
    <b>2021-07-10:</b> Lorem ipsum

I know the issue is that the list entries are not separated by empty lines, so with RS="" the whole list is read as a single record and the greedy match runs through to the last </li>. I thought of using non-greedy matching, but apparently awk doesn't support it. Is there a workaround?

Ed Morton · Accepted Answer

With GNU awk for multi-char RS and \s shorthand for [[:space:]]:

$ awk -v RS='\s*</?li>\s*' '!(NR%2)' file
<b>2021-07-21:</b> Lorem ipsum
<b>2021-07-19:</b> Lorem ipsum
<b>2021-07-10:</b> Lorem ipsum

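The regex RS splits the input at every <li> or </li> tag, together with any whitespace around it, so the list item contents end up in the even-numbered records, which !(NR%2) prints.

I assume you either don't really want the leading whitespace shown in the desired output in your question, or you don't care whether it's present.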
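
If GNU awk is not available, the same shortest-match behaviour can probably be emulated with plain string functions. The following is only a sketch and not part of the original answer: it slurps the input into one string and repeatedly uses index() to find the next <li> and the nearest </li> after it. The names `input`, `extract_li` and `result` are made up for this demo, and it assumes list items are not nested.

```shell
# Sample input copied from the question.
input='<ul>
    <li>
        <b>2021-07-21:</b> Lorem ipsum 
    </li>
    <li>
        <b>2021-07-19:</b> Lorem ipsum 
    </li>
    <li><b>2021-07-10:</b> Lorem ipsum</li>
</ul>'

# extract_li (hypothetical helper): reads HTML on stdin, prints one list item per line.
extract_li() {
  awk '
    { buf = buf $0 "\n" }                       # slurp the whole input into one string
    END {
      while ((start = index(buf, "<li>")) > 0) {
        buf = substr(buf, start + 4)            # drop everything through "<li>"
        stop = index(buf, "</li>")              # nearest closing tag, i.e. the shortest match
        if (stop == 0) break                    # unterminated item: stop here
        item = substr(buf, 1, stop - 1)
        gsub(/^[ \t\n]+|[ \t\n]+$/, "", item)   # trim leading and trailing whitespace
        print item
        buf = substr(buf, stop + 5)             # continue after "</li>"
      }
    }
  '
}

result=$(printf '%s\n' "$input" | extract_li)
printf '%s\n' "$result"
```

Because index() always returns the first occurrence, each item ends at the closest </li>, which is what a non-greedy .+? would have given. Like the answer above, this trims the leading whitespace.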

RavinderSingh13 · Answer

With your shown samples, please try the following awk code. Written and tested in GNU awk.

awk -v RS='</li>' '
match($0,/<li>.*/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/<li>\n*[[:space:]]*|\n*[[:space:]]*$/,"",val)
  print val
}
' Input_file

Explanation: a line-by-line walkthrough of the code above.

awk -v RS='</li>' '              ##Start the awk program and set RS to </li>, so each record ends at a closing tag.
match($0,/<li>.*/){              ##Match from <li> to the end of the record.
  val=substr($0,RSTART,RLENGTH)  ##Store the matched text in val.
  gsub(/<li>\n*[[:space:]]*|\n*[[:space:]]*$/,"",val)  ##Remove <li> plus any newlines and whitespace after it, and any trailing newlines and whitespace, from val.
  print val                      ##Print val.
}
' Input_file                     ##Name of the input file.
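
To see the match/substr/gsub steps in isolation, here is a small sketch that applies them to a single record, passed in through the environment so it does not depend on how a given awk treats a multi-character RS. The variable names `record` and `REC` are made up for this demo, and `[ \t\n]` stands in for `\n*[[:space:]]*` on the assumption that only spaces, tabs and newlines surround the content.

```shell
# One record as awk would see it with RS='</li>' (text taken from the sample input).
record='
    <li>
        <b>2021-07-19:</b> Lorem ipsum 
    '

val=$(REC="$record" awk 'BEGIN {
  rec = ENVIRON["REC"]
  if (match(rec, /<li>.*/)) {                 # sets RSTART and RLENGTH
    val = substr(rec, RSTART, RLENGTH)        # "<li>" through the end of the record
    gsub(/<li>[ \t\n]*|[ \t\n]*$/, "", val)   # strip the tag and the surrounding whitespace
    print val
  }
}')
printf '%s\n' "$val"
```

Splitting on </li> first is what makes the greedy .* harmless here: each record contains at most one list item, so there is nothing further for the match to run into.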
