Awk multiline non-greedy matching workaround

Tags: html, awk

I'm trying to extract the contents of an HTML list using awk. Some list entries are multi-line.

Example input list:

<ul>
    <li>
        <b>2021-07-21:</b> Lorem ipsum 
    </li>
    <li>
        <b>2021-07-19:</b> Lorem ipsum 
    </li>
    <li><b>2021-07-10:</b> Lorem ipsum</li>
</ul>

Command I'm using:

awk -v RS="" '{match($0, /<li>(.+)<\/li>/, entry); print entry[1]}' file.html

Current output:

        <b>2021-07-21:</b> Lorem ipsum 
    </li>
    <li>
        <b>2021-07-19:</b> Lorem ipsum 
    </li>
    <li><b>2021-07-10:</b> Lorem ipsum

Desired output:

        <b>2021-07-21:</b> Lorem ipsum 
        <b>2021-07-19:</b> Lorem ipsum 
    <b>2021-07-10:</b> Lorem ipsum

I know the problem is that the list entries aren't separated by empty lines, so with RS="" the whole list is one record and the greedy .+ matches through to the last </li>. I thought of using non-greedy matching, but awk apparently doesn't support it. Is there a workaround?

S9oXavyF asked Dec 02 '22 09:12


2 Answers

With GNU awk for multi-char RS and \s shorthand for [[:space:]]:

$ awk -v RS='\\s*</?li>\\s*' '!(NR%2)' file
<b>2021-07-21:</b> Lorem ipsum
<b>2021-07-19:</b> Lorem ipsum
<b>2021-07-10:</b> Lorem ipsum

I assume you either don't actually want the leading whitespace shown in the desired output in your question, or don't care whether it's present.
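To see why `!(NR%2)` selects the entries, it can help to print every record with its number. The following is a minimal sketch, assuming the sample input from the question; the temporary file and the `entries` variable are illustrative names, and `[ \t\n]*` is used in place of `\s*` so the separator does not depend on the GNU-only shorthand. A regex RS is still required.

```shell
# Sample input from the question, written to a temporary file (illustrative).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<ul>
    <li>
        <b>2021-07-21:</b> Lorem ipsum
    </li>
    <li>
        <b>2021-07-19:</b> Lorem ipsum
    </li>
    <li><b>2021-07-10:</b> Lorem ipsum</li>
</ul>
EOF

# Print each record with its number to inspect how the input is split.
awk -v RS='[ \t\n]*</?li>[ \t\n]*' '{ printf "%d: [%s]\n", NR, $0 }' "$tmp"

# Keep only the even-numbered records, as the one-liner above does.
entries=$(awk -v RS='[ \t\n]*</?li>[ \t\n]*' '!(NR%2)' "$tmp")
printf '%s\n' "$entries"

rm -f "$tmp"
```

Every `<li>` and `</li>` acts as a separator, so the text before an opening tag lands in an odd-numbered record and the text between `<li>` and `</li>` lands in an even-numbered one. This relies on the tags strictly alternating, as they do in the sample; nested lists or `<li>` tags with attributes would need a different approach.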

Ed Morton answered Dec 26 '22 11:12


With the samples shown, please try the following awk code, written and tested in GNU awk.

awk -v RS='</li>' '
match($0,/<li>.*/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/<li>\n*[[:space:]]*|\n*[[:space:]]*$/,"",val)
  print val
}
' Input_file

Explanation: a commented version of the code above.

awk -v RS='</li>' '              ##Start the awk program with RS set to </li>, so each record ends at a closing tag.
match($0,/<li>.*/){              ##Match from <li> to the end of the record (in awk, . also matches newlines).
  val=substr($0,RSTART,RLENGTH)  ##Store the matched text in val.
  gsub(/<li>\n*[[:space:]]*|\n*[[:space:]]*$/,"",val)  ##Delete <li> plus any newlines and whitespace after it, and any trailing newlines and whitespace, from val.
  print val                      ##Print val.
}
' Input_file                     ##Name of the input file.
RavinderSingh13 answered Dec 26 '22 10:12
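
Both answers above rely on a multi-character RS. Where that is not available, non-greedy matching can be emulated with index(), which always returns the first occurrence of a string. The following is a minimal sketch under that assumption, not taken from either answer; the names `buf`, `otag`, `ctag`, and `items` are illustrative, and it assumes plain `<li>` tags without attributes or nesting.

```shell
# Sample input from the question, written to a temporary file (illustrative).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<ul>
    <li>
        <b>2021-07-21:</b> Lorem ipsum
    </li>
    <li>
        <b>2021-07-19:</b> Lorem ipsum
    </li>
    <li><b>2021-07-10:</b> Lorem ipsum</li>
</ul>
EOF

items=$(awk '
{ buf = buf $0 "\n" }                        # collect the whole file in buf
END {
  otag = "<li>"; ctag = "</li>"
  while ((s = index(buf, otag)) > 0) {       # first opening tag still in buf
    buf = substr(buf, s + length(otag))      # drop everything up to and including it
    e = index(buf, ctag)                     # nearest closing tag: the non-greedy part
    if (e == 0) break                        # unbalanced input, stop
    item = substr(buf, 1, e - 1)
    gsub(/^[ \t\n]+|[ \t\n]+$/, "", item)    # trim surrounding whitespace
    print item
    buf = substr(buf, e + length(ctag))      # continue after the closing tag
  }
}' "$tmp")
printf '%s\n' "$items"

rm -f "$tmp"
```

Because index() stops at the nearest `</li>`, each entry is cut at its own closing tag rather than at the last one in the file, which is the behaviour the greedy `.+` in the question could not provide.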