This question occurs repeatedly in many forms with many different multi-character delimiters and so IMHO is worth a canonical answer.
Given an input file like:
<foo> .. 1 <foo> .. a<2 .. </foo> .. </foo> <foo> .. @{<>}@ <foo> .. 4 .. </foo> .. </foo> <foo> .. 5 .. </foo>
how do you extract the text between the nested start (<foo>
) and end (</foo>
) delimiters using non-greedy matching with awk?
Desired output (in any order) is:
<foo> .. a<2 .. </foo>
<foo> .. 1 .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@ .. </foo>
<foo> .. 5 .. </foo>
Note that start or end could be any multi-character string and the text between them could be anything except those strings, including characters which are part of those strings such as the <
or >
characters in this example.
The main challenge is that since awk only supports greedy matching you can't write any variation of <foo>.*</foo>
that will stop at the first </foo>
on the line instead of the last </foo>
. The solution is to convert each start and end string into a single character that cannot appear in the input so you can write x[^xy]*y
where x and y are those start/end characters but how do you choose a character that can't appear in the input? You don't - you make one:
$ cat nonGreedy.awk
{
$0 = encode($0)
while ( match($0,/({[^{}]*})/) ) {
print decode(substr($0,RSTART,RLENGTH))
$0 = substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
}
}
function encode(str) {
gsub(/@/,"@A",str)
gsub(/{/,"@B",str); gsub(/}/,"@C",str)
gsub(/<foo>/,"{",str); gsub(/<\/foo>/,"}",str)
return str
}
function decode(str) {
gsub(/}/,"</foo>",str); gsub(/{/,"<foo>",str)
gsub(/@C/,"}",str); gsub(/@B/,"{",str)
gsub(/@A/,"@",str)
return str
}
$ awk -f nonGreedy.awk file
<foo> .. a<2 .. </foo>
<foo> .. 1 .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@ .. </foo>
<foo> .. 5 .. </foo>
The above works by you picking any character that can't appear JUST IN THE START/END STRINGS (note it doesn't have to be a character that can't appear in the input at all, just not in those strings), in this case I'm choosing @
, and appending an A
after each occurrence of it in the input. At this point every occurrence of @A
represents an @
character and there are guaranteed to be no occurrences of @B
or @
followed by anything else anywhere in the input.
Now we can pick 2 other characters that we want to use to represent the start/end strings, in this case I'm choosing {
and }
, and convert them to some @
-prefixed strings like @B
and @C
and at this point every occurrence of @B
represents a {
character and @C
represents a }
character and there are no {
s or }
s anywhere in the input.
Now all that's left to do to find the strings we want to extract is convert every start string <foo>
to the start character we've chosen, {
, and every end string </foo>
to the end character }
and then we can use a simple regexp of {[^{}]*}
to represent a non-greedy version of <foo>.*</foo>
.
As we find each string we just unwind the conversions we did above in reverse order (note you must unwind the substitutions to each matched string in exactly the reverse order you applied them to the whole record) so {
goes back to <foo>
and @B
goes back to {
, and @A
goes back to @
, etc. and we have the original text for that string.
The above will work in any awk. If your start/end strings contain RE metacharacters then you'd have to escape those or use a while(index(substr()))
loop instead of gsub()
to replace them.
Note that if you do use gawk and the labels aren't nested then you can keep the 2 functions exactly as above and change the rest of the script to just:
BEGIN { FPAT="{[^{}]*}" }
{
$0 = encode($0)
for (i=1; i<=NF; i++) {
print decode($i)
}
}
Obviously you don't really need to put the encode/decode functionality in separate functions, I just separated that out here to make that functionality explicit and separate from the loop that uses it for clarity.
For another example of when/how to apply the above approach, see https://stackoverflow.com/a/40540160/1745001.
My (current version of) solution approaches the problem from the front so the output is not exactly the same:
<foo> .. 1 # second
<foo> .. a<2 .. </foo> .. # first in my approach
</foo>
<foo> .. @{<>}@ # fourth
<foo> .. 4 .. </foo> .. # third
</foo>
<foo> .. 5 .. </foo> # fifth
if the program would traverse the arrays arr
and seps
backwards, the output would be the same (probably), but I just ran out of time temporarily.
In Gnu awk (for using split
with four params to parse the data).
EDIT For compatibility with others than Gnu awk I added function gsplit()
which is a crude Gnu awk split
replacement.
$ cat program.awk
{ data=data $0 } # append all records to one var
END {
n=gsplit(data, arr, "</?foo>", seps) # split by every tag
for(i=1;i<=n;i++) { # atm iterate arrays from front to back
if(seps[i]=="<foo>") # if element opening tag
stack[++j]=seps[i] arr[i+1] # store tag ang wait for closing tag
else {
stack[j]=stack[j] (seps[i]==prev ? arr[i] : "")
print stack[j--] seps[i]
}
prev = seps[i]
}
}
# elementary gnu awk split compatible replacement
function gsplit(str, arr, pat, seps, i) {
delete arr; delete seps; i=0
while(match(str, pat)) {
arr[++i]=substr(str,1,(RSTART-1))
seps[i]=substr(str,RSTART,RLENGTH)
str=substr(str,(RSTART+RLENGTH))
}
arr[++i]=substr(str,1)
return i
}
Run it:
$ awk -f program.awk file
<foo> .. a<2 .. </foo>
<foo> .. 1 .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@ .. </foo>
<foo> .. 5 .. </foo>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With