How to set a multicharacter record separator RS in GNU awk so it encompasses the new lines?

Tags: regex, awk

I am using GNU Awk 4.1.3. I want to process this file:

$$$$
1
1
$$$$
2
2
$$$$
3
3
$$$$
1
clave
2
$$$$
5
5
$$$$

And print the block of lines that go between "$$$$" and the next "$$$$" when that given block contains the text "clave" in it. That is, with the given example I want this output:

1
clave
2

My solution is to set the record separator RS to the string "$$$$". Since $ is a regex metacharacter, I need to escape it, so it ends up being RS='\\$\\$\\$\\$':

awk -v RS='\\$\\$\\$\\$' '/clave/' file

The problem with this is that the result contains a new line before and after the block:

$ awk -v RS='\\$\\$\\$\\$' '/clave/' file

1
clave
2

This is because there is a new line between the end of "$$$$" and "1", and there is also a new line between "2" and the next "$$$$".

To avoid this, I am adding a newline on both ends of the record separator, so it becomes RS='\n\\$\\$\\$\\$\n'. It works well:

$ awk -v RS='\n\\$\\$\\$\\$\n' '/clave/' file
#            ^^^           ^^
1
clave
2

However, this becomes quite complex and I am wondering if including the new line in the record separator may have some side effects that I am not aware of.

For this, I wonder: how can I set the record separator so it encompasses the new lines? Is my approach valid or should I go for other options because my approach has some drawbacks?

fedorqui 'SO stop harming' asked Nov 18 '20 09:11


2 Answers

You should match on the newline before and after the four $s, because THAT is the real separator (a string of four $s on a line of its own); anything else could fail if four $s appeared in your data. The first string of $s won't have a newline before it, of course; it matches the start-of-string anchor (^) instead, so you need to use:

$ awk -v RS='(^|\n)[$]{4}\n' '/clave/' file
1
clave
2

I find [$] easier to read than \\$, YMMV.

Ed Morton answered Oct 19 '22 14:10


You are getting a newline before and after because there is a newline before and after $$$$ in your file, and by setting RS to $$$$ you are leaving those line breaks in the record.

Change your RS to match either the start of the input or a newline before the $s, and either a newline or the end of the input after them, so that each record comes out without those line breaks:

awk -v RS='(^|\n)\\${4}(\n|$)' '/clave/' file

1
clave
2

Also note that you can use the fixed-length quantifier \\${4} instead of \\$\\$\\$\\$.

anubhava answered Oct 19 '22 13:10