
gawk FS to split record into individual characters

Tags:

awk

gawk

If the field separator is the empty string, each character becomes a separate field

$ echo hello | awk -F '' -v OFS=, '{$1 = NF OFS $1} 1'
5,h,e,l,l,o

However, if FS is a regex that can possibly match zero times, the same behaviour does not occur:

$ echo hello | awk -F ' *' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello

Anyone know why that is? I could not find anything in the gawk manual. Is FS="" just a special case?

I'm most interested in understanding why the 2nd case does not split the record into more fields. It's as if awk is treating FS=" *" like FS=" +"

glenn jackman asked Feb 26 '14 14:02




1 Answer

Interesting question!

I just pulled the gawk 4.1.0 source, and I think the answer is in the file field.c.

line 371:
 * re_parse_field --- parse fields using a regexp.
 *
 * This is called both from get_field() and from do_split()
 * via (*parse_field)().  This variation is for when FS is a regular
 * expression -- either user-defined or because RS=="" and FS==" "
 */
static long
re_parse_field(lo...

and at line 425:

if (REEND(rp, scan) == RESTART(rp, scan)) {   /* null match */

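This is the case that the <space>* in your question hits. When the regex matches the empty string (a null match), the implementation does not increment nf, so that position is not treated as a separator. Since "hello" contains no spaces, every match is a null match and the whole line stays one single field. Note that do_split() uses this function too.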
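A minimal sketch of what that means in practice, assuming gawk is installed under the name gawk (other awk implementations may split differently). count_fields is a hypothetical helper, not anything from gawk itself; it just wraps the one-liner from the question so different FS values can be compared:

```shell
# Hypothetical helper: print NF followed by the fields, joined with commas.
#   $1 = value to use for FS, $2 = input record
count_fields() {
    printf '%s\n' "$2" | gawk -F "$1" -v OFS=, '{$1 = NF OFS $1} 1'
}

count_fields ' *' 'hello'    # prints 1,hello   (only null matches, so no split)
count_fields ' *' 'a b  c'   # prints 3,a,b,c   (real runs of spaces still split)
count_fields ' +' 'a b  c'   # prints 3,a,b,c   (same result as ' *' here)
```

So for these inputs FS=" *" ends up behaving like FS=" +", which is what the question observed.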

As for FS being the null string: gawk puts each character into its own field. The gawk documentation states this clearly, and the code shows it as well:

line 613:
 * null_parse_field --- each character is a separate field
 *
 * This is called both from get_field() and from do_split()
 * via (*parse_field)().  This variation is for when FS is the null string.
 */
static long
null_parse_field(long up_to,

If FS is a single character (other than space), awk does not treat it as a regex. The documentation mentions this too, and so does the code:

line 667:
 * sc_parse_field --- single character field separator
 *
 * This is called both from get_field() and from do_split()
 * via (*parse_field)().  This variation is for when FS is a single character
 * other than space.
 */
static long
sc_parse_field(l

If we read that function, there is no regex match handling in it at all.
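A quick way to see the single-character case, again assuming gawk. nf_with_fs is a hypothetical helper name used only for this illustration. A lone "." or "|" would be a metacharacter in a regex, but as a one-character FS it is taken literally:

```shell
# Hypothetical helper: print the number of fields for a given FS.
#   $1 = value to use for FS, $2 = input record
nf_with_fs() {
    printf '%s\n' "$2" | gawk -F "$1" '{print NF}'
}

nf_with_fs '.'   'a.b.c'   # prints 3: "." is a literal dot, not "any character"
nf_with_fs '|'   'a|b|c'   # prints 3: "|" is a literal bar, not alternation
nf_with_fs '[.]' 'a.b.c'   # prints 3: the equivalent, explicit regex form
```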

The comments on re_parse_field() and sc_parse_field() say that do_split() invokes them too. That explains why the following command prints 1 instead of 3:

kent$  echo "foo"|awk '{split($0,a,/ */);print length(a)}'
1
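For comparison, a small sketch (assuming gawk; split_count is a hypothetical helper) that runs split() with the same / */ regex on input that does contain a space, and with the null string as separator, which is a gawk extension that yields one element per character:

```shell
# Hypothetical helper: print how many elements split() produces.
#   $1 = separator as awk source text (e.g. '/ */' or '""'), $2 = input record
split_count() {
    printf '%s\n' "$2" | gawk "{print split(\$0, a, $1)}"
}

split_count '/ */' 'foo'   # prints 1: only null matches, nothing to split on
split_count '/ */' 'a b'   # prints 2: the real space is a separator
split_count '""'   'foo'   # prints 3: null-string separator, one element per character
```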

To keep this post short I did not paste the complete code; it can be found here:

http://git.savannah.gnu.org/cgit/gawk.git/

Kent answered Oct 02 '22 18:10