
In awk, how can I use a file containing multiple format strings with printf?

Tags: printf, awk

I have a case where I want to use input from a file as the format for printf() in awk. My formatting works when I set it in a string within the code, but it doesn't work when I load it from input.

Here's a tiny example of the problem:

$ # putting the format in a variable works just fine:
$ echo "" | awk -vs="hello:\t%s\n\tfoo" '{printf(s "bar\n", "world");}'
hello:  world
        foobar
$ # But getting the format from an input file does not.
$ echo "hello:\t%s\n\tfoo" | awk '{s=$0; printf(s "bar\n", "world");}'
hello:\tworld\n\tfoobar
$ 

So ... format substitutions work ("%s"), but not special characters like tab and newline. Any idea why this is happening? And is there a way to "do something" to input data to make it usable as a format string?

UPDATE #1:

As a further example, consider the following using a bash here-string:

[me@here ~]$ awk -vs="hello: %s\nworld: %s\n" '{printf(s, "foo", "bar");}' <<<""
hello: foo
world: bar
[me@here ~]$ awk '{s=$0; printf(s, "foo", "bar");}' <<<"hello: %s\nworld: %s\n"
hello: foo\nworld: bar\n[me@here ~]$

As far as I can see, the same thing happens with multiple different awk interpreters, and I haven't been able to locate any documentation that explains why.

UPDATE #2:

The code I'm trying to replace currently looks something like this, with nested loops in shell. At present, awk is only being used for its printf, and could be replaced with a shell-based printf:

#!/bin/sh

while read -r fmtid fmt; do
  while read cid name addy; do
    awk -vfmt="$fmt" -vcid="$cid" -vname="$name" -vaddy="$addy" \
      'BEGIN{printf(fmt,cid,name,addy)}' > /path/$fmtid/$cid
  done < /path/to/sampledata
done < /path/to/fmtstrings

Example input would be:

## fmtstrings:
1 ID:%04d Name:%s\nAddress: %s\n\n
2 CustomerID:\t%-4d\t\tName: %s\n\t\t\t\tAddress: %s\n
3 Customer: %d / %s (%s)\n

## sampledata:
5 Companyname 123 Somewhere Street
12 Othercompany 234 Elsewhere

My hope was that I'd be able to construct something like this to do the entire thing with a single call to awk, instead of having nested loops in shell:

awk '

  NR==FNR { fmts[$1]=$2; next; }

  {
    for(fmtid in fmts) {
      outputfile=sprintf("/path/%d/%d", fmtid, custid);
      printf(fmts[fmtid], $1, $2) > outputfile;
    }
  }

' /path/to/fmtstrings /path/to/sampledata

Obviously, this doesn't work, both because of the actual topic of this question and because I haven't yet figured out how to elegantly make awk join $2..$n into a single variable. (But that's the topic of a possible future question.)

FWIW, I'm using FreeBSD 9.2 with its built-in awk, but I'm open to using gawk if a solution can be found with that.

Graham asked Jul 04 '14 13:07



2 Answers

Why so lengthy and complicated an example? This demonstrates the problem:

$ echo "" | awk '{s="a\t%s"; printf s"\n","b"}'
a       b

$ echo "a\t%s" | awk '{s=$0; printf s"\n","b"}'
a\tb

In the first case, the string "a\t%s" is a string literal and so is interpreted twice - once when the script is read by awk and then again when it is executed, so the \t is expanded on the first pass and then at execution awk has a literal tab char in the formatting string.

In the second case awk still has the characters backslash and t in the formatting string - hence the different behavior.
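The difference can be made visible by comparing string lengths. This is only a small illustrative sketch (the variable names literal_len and input_len are made up here): the literal ends up as four characters because \t has already become one tab, while the input line keeps backslash and t as two separate characters.

```shell
# String literal inside the awk program: the lexer turns \t into one TAB.
literal_len=$(awk 'BEGIN { s = "a\t%s"; print length(s) }')

# The same characters arriving as input data: backslash and t stay separate.
# printf '%s\n' is used instead of echo so the shell does not touch the backslash.
input_len=$(printf '%s\n' 'a\t%s' | awk '{ print length($0) }')

echo "literal: $literal_len, input: $input_len"
# prints: literal: 4, input: 5
```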

You need something to interpret those escaped chars and one way to do that is to call the shell's printf and read the results (corrected per @EtanReiser's excellent observation that I was using double quotes where I should have had single quotes, implemented here by \047, to avoid shell expansion):

$ echo 'a\t%s' | awk '{"printf \047" $0 "\047 " "b" | getline s; print s}'
a       b

If you don't need the result in a variable, you can just call system().

If you just wanted the escape chars expanded so you don't need to provide the %s args in the shell printf call, you'd just need to escape all the %s (watching out for already-escaped %s).

You could call awk instead of the shell printf if you prefer.

Note that this approach, while clumsy, is much safer than calling an eval which might just execute an input line like rm -rf /*.*!
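A rough sketch of the system() variant mentioned above, assuming the format needs exactly one argument (hard-coded as b, as in the getline example). The function name expand_with_system is hypothetical. Because the input line is pasted between single quotes in a shell command, this sketch skips any line that itself contains a single quote; that guard is an added assumption, not part of the approach described above.

```shell
expand_with_system() {
    printf '%s\n' "$1" | awk '
    # A single quote in the data would break out of the quoting below, so refuse it.
    index($0, "\047") { print "skipping line containing a single quote" > "/dev/stderr"; next }
    # Builds and runs:  printf '\''<line>\n'\'' b   so the shell printf expands \t, \n, etc.
    { system("printf \047" $0 "\\n\047 b") }'
}

expand_with_system 'a\t%s'
```

The output goes straight to standard output rather than into an awk variable, which is the trade-off against the getline form.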

With help from Arnold Robbins (the maintainer of gawk) and Manuel Collado (another noted awk expert), here is a script which will expand single-character escape sequences. It relies on gawk's four-argument split(), which stores the matched separators in escs:

$ cat tst2.awk
function expandEscapes(old,     segs, segNr, escs, idx, new) {
    split(old,segs,/\\./,escs)
    for (segNr=1; segNr in segs; segNr++) {
        if ( idx = index( "abfnrtv", substr(escs[segNr],2,1) ) )
            escs[segNr] = substr("\a\b\f\n\r\t\v", idx, 1)
        new = new segs[segNr] escs[segNr]
    }
    return new
}

{
    s = expandEscapes($0)
    printf s, "foo", "bar"
}


$ awk -f tst2.awk <<<"hello: %s\nworld: %s\n"
hello: foo
world: bar

Alternatively, this should be functionally equivalent but not gawk-specific:

function expandEscapes(tail,   head, esc, idx) {
    head = ""
    while ( match(tail, /\\./) ) {
        esc  = substr( tail, RSTART + 1, 1 )
        head = head substr( tail, 1, RSTART-1 )
        tail = substr( tail, RSTART + 2 )
        idx  = index( "abfnrtv", esc )
        if ( idx )
             esc = substr( "\a\b\f\n\r\t\v", idx, 1 )
        head = head esc
    }

    return (head tail)
} 

If you care to, you can expand the concept to octal and hex escape sequences by changing the split() RE to

/\\(x[0-9a-fA-F]*|[0-7]{1,3}|.)/

and, for a hex value following the backslash (note that strtonum() is a gawk extension):

c = sprintf("%c", strtonum("0x" rest_of_str))

and for an octal value:

c = sprintf("%c", strtonum("0" rest_of_str))
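For awks without strtonum(), the same idea can be sketched portably. The following is one possible way to put the pieces together, not a tested drop-in: the wrapper name expand and the helper toNum are made up for this example, hex escapes are limited to two digits by assumption, and the interval expression {1,3} is spelled out as optional characters because some older awks do not support intervals.

```shell
expand() {
    printf '%s\n' "$1" | awk '
    # Convert a digit string in base 8 or 16 to a number without strtonum().
    function toNum(digits, base,    i, n) {
        n = 0
        for (i = 1; i <= length(digits); i++)
            n = n * base + index("0123456789abcdef", tolower(substr(digits, i, 1))) - 1
        return n
    }
    function expandEscapes(tail,    head, esc, idx) {
        head = ""
        # Leftmost-longest matching picks \x41 or \101 over the one-character fallback.
        while (match(tail, /\\(x[0-9a-fA-F][0-9a-fA-F]?|[0-7][0-7]?[0-7]?|.)/)) {
            head = head substr(tail, 1, RSTART - 1)
            esc  = substr(tail, RSTART + 1, RLENGTH - 1)
            tail = substr(tail, RSTART + RLENGTH)
            if (esc ~ /^x./)                      # hex: \x followed by 1-2 digits
                esc = sprintf("%c", toNum(substr(esc, 2), 16))
            else if (esc ~ /^[0-7]/)              # octal: 1-3 digits
                esc = sprintf("%c", toNum(esc, 8))
            else if ((idx = index("abfnrtv", esc)) > 0)
                esc = substr("\a\b\f\n\r\t\v", idx, 1)
            head = head esc                       # unknown escapes keep the bare character
        }
        return (head tail)
    }
    { print expandEscapes($0) }'
}

expand 'A\x42\103'
# prints: ABC
```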

Ed Morton answered Oct 12 '22 11:10


Since the question explicitly asks for an awk solution, here's one which works on all the awks I know of. It's a proof-of-concept; error handling is abysmal. I've tried to indicate places where that could be improved.

The key, as has been noted by various commentators, is that awk's printf -- like the C standard function it is based on -- does not interpret backslash-escapes in the format string. However, awk does interpret them in command-line assignment arguments.

awk 'BEGIN  {if(ARGC!=3)exit(1);
             fn=ARGV[2];ARGC=2}
     NR==FNR{ARGV[ARGC++]="fmt="substr($0,length($1)+2);
             ARGV[ARGC++]="fmtid="$1;
             ARGV[ARGC++]=fn;
             next}
     {match($0,/^ *[^ ]+[ ]+[^ ]+[ ]+/);
      printf fmt,$1,$2,substr($0,RLENGTH+1) > ("data/"fmtid"/"$1)
     }' fmtfile sampledata

What's going on here is that the NR==FNR clause (which executes only on the first file) adds the values (fmtid, fmt) from each line of the first file as command-line assignments, and then inserts the data file name as a command-line argument. In awk, assignments given as command-line arguments are simply executed as though they were assignments from a string constant with implicit quotes, including backslash-escape processing (except that if the last character in the argument is a backslash, it doesn't escape the implicit closing double-quote). This behaviour is mandated by POSIX, as is the order in which arguments are processed, which makes it possible to add arguments as you go.
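The escape processing of assignment operands can be seen in isolation with a one-liner. This is just a minimal demonstration (the function name demo_assignment_operand is made up); it does not append to ARGV dynamically the way the script above does, it only shows that fmt arrives with a real tab and newline in it.

```shell
demo_assignment_operand() {
    # awk processes the operand fmt=... when it reaches it in the argument list,
    # i.e. before it starts reading "-" (standard input), and expands \t and \n.
    printf '\n' | awk '{ printf fmt, "world" }' 'fmt=hello:\t%s\n' -
}

demo_assignment_operand
```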

As written, the script must be provided with exactly two arguments: the formats and the data (in that order). There is some room for improvement, obviously.

The snippet also shows two ways of concatenating trailing fields.

In the format file, I assume that the lines are well behaved (no leading spaces; exactly one space after the format id). With those constraints, substr($0, length($1)+2) is precisely the part of the line after the first field and a single space.

Processing the datafile, it may be necessary to do this with fewer constraints. First, the builtin match function is called with the regular expression /^ *[^ ]+[ ]+[^ ]+[ ]+/ which matches leading spaces (if any) and two space-separated fields, along with the following spaces. (It would be better to allow tabs, as well.) Once the regex matches (and matching shouldn't be assumed, so there's another thing to fix), the variables RSTART and RLENGTH are set, so substr($0, RLENGTH+1) picks up everything starting with the third field. (Again, this is all POSIX-standard behaviour.)
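As a sketch of the two fixes hinted at above (allowing tabs and not assuming a match), the trailing-field extraction might look like this. The wrapper name rest_of_line is hypothetical, and printing an empty line when there is no third field is just one possible choice.

```shell
rest_of_line() {
    printf '%s\n' "$1" | awk '{
        # Skip optional leading blanks, two fields, and the blanks after them.
        if (match($0, /^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+/))
            print substr($0, RLENGTH + 1)   # everything from the third field on
        else
            print ""                        # fewer than three fields: nothing to join
    }'
}

rest_of_line '5 Companyname 123 Somewhere Street'
# prints: 123 Somewhere Street
```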

Honestly, I'd use the shell printf for this problem, and I don't understand why you feel that solution is somehow sub-optimal. The shell printf interprets backslash escapes in formats, and the shell read -r will do the line splitting the way you want. So there's no reason for awk at all, as far as I can see.
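For comparison, a shell-only version of the loop from the question might look roughly like this. It is a sketch under assumptions: the function name render_all and its three parameters are invented here, the output directories are created on demand, and no format string is expected to start with a dash.

```shell
# Hypothetical helper: render_all FMTFILE DATAFILE OUTDIR
render_all() {
    fmtfile=$1 datafile=$2 outdir=$3
    while read -r fmtid fmt; do          # -r keeps the backslashes in the format
        mkdir -p "$outdir/$fmtid"
        while read -r cid name addy; do  # addy receives the rest of the line
            # $fmt is deliberately the format operand, so printf expands \t, \n and %-specs.
            printf "$fmt" "$cid" "$name" "$addy" > "$outdir/$fmtid/$cid"
        done < "$datafile"
    done < "$fmtfile"
}
```

Called as render_all /path/to/fmtstrings /path/to/sampledata /path, it writes one file per format and customer, as the original nested loops do, without starting awk for every record.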

rici answered Oct 12 '22 13:10