Stumbled across a certain limit (?) when passing data through "cmd | getline" (with both macOS awk 20200816 and GNU Awk 5.3.1).
Reproducible examples:
file.txt is a text file filled with "lorem ipsum" plus one <title>test</title> string for testing purposes, with no newlines. Another program can be used instead of xmllint, for example grep -o \047^[^ ]*\047; the only difference is that with grep the largest "working" size is 30 bytes bigger, 1 044 977 bytes.
gawk '{ cmd = ("xmllint --html --xpath \047//title\047 - 2>/dev/null <<< \047"$0"\047")
cmd | getline word
print word
close(cmd)
}' < file.txt
This prints the "word" value (i.e., <title>test</title>) if the file contains 1 044 947 characters or fewer; if it contains more, it prints nothing.
gawk '{ cmd = ("xmllint --html --xpath \047//title\047 - 2>/dev/null")
print $0 | cmd
}' < file-x2.txt
This prints <title>test</title> even if the file's content is twice as big. That would be OK, but I need the result held in a variable, not printed to the terminal straight away.
What is this limit? Is it imposed by gawk, shell, OS?
The limit does not come from the |, but from passing your whole $0 on cmd's command line rather than through its stdin (you did it on purpose, as a workaround, because awk offers either print $0 | cmd or cmd | getline word, but no double pipe such as print $0 | cmd | getline word).
Your OS theoretically defines a limit on the maximum length that you can pass as arguments to a command (you can see this limit with getconf ARG_MAX, which should show you this 1 MB limit);
however, the shells (bash on my FreeBSD, but probably zsh too on your macOS) do not impose this restriction when using the <<< operator to convert data passed on the command line to stdin for the forked process; you can verify it by running xmllint --html --xpath '//title' - 2>/dev/null <<< "`cat file.txt`": you won't encounter this limit.
However, before reaching the shell (where this special rule of "the string after <<< looks like an argument but is in fact stdin" is implemented),
awk has to pass this string uninterpreted through popen(), where ARG_MAX is enforced.
Rather than truncating a command that exceeds that limit, popen() still returns a valid FILE * (so awk cannot detect that there was a problem), but the oversized command is not run at all, so nothing comes back from your xmllint.
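Since popen() gives awk no error to act on, one way to at least notice the failure is the return value of getline, which is positive when a record was read, 0 at end of input and -1 on error. A minimal sketch, assuming a hypothetical helper name run_and_capture, with echo and true standing in for the real command:

```shell
# Hypothetical helper: run a command from awk and report when it yields no output.
run_and_capture() {
    awk -v cmd="$1" 'BEGIN {
        # getline returns 1 if a record was read, 0 at end of input, -1 on error
        if ((cmd | getline word) > 0)
            print word
        else
            print "no output from command" > "/dev/stderr"
        close(cmd)
    }'
}
```

For example, run_and_capture "echo hello" prints hello, while run_and_capture true writes the diagnostic to stderr. This only tells you that nothing was read; it cannot say whether the command ran and stayed silent or never ran.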
To circumvent this limit, I'd say you could print $0 to a temporary file (or a named pipe), then have xmllint read from this file or pipe.
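The temporary-file variant could look roughly like this. It is a sketch under assumptions: extract_title is a hypothetical wrapper name, the /tmp/awkrec.XXXXXX template is arbitrary, and grep stands in for xmllint (as in the question's grep variant) so the example has no extra dependency.

```shell
# Hypothetical wrapper: each record is written to a temporary file and the
# command reads that file on stdin, so the record never travels on a command line.
extract_title() {
    awk '{
        # Ask mktemp for a unique file name (the template is an arbitrary choice)
        mk = "mktemp /tmp/awkrec.XXXXXX"
        mk | getline tmp
        close(mk)

        # Write the record to the file, then close it so the data is flushed
        print $0 > tmp
        close(tmp)

        # Only the short file name appears on the command line now
        cmd = "grep -o \047<title>[^<]*</title>\047 < \047" tmp "\047"
        word = ""
        if ((cmd | getline word) > 0)
            print word
        close(cmd)

        # Remove the temporary file
        system("rm -f \047" tmp "\047")
    }' "$@"
}
```

Called as extract_title file-x2.txt, the result ends up in the awk variable word, which is what the question asked for. To use xmllint again, replace the grep command with xmllint --html --xpath \047//title\047 - 2>/dev/null and keep the < tmp redirection.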
popen()'s limits
You can compile a small C program to test invoking an echo through popen(), to emulate awk's internal workings:
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char ** argv)
{
    int bs, ts, n;
    char * cmd;
    char buf[0x10000];
    long argmax = sysconf(_SC_ARG_MAX);

    fprintf(stdout, "ARG_MAX = %ld\n", argmax);
    /* Start slightly above ARG_MAX, then shrink the command one byte per iteration. */
    argmax += 10;
    cmd = malloc(argmax);
    if(!cmd)
        return 1;
    strcpy(cmd, "echo ");
    memset(&cmd[strlen(cmd)], 'g', argmax - strlen(cmd) - 1);
    for(n = 0; ++n < 2000;)
    {
        cmd[argmax - n] = 0;
        FILE * f = popen(cmd, "r");
        ts = 0;
        /* Count the bytes that echo managed to write back. */
        while(f && (bs = fread(buf, 1, sizeof(buf), f)) > 0)
            ts += bs;
        fprintf(stdout, "%zu bytes cmd → %d bytes echo (popen returned %p)\n", strlen(cmd), ts, (void *)f);
        /* A stream opened by popen() must be closed with pclose(), not fclose(). */
        if(f)
            pclose(f);
    }
    free(cmd);
    return 0;
}
On my FreeBSD 14.3 system, I get:
ARG_MAX = 524288
524297 bytes cmd → 0 bytes echo (popen returned 0x8225f9c40)
524296 bytes cmd → 0 bytes echo (popen returned 0x8225f9c40)
…
523013 bytes cmd → 0 bytes echo (popen returned 0x8225f9c40)
523012 bytes cmd → 0 bytes echo (popen returned 0x8225f9c40)
523011 bytes cmd → 0 bytes echo (popen returned 0x8225f9c40)
523010 bytes cmd → 523006 bytes echo (popen returned 0x8225f9c40)
523009 bytes cmd → 523005 bytes echo (popen returned 0x8225f9c40)
523008 bytes cmd → 523004 bytes echo (popen returned 0x8225f9c40)
…
so I can pass up to ARG_MAX - 1278 bytes to my command.
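Before choosing the command-line approach, a rough pre-check of the data size against ARG_MAX can be sketched as follows. The helper name fits_in_arg_max and the default margin of 2048 bytes are assumptions for illustration; the real headroom varies by system (1278 bytes in the run above).

```shell
# Hypothetical helper: succeed only if FILE is smaller than ARG_MAX minus a margin.
# The default margin of 2048 bytes is an assumption, not a measured value.
fits_in_arg_max() {
    file=$1
    margin=${2:-2048}
    # Arithmetic expansion strips the padding some wc implementations print
    size=$(( $(wc -c < "$file") ))
    limit=$(getconf ARG_MAX)
    [ "$size" -lt $(( limit - margin )) ]
}
```

For example, fits_in_arg_max file.txt || echo "too big for a command line". A failing check means the <<< approach cannot work from awk; a passing check is only a hint, since other per-system limits may still apply.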