I'm trying to convert dates from one format to another, e.g. from "October 29, 2005" to 2005-10-29. I have a list of 625 dates, and I'm using Awk.
The conversion works most of the time. However, sometimes it doesn't happen at all, and the variable that should hold the converted date remains undefined.
This always happens with the exact same rows. Running `date' explicitly (from the Bash shell) on the dates in those rows works fine (the dates are properly converted), so it's not the textual contents of those rows that matter.
Why does this happen, and how can I fix my script?
Here it is:
awk 'BEGIN { FS = "unused" } {
x = "undefined";
"date \"+%Y-%m-%d\" -d " $1 | getline x ;
print $1 " = " x
}' uBXr0r15.txt \
> bug-out-3.txt
If you want to reproduce this problem:
Then you could run the script again, and (on my computer) bug-out-3.txt remains unchanged -- exactly the same dates are left undefined.
(Gawk version 3.1.6, Ubuntu 9.10.)
Kind regards, Magnus
The awk language has a special built-in command called getline that can be used to read input under your explicit control. The getline command is used in several different ways and should not be used by beginners.
In the typical awk program, all input is read either from the standard input (by default the keyboard, but often a pipe from another command) or from files whose names you specify on the awk command line. If you specify input files, awk reads them in order, reading all the data from one before going on to the next.
The name of the current input file is stored in the FILENAME variable, which you can use to display or print that name. If no files are specified on the command line, the value of FILENAME is “-” (stdin). However, FILENAME is undefined inside the BEGIN rule unless set by getline.
Whenever you open a pipe or file for reading or writing in awk, the latter will first check (using an internal hash) whether it already has a pipe or file with the same name (still) open; if so, it will reuse the existing file descriptor instead of reopening the pipe or file.
In your case, all entries which end up as undefined are actually duplicates; the first time that they are encountered (i.e. when the corresponding command date "..." -d "..." is first issued) the proper result is read into x. On subsequent occurrences of the same date, getline attempts to read a second, third etc. line from the original date pipe, even though the pipe has been closed by date, resulting in x no longer being assigned.
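To see the reuse in isolation, here is a minimal sketch that looks at the return value of getline (1 when a line was read, 0 at end of file). The command echo hello is just a stand-in I picked for illustration; it is not from your script.

```shell
#!/bin/sh
# "echo hello" is a hypothetical stand-in for the date command.
out=$(awk 'BEGIN {
    cmd = "echo hello"
    r1 = (cmd | getline a)   # first read: the command is started, one line is read
    r2 = (cmd | getline b)   # same command string: awk reuses the pipe, now at EOF
    close(cmd)               # forget the old pipe
    r3 = (cmd | getline c)   # the command is started again, one line is read
    print r1, r2, r3
}')
printf '%s\n' "$out"
```

This should print 1 0 1. When getline returns 0 it leaves its variable untouched, which is why x keeps the value "undefined" for the repeated dates.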
From the gawk man-page:
NOTE: If using a pipe, co-process, or socket to getline, or from print or printf within a loop, you must use close() to create new instances of the command or socket. AWK does not automatically close pipes, sockets, or co-processes when they return EOF.
You should explicitly close the pipe every time after you have read x:
close("date \"+%Y-%m-%d\" -d " $1)
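Put together, the script might look like the sketch below. The three-line sample file is made up for illustration (I am assuming each line of your file is a date in double quotes), and I store the command in a variable named cmd so that getline and close() are guaranteed to see the identical string. It also assumes GNU date, since -d is used.

```shell
#!/bin/sh
# Hypothetical sample input: one quoted date per line, with one duplicate.
sample=$(mktemp)
printf '%s\n' '"October 29, 2005"' '"March 3, 1999"' '"October 29, 2005"' > "$sample"

result=$(awk 'BEGIN { FS = "unused" } {
    x = "undefined"
    cmd = "date \"+%Y-%m-%d\" -d " $1   # build the command string once
    cmd | getline x                      # read the single line that date prints
    close(cmd)                           # so a repeated date starts a new process
    print $1 " = " x
}' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With the close() call in place, the duplicate on the third line should be converted like the first one instead of staying undefined.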
Incidentally, would it be OK to sort and uniq uBXr0r15.txt before piping into awk, or do you need the original ordering/duplication?
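If you do need the original ordering and duplication, one possible alternative (my own suggestion, not something from your script) is to remember each converted date in an awk array, so date runs only once per distinct input line. The array name cache and the sample file are hypothetical, and GNU date is assumed.

```shell
#!/bin/sh
# Hypothetical sample input with a duplicate, as in the question.
dates=$(mktemp)
printf '%s\n' '"October 29, 2005"' '"March 3, 1999"' '"October 29, 2005"' > "$dates"

cached=$(awk 'BEGIN { FS = "unused" } {
    if (!($1 in cache)) {                    # first time this date is seen
        cmd = "date \"+%Y-%m-%d\" -d " $1
        x = "undefined"
        cmd | getline x
        close(cmd)                           # still close the pipe
        cache[$1] = x                        # remember the converted date
    }
    print $1 " = " cache[$1]                 # duplicates come from the array
}' "$dates")

printf '%s\n' "$cached"
rm -f "$dates"
```

This keeps the input order and spawns fewer processes when there are many repeats.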
Though I love awk, it is not necessary for this.
tr -d '"' < uBXr0r15.txt | date +%Y-%m-%d -f -
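gawk 'BEGIN{
    # Map each month name to its zero-padded number, e.g. months["October"] = "10"
    m=split("January|February|March|April|May|June|July|August|September|October|November|December",d,"|")
    for(o=1;o<=m;o++){
        months[d[o]]=sprintf("%02d",o)
    }
    # Split on a single comma or space, so "October 29, 2005" gives
    # $1=month, $2=day, $3="" (between comma and space), $4=year
    FS="[, ]"
}
{
    # Strip the surrounding double quotes from the month and the year
    gsub(/["]/,"",$1)
    gsub(/["]/,"",$4)
    # mktime() takes "YYYY MM DD HH MM SS"
    t=mktime($4" "months[$1]" "$2" 0 0 0")
    print strftime("%Y-%m-%d",t)
}' uBXr0r15.txt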
Doing everything inside gawk will be faster than calling external commands.