I am an AWK newbie, using GNU utilities ported to Windows (UnxUtils) and gawk instead of awk. A solution on this forum worked like absolute magic, and I'm trying to find a source I can read to better understand the pattern expression offered in that solution.
In "Select unique or distinct values from a list in UNIX shell script", an answer by Dimitre Radoulov offers the following code
zsh-4.3.9[t]% awk '!_[$0]++' file
as a solution for selecting elements of a list with repeated and jumbled elements, listing each element only once.
I had previously used sort | uniq
to do this, which worked fine for small test files. For my actual problem (extracting the list of company symbols from archival order book research data from India's National Stock Exchange for 16 days in April 2006, with 129+ million records in multiple files), the sorting burden became too much. And uniq only eliminates adjacent duplicates.
Copying the above line for my Win-GNU gawk, I used
C:\Users\PAPERS\> cat ..\Full*_Symbols.txt | gawk "!_[$0]++" | wc -l
946
suggesting that the 129+ million records pertained to 946 different firms, which is a VERY reasonable answer. And it took under 5 minutes on my modest Windows machine, after hours of trying to SORT wore me out.
I looked at all the awk texts I have and searched a bit online. For part of the pattern the explanation of why it works is clear (! serves as NOT, $0 is the whole current record), but for the underscore _ I am not able to find any explanation, and I have seen ++ in examples only as "update the counter by 1."
Will be grateful for any appropriate text or web reference to understand this example fully, as I think it will help me in other related cases as well. Thanks. Best,
It is really very clever!
It creates an associative array (meaning the "index" can be anything, not just a number). The first time a line is seen, its element does not exist yet, so _[$0] evaluates to zero; the ! turns that into true, and awk performs the default action (which is to print the input line). The post-increment ++ then creates the element and sets it to 1. From then on _[$0] is non-zero, so if the same value is encountered again the expression is false and nothing is printed.
I think the underscore is just a "vanilla" variable name (you need a name for your array, and underscore is as valid as monkey but more "anonymous").
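To make those steps concrete, the one-liner can be written out long-hand. This is only a sketch: the function name first_seen and the variable previous are made up for illustration, the array is called seen rather than _ purely for readability, and the sample symbols are arbitrary.

```shell
# first_seen: hypothetical helper, a long-hand equivalent of awk '!_[$0]++'
first_seen() {
    awk '{
        previous = seen[$0]   # value before the increment: empty (false) on first sight
        seen[$0]++            # record that this line has now been seen
        if (!previous)        # true only for the first occurrence of the line
            print             # the default action the one-liner relies on
    }' "$@"
}

printf '%s\n' INFY TCS INFY SBIN TCS | first_seen
# prints INFY, TCS, SBIN (one per line, in first-seen order)
```

Because nothing is sorted, the output keeps the order in which each value first appeared, and only one array entry per distinct line is held in memory.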
A classic!
There is no explanation for the _
except that some people think it's clever to obfuscate their code by using an underscore character as the name of a variable, in this case an array. Like in C, variable names in awk can start with any letter or underscore but obviously the intent isn't to have them ONLY be an underscore - that's just ridiculous!
The more common and reasonable way to write that code is to name the array seen
or similar so you have some clue what it's for:
awk '!seen[$0]++'
The above introduces an array named seen indexed by the text of the current line. The first time a given string is tested, its element has the value zero; the post-increment then raises it to 1, and to 2, 3 and so on each time the same string appears again. The negation of that value is therefore true only for the first occurrence of a given string in the input, so that line is printed and subsequent occurrences are discarded.
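The counter values described above can be made visible with a small trace. This is an illustrative sketch, not part of the original answer: trace_seen and before are hypothetical names, and the input lines are arbitrary.

```shell
# trace_seen: hypothetical helper that shows the value tested by !seen[$0]++
trace_seen() {
    awk '{
        before = seen[$0] + 0   # force numeric: 0 when the key is new
        printf "%s before=%d printed=%s\n", $0, before, (before ? "no" : "yes")
        seen[$0]++              # same post-increment as in the one-liner
    }' "$@"
}

printf '%s\n' TCS INFY TCS TCS | trace_seen
# TCS before=0 printed=yes
# INFY before=0 printed=yes
# TCS before=1 printed=no
# TCS before=2 printed=no
```

The "before" column is the value that ! negates, which is why only the rows with before=0 would be printed by the one-liner.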