AWK: go through the file twice, doing different tasks

Tags:

awk

I am processing a fairly big collection of tweets and I'd like to obtain, for each tweet, its mentions (other users' names, prefixed with an @), but only when the mentioned user also appears in the file:

users = new Dictionary()
for each line in file:
   username = get_username(line)
   userid   = get_userid(line)
   users.add(key = username, value = userid)   # keyed by name, so mentions can be looked up below
for each line in file:
   mentioned_names = get_mentioned_names(line)
   mentioned_ids = mentioned_names.map(x => if x in users: users[x] else null)
   print "$line | $mentioned_ids"

I was already processing the file with GAWK, so instead of processing it again in Python or C I decided to try to add this to my AWK script. However, I can't find a way to make two passes over the same file, executing different code for each one. Most solutions involve calling AWK several times, but then I'd lose the associative array I built in the first pass.

I could do it in very hacky ways (like cat'ing the file twice, passing it through sed to add a different prefix to all the lines in each cat), but I'd like to be able to understand this code in a couple of months without hating myself.

What would be the AWK way to do this?

PS:

The least terrible way I've found:

function rewind(    i)
{
    # from https://www.gnu.org/software/gawk/manual/html_node/Rewind-Function.html
    # shift remaining arguments up
    for (i = ARGC; i > ARGIND; i--)
        ARGV[i] = ARGV[i-1]

    # make sure gawk knows to keep going
    ARGC++

    # make current file next to get done
    ARGV[ARGIND+1] = FILENAME

    # do it
    nextfile
}

BEGIN {
 count = 1;
}

count == 1 {
 # first pass, fills an associative array
}

count == 2 {
 # second pass, uses the array
}

FNR == 30 {
    # hand-coded file length, horrible
    # (could be automated by running wc -l and passing the result in as a variable)
    if (count == 1) {
        count = 2
        rewind()    # takes no arguments; i is only a local variable
    }
}


asked Feb 16 '15 14:02

jesusiniesta

1 Answer

The idiomatic way to process two separate files, or the same file twice, in awk is like this:

awk 'NR==FNR{ 
    # fill associative array 
    next
}
{
    # use the array
}' file1 file2

The total record number NR is only equal to the record number for the current file FNR on the first file. next skips the second block for the first file. The second block is then processed for the second file. If file1 and file2 are the same file, then this passes through the file twice.
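
As a rough illustration of how this could apply to the tweet problem in the question, here is a minimal sketch. The input layout (tab-separated userid, username, tweet text), the sample data, and the names ids, words and out are all assumptions made up for the example; the question does not say what the real file looks like.

```shell
#!/bin/sh
# Hypothetical sample data: userid <TAB> username <TAB> tweet text
tmp=$(mktemp)
printf '1\talice\thello @bob and @carol\n2\tbob\thi @alice\n3\tdave\tno mentions\n' > "$tmp"

# The same file is named twice, so awk reads it twice.
result=$(awk -F'\t' '
NR == FNR {                 # first pass: remember username -> userid
    ids[$2] = $1
    next
}
{                           # second pass: resolve the @mentions of each tweet
    out = ""
    n = split($3, words, /[^@A-Za-z0-9_]+/)
    for (i = 1; i <= n; i++) {
        if (substr(words[i], 1, 1) == "@") {
            name = substr(words[i], 2)
            if (name in ids)        # keep only users that appear in the file
                out = out (out == "" ? "" : ",") ids[name]
        }
    }
    print $0 " | " out
}' "$tmp" "$tmp")

rm -f "$tmp"
printf '%s\n' "$result"
```

With this sample data only @bob and @alice resolve, so the first two lines end in "| 2" and "| 1", and the third has nothing after the bar. Two caveats: the input has to be a real file, since a pipe cannot be read twice, and if the first file is empty then NR == FNR is also true for the second one, so the first block would run in the wrong pass.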

answered Sep 28 '22 10:09

Tom Fenech
