I am processing a fairly big collection of Tweets and I'd like to obtain, for each tweet, its mentions (other user's names, prefixed with an @
), if the mentioned user is also in the file:
users = new Dictionary()
for each line in file:
username = get_username(line)
userid = get_userid(line)
users.add(key = userid, value = username)
for each line in file:
mentioned_names = get_mentioned_names(line)
mentioned_ids = mentioned_names.map(x => if x in users: users[x] else null)
print "$line | $mentioned_ids"
I was already processing the file with GAWK, so instead of processing it again in Python or C I decided to try and add this to my AWK script. However, I can't find a way to make to passes over the same file, executing different code for each one. Most solutions imply calling AWK several times, but then I'd loose the associative array I made in the first pass.
I could do it in very hacky ways (like cat
'ing the file twice, passing it through sed
to add a different prefix to all the lines in each cat
), but I'd like to be able to understand this code in a couple of months without hating myself.
What would be the AWK way to do this?
The less terrible way I've found:
function rewind( i)
{
# from https://www.gnu.org/software/gawk/manual/html_node/Rewind-Function.html
# shift remaining arguments up
for (i = ARGC; i > ARGIND; i--)
ARGV[i] = ARGV[i-1]
# make sure gawk knows to keep going
ARGC++
# make current file next to get done
ARGV[ARGIND+1] = FILENAME
# do it
nextfile
}
BEGIN {
count = 1;
}
count == 1 {
# first pass, fills an associative array
}
count == 2 {
# second pass, uses the array
}
FNR == 30 {
# handcoded length, horrible
# could also be automated calling wc -l, passing as parameter
if (count == 1) {
count = 2;
rewind(1)
}
}
The awk variables $1 or $2 through $nn represent the fields of each record and should not be confused with shell variables that use the same style of names. Inside an awk script $1 refers to field 1 of a record; $2 to field 2 of a record.
awk '{print $9}' the last ls -l column including any spaces in the file name.
An Example with Two Rules. The awk utility reads the input files one line at a time. For each line, awk tries the patterns of each rule. If several patterns match, then several actions execute in the order in which they appear in the awk program.
Overview. Both the awk command and the sed command are powerful Linux command-line text processing utilities. We know that the sed command has a handy -i option to edit the input file “in-place”. In other words, it can save the changes back to the input file.
The idiomatic way to process two separate files, or the same file twice in awk is like this:
awk 'NR==FNR{
# fill associative array
next
}
{
# use the array
}' file1 file2
The total record number NR
is only equal to the record number for the current file FNR
on the first file. next
skips the second block for the first file. The second block is then processed for the second file. If file1
and file2
are the same file, then this passes through the file twice.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With