
How to use awk variables in regular expressions?

Tags: regex, awk

I have a file called domain which contains some domains. For example:

    google.com
    facebook.com
    ...
    yahoo.com

And I have another file called site which contains site URLs and numbers. For example:

    image.google.com    10
    map.google.com      8
    ...
    photo.facebook.com  22
    game.facebook.com   15
    ..

Now I want to add up the numbers for each domain. For example, google.com gets 10+8. So I wrote an awk script like this:

    BEGIN{
      while(getline dom < "./domain" > 0) {
        domain[dom]=0;
      }
      for(dom in domain) {
        while(getline < "./site" > 0) {
          if($1 ~/$dom$)   #if $1 end with $dom
          {
            domain[dom]+=$2;
          }
        }
      }
    }

But the test if($1 ~/$dom$) doesn't work the way I want, because the variable $dom inside the regular expression is interpreted literally. So, the first question is:

Is there any way to use variable $dom in a regular expression?

Then, as I'm new to writing scripts:

Is there any better way to solve the problem I have?

Hancy asked Jul 18 '12 04:07



2 Answers

awk can match against a variable if you don't use the // regex markers.

if ( $0 ~ regex ){ print $0; }

In this case, build up the required regex as a string

regex = dom"$" 

Then match against the regex variable

    if ( $1 ~ regex ) {
      domain[dom]+=$2;
    }
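As a quick check of this approach, here is a minimal sketch that can be run from a shell. The sample lines are modelled on the question (the real files are elided there), and the names `tmp`, `google_total` and `total` are made up for illustration.

```shell
# Sample data modelled on the question; the real file contents are elided there.
tmp=$(mktemp -d)
printf '%s\n' 'image.google.com 10' 'map.google.com 8' 'photo.facebook.com 22' > "$tmp/site"

# dom is passed in with -v; the regex is built as a string and matched with ~.
google_total=$(awk -v dom='google.com' '
  BEGIN { regex = dom "$" }      # string concatenation gives "google.com$"
  $1 ~ regex { total += $2 }     # dynamic regex: no // around the variable
  END { print total + 0 }        # + 0 forces a number even if nothing matched
' "$tmp/site")

echo "$google_total"
rm -rf "$tmp"
```

With the three sample lines above, only the two google.com lines match, so the sum is 10+8.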
Matt answered Sep 18 '22 05:09


First of all, the variable is dom not $dom -- consider $ as an operator to extract the value of the column number stored in the variable dom

Secondly, awk will not interpolate what's between // -- whatever is in there is taken literally as the regex.

You want the match() function where the 2nd argument can be a string that is treated as the regular expression:

if (match($1, dom "$")) {...} 
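The points above can be checked with three one-liners. This is only a sketch; the input strings and the shell variable names (`field`, `literal`, `pos`) are made up for illustration.

```shell
# "$dom" is a field reference: with dom = 2 it means $2.
field=$(echo 'a b c' | awk '{ dom = 2; print $dom }')

# /dom$/ is the literal regex "dom" at end of string; the variable dom is ignored.
literal=$(echo 'image.google.com' | awk -v dom='google.com' '{ print ($1 ~ /dom$/) }')

# match() accepts a string built at run time and returns the start of the match (0 if none).
pos=$(echo 'image.google.com' | awk -v dom='google.com' '{ print match($1, dom "$") }')

echo "$field $literal $pos"
```

The first prints the second field, the second reports no match because the text ends in "com" rather than "dom", and the third reports where google.com starts inside the field.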

I would code a solution like:

    awk '
      FNR == NR {domain[$1] = 0; next}
      {
        for (dom in domain) {
          if (match($1, dom "$")) {
            domain[dom] += $2
            break
          }
        }
      }
      END {for (dom in domain) {print dom, domain[dom]}}
    ' domain site
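One caveat with building the regex from the domain: the "." in google.com is a regex metacharacter, and dom "$" also matches a host such as notgoogle.com. If that matters, a plain string comparison of the suffix avoids regexes entirely. The following is a sketch of that variant, not part of the answer above; the sample data (including the notgoogle.com line) and the names `tmp`, `totals`, `suffix` and `n` are assumptions for illustration.

```shell
# Sample data modelled on the question, plus one made-up line that must NOT be counted.
tmp=$(mktemp -d)
printf '%s\n' 'google.com' 'facebook.com' > "$tmp/domain"
printf '%s\n' 'image.google.com 10' 'map.google.com 8' \
              'photo.facebook.com 22' 'game.facebook.com 15' \
              'notgoogle.com 5' > "$tmp/site"

totals=$(awk '
  FNR == NR { domain[$1] = 0; next }      # first file: load the domains
  {
    for (dom in domain) {
      suffix = "." dom                    # require a dot before the domain
      n = length($1) - length(suffix)     # characters preceding the suffix
      # count the line if the host is the domain itself or ends with ".domain"
      if ($1 == dom || (n >= 0 && substr($1, n + 1) == suffix)) {
        domain[dom] += $2
        break
      }
    }
  }
  END { for (dom in domain) print dom, domain[dom] }
' "$tmp/domain" "$tmp/site" | sort)       # sort: for-in order is unspecified

echo "$totals"
rm -rf "$tmp"
```

The comparison uses only length() and substr(), so no character in the domain needs escaping.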
glenn jackman answered Sep 22 '22 05:09