
Filter column with awk and regexp

I've a pretty simple question. I've a file containing several columns and I want to filter them using awk.

The column of interest is the 6th, and I want to find every string that consists of:

  • starting with a number from 1 to 100
  • after that one "S" or a "M"
  • again a number from 1 to 100
  • after that one "S" or a "M"

For example, 20S50M is OK.

I tried :

awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt 

but it didn't work... What am I doing wrong?

Nicolas Rosewick Avatar asked Sep 23 '13 14:09




2 Answers

This should do the trick:

awk '$6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/' file 

Regexplanation:

^                        # Match the start of the string
(([1-9]|[1-9][0-9]|100)  # Match a single digit 1-9 or double digit 10-99 or 100
[SM]                     # Character class matching the character S or M
){2}                     # Repeat everything in the parens twice
$                        # Match the end of the string
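To try the pattern without a real file, here is a minimal sketch on made-up data (the rows and the names sample, num and matched are only for illustration). It assumes some awk builds, such as older mawk versions, do not understand the {2} interval, so the number group is kept in a shell variable and written out twice; the match is meant to be the same as the one-liner above.

```shell
# Hypothetical sample data: six whitespace-separated columns, pattern in column 6.
sample=$(mktemp)
cat > "$sample" <<'EOF'
r1 a b c d 20S50M
r2 a b c d 100M1S
r3 a b c d 0S50M
r4 a b c d 101S5M
r5 a b c d 20S50MX
r6 a b c d 7M
EOF

# One number from 1 to 100, defined once and reused twice in the regex.
num='([1-9]|[1-9][0-9]|100)'

# Pass the regex in as a string variable and match column 6 against it.
matched=$(awk -v re="^${num}[SM]${num}[SM]\$" '$6 ~ re' "$sample")
printf '%s\n' "$matched"

rm -f "$sample"
```

Only the r1 and r2 rows should come back: r3 starts with 0, r4 has 101, r5 has a trailing character that the $ anchor rejects, and r6 has a single number/letter pair.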

You have quite a few issues with your statement:

awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt 
  • == is the string comparison operator. The regex comparison operator is ~.
  • You don't quote regex strings (you never quote anything with single quotes in awk besides the script itself), and your script is missing its closing single quote.
  • [0-9] is the character class for the digit characters, not a numeric range. It matches any single character in the set 0,1,2,3,4,5,6,7,8,9, not any numerical value inside a range. So [1-100] is not the regular expression for numbers in the range 1 - 100; it matches a single 1 or 0.
  • [SM] is equivalent to (S|M). What you tried, [S|M], is the same as (S|\||M). You don't need the OR operator in a character class.
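The two character-class points are easy to check by hand. This is a small sketch that feeds single characters to awk (the variable names are made up for the demo):

```shell
# [1-100] is the range 1-1 plus two literal zeros, so only "0" and "1" match.
digits=$(printf '%s\n' 0 1 2 5 9 | awk '/^[1-100]$/')
printf '%s\n' "$digits"

# Inside brackets | is an ordinary character, so [S|M] also accepts a literal "|".
pipe_class=$(printf '%s\n' S M '|' X | awk '/^[S|M]$/')
printf '%s\n' "$pipe_class"

# [SM] accepts only S or M.
plain_class=$(printf '%s\n' S M '|' X | awk '/^[SM]$/')
printf '%s\n' "$plain_class"
```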

Awk uses the structure condition{action}. If the condition is true, the actions in the following {} block are executed for the current record. The condition in my solution is $6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/, which can be read as "does the sixth column match the regular expression?". If it does, the line is printed, because when you don't give any action awk executes {print $0} by default.
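As a rough illustration of the condition{action} structure, the three forms below should print the same lines. The rows are made up, and the condition is deliberately simpler than the one in the solution (any digits, not 1 to 100) to keep the focus on the structure:

```shell
# Hypothetical input: three rows, pattern in column 6.
rows='a b c d e 20S50M
a b c d e 7X
a b c d e 3M9S'

# Condition only: awk falls back to the default action, { print $0 }.
implicit=$(printf '%s\n' "$rows" | awk '$6 ~ /^[0-9]+[SM][0-9]+[SM]$/')

# Same condition with the default action written out.
explicit=$(printf '%s\n' "$rows" | awk '$6 ~ /^[0-9]+[SM][0-9]+[SM]$/ { print $0 }')

# Same test moved into an if inside an unconditional action block.
with_if=$(printf '%s\n' "$rows" | awk '{ if ($6 ~ /^[0-9]+[SM][0-9]+[SM]$/) print }')

printf '%s\n' "$implicit"
```

The first form is the idiomatic one; the if version is closest to what the question was trying to write.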

Chris Seymour Avatar answered Mar 06 '23 08:03



Regexes cannot check for numeric values. "A number from 1 to 100" is outside what regexes can do. What you can do is check for "1-3 digits."

You want something like this (written with [0-9], because awk's regex syntax has no \d shorthand):

/[0-9]{1,3}[SM][0-9]{1,3}[SM]/ 

Note that the character class [SM] doesn't have the | alternation character. You would only need that if you were writing it as (S|M).
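If you would rather let awk compare the numbers itself instead of encoding the range in the regex, one possible sketch is to check the shape with a loose regex and then split the field on the letters. This is a different technique from either answer, the inputs are made up, and the demo reads the value from $1; with the asker's file it would be $6. Unlike the stricter regex, it accepts leading zeros such as 020S5M.

```shell
in_range=$(printf '%s\n' 20S50M 100M1S 0S50M 101S5M 999S1M 7M |
  awk '{
    # Shape check first: digits, S or M, digits, S or M.
    if ($1 !~ /^[0-9]+[SM][0-9]+[SM]$/) next
    # Split on the letters: "20S50M" gives parts[1]="20", parts[2]="50".
    split($1, parts, /[SM]/)
    # Adding 0 forces a numeric comparison.
    a = parts[1] + 0; b = parts[2] + 0
    if (a >= 1 && a <= 100 && b >= 1 && b <= 100) print
  }')
printf '%s\n' "$in_range"
```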

Andy Lester Avatar answered Mar 06 '23 09:03
