 

AWK program using regex to count matching lines

Tags: regex, bash, awk

The program is supposed to count the number of lines that begin with a decimal number in parentheses, contain a mix of both upper- and lowercase letters, and end with a period.

I have

BEGIN {x=0}
/^\([0-9[0-9]*) [A-Z][A-z]* [a-z][a-z]* \.$/ {x = x+1}
END{print x}

I have split them across multiple lines because I have been running display(!d) statements for debugging while trying to figure it out. To run it, I use awk -f programName.awk filename.txt. Any help is appreciated.

UPDATE

New code reads

BEGIN{x=0}
/^\([0-9]+\)[A-Za-z]+\.$/{x++}
END{print x}

I edit this with vim EC.awk. Running awk -f EC.awk EC.txt comes back with 1, but 5 of the 12 lines in EC.txt should be counted.

INPUT FILE vim EC.txt

(1) Line one, this should count.
(2)Line two. Should also count.
3 should not count..
4 not
(5)Yes.
(6). nope
7 OHHH mann
8 This suck
(9)Oh ya? YOU SUCK.
10 Cheaa
(11) BOI.
(12) WoW MoM. Print mofo.

UPDATED CODE

BEGIN{x=0}
/^\([0-9]+\).*?[A-Za-z]+\.$/{x++}
END{print x}

This gives me 6. I believe it's counting line 11, (11) BOI. I'm working on printing out the lines to make sure.

ChrisFocker asked Dec 25 '22 08:12



2 Answers

For an alternative solution that expresses the intent more simply and clearly and is also locale-aware (doesn't invariably only match ASCII letters), see Ed Morton's helpful answer.

Try the following (POSIX-compliant):

awk '/^\([0-9]+\).*([A-Z].*[a-z]|[a-z].*[A-Z]).*\.$/ { ++x } END { print x+0 }' file
  • ^\([0-9]+\) matches a decimal number in parentheses at the beginning of a line.

  • \.$ matches a literal period at the end of a line.

  • .*([A-Z].*[a-z]|[a-z].*[A-Z]).* matches any string in between that:

    • Either: contains at least 1 uppercase letter followed by at least 1 lowercase one.
    • Or: contains at least 1 lowercase letter followed by at least 1 uppercase one.
    • Thus, this expression should match any string containing any mix of lower- and uppercase [ASCII-only] letters, as long as at least 1 uppercase and 1 lowercase letter are present.
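
As a quick check, the pattern above can be run against the sample input from the question. This is only a sketch: the function names sample_input and count_mixed are made up for illustration, a print was added to show which lines match, and LC_ALL=C is an assumption used here to keep the [A-Z] and [a-z] ranges ASCII-only.

```shell
# Hypothetical helper: emits the sample input from the question.
sample_input() {
  cat <<'EOF'
(1) Line one, this should count.
(2)Line two. Should also count.
3 should not count..
4 not
(5)Yes.
(6). nope
7 OHHH mann
8 This suck
(9)Oh ya? YOU SUCK.
10 Cheaa
(11) BOI.
(12) WoW MoM. Print mofo.
EOF
}

# Hypothetical helper: reads lines on stdin, prints each matching line,
# then prints the number of matches on the last line.
count_mixed() {
  LC_ALL=C awk '/^\([0-9]+\).*([A-Z].*[a-z]|[a-z].*[A-Z]).*\.$/ { print; ++x } END { print x+0 }'
}

sample_input | count_mixed
```

With this input the last line printed should be 5; (11) BOI. is skipped because it contains no lowercase letter.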

As for why your approach didn't work:

  • Your initial solution attempt, [A-Z][A-z]* [a-z][a-z]*, only matches lines whose first [ASCII] letter is uppercase; in other words, lines whose first letter is lowercase aren't matched.
  • Your later solution attempt, [A-Za-z]+, uses a single character set, any member of which may match, so it also matches lines containing only uppercase or only lowercase letters; this is why line (11) BOI. matches as well.
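
To see the second point in action, the following sketch prints the lines that the looser [A-Za-z]+ pattern accepts but the mixed-case pattern rejects. The function name show_single_case is hypothetical, the `?` after `.*` from the question is dropped (POSIX ERE has no non-greedy modifier), and LC_ALL=C is again an assumption to keep the ranges ASCII-only.

```shell
# Hypothetical helper: reads lines on stdin and reports those matched by the
# loose pattern but not by the mixed-case pattern.
show_single_case() {
  LC_ALL=C awk '
    /^\([0-9]+\).*[A-Za-z]+\.$/ && !/^\([0-9]+\).*([A-Z].*[a-z]|[a-z].*[A-Z]).*\.$/ {
      print "single-case only: " $0
    }
  '
}

printf '%s\n' '(11) BOI.' '(5)Yes.' | show_single_case
```

Only (11) BOI. is reported here, since (5)Yes. satisfies both patterns.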
mklement0 answered Jan 05 '23 14:01



I don't know whether this is the expected output, since you didn't include that in your question, but I just coded what you described: count the number of lines that begin with a decimal number in parentheses, contain a mix of both upper- and lowercase letters, and end with a period. I added the print so you can see what it matches, so take a look and see if it does what you want:

$ cat tst.awk
/^\([0-9]+\)/ && /[[:upper:]]/ && /[[:lower:]]/ && /\.$/ { print; cnt++ }
END { print cnt+0 }

$ awk -f tst.awk file
(1) Line one, this should count.
(2)Line two. Should also count.
(5)Yes.
(9)Oh ya? YOU SUCK.
(12) WoW MoM. Print mofo.
5

Don't get stuck thinking that the condition part of an awk statement has to be a single regexp, as it would be in sed or grep. It doesn't: it can be a compound condition that combines regexp segments with ands/ors, if that makes your code simpler and clearer, as IMHO it does in this case.
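
As a further illustration of that point, a condition can also mix regexp segments with a call to a user-defined function. This is a minimal sketch rather than a recommended rewrite: has_mixed_case and count_compound are hypothetical names, and it assumes an awk that supports POSIX character classes such as [[:upper:]].

```shell
# Hypothetical helper: reads lines on stdin and prints how many satisfy
# the compound condition.
count_compound() {
  awk '
    # True when s contains at least one uppercase and one lowercase letter.
    function has_mixed_case(s) { return s ~ /[[:upper:]]/ && s ~ /[[:lower:]]/ }

    # Regexp segments and a function call combined with &&.
    /^\([0-9]+\)/ && has_mixed_case($0) && /\.$/ { cnt++ }

    END { print cnt+0 }
  '
}

printf '%s\n' '(1) Line one.' '(11) BOI.' '(6). nope' '7 OHHH mann.' | count_compound
```

Only (1) Line one. meets all three parts of the condition, so this prints 1.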

Ed Morton answered Jan 05 '23 12:01
