AWK program using regex to count matching lines

Question

The program is supposed to count the number of lines that begin with a decimal number in parentheses, contain a mix of both upper- and lowercase letters, and end with a period.

I have

BEGIN {x=0}
/^\([0-9[0-9]*) [A-Z][A-z]* [a-z][a-z]* \.$/ {x = x+1}
END{print x}

I have them split across multiple lines because I have been running display(!d) statements for debugging while trying to figure it out. To run it, I use awk -f programName.awk filename.txt. Any help is appreciated.

UPDATE

New code reads

BEGIN{x=0}
/^\([0-9]+\)[A-Za-z]+\.$/{x++}
END{print x}

I edit this with vim EC.awk. Running awk -f EC.awk EC.txt comes back with 1. EC.txt contains 12 lines, 5 of which should be counted.

INPUT FILE vim EC.txt

(1) Line one, this should count.
(2)Line two. Should also count.
3 should not count..
4 not
(5)Yes.
(6). nope
7 OHHH mann
8 This suck
(9)Oh ya? YOU SUCK.
10 Cheaa
(11) BOI.
(12) WoW MoM. Print mofo.

UPDATED CODE

BEGIN{x=0}
/^\([0-9]+\).*?[A-Za-z]+\.$/{x++}
END{print x}

This gives me 6. I believe it's counting line 11, (11) BOI. I'm working on printing out the lines to make sure.

mklement0 · Accepted Answer

For an alternative solution that expresses the intent more simply and clearly and is also locale-aware (doesn't invariably only match ASCII letters), see Ed Morton's helpful answer.

Try the following (POSIX-compliant):

awk '/^\([0-9]+\).*([A-Z].*[a-z]|[a-z].*[A-Z]).*\.$/ { ++x } END { print x+0 }' file

^\([0-9]+\) matches a decimal number in parentheses at the beginning of a line.
\.$ matches a literal period at the end of a line.
.*([A-Z].*[a-z]|[a-z].*[A-Z]).* matches any string in between that:
- Either: contains at least 1 uppercase letter followed by at least 1 lowercase one.
- Or: contains at least 1 lowercase letter followed by at least 1 uppercase one.
- Thus, this expression should match any string containing any mix of lower- and uppercase [ASCII-only] letters, as long as at least 1 uppercase and 1 lowercase letter are present.
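
To check the pieces described above against the question's sample data, here is a minimal sketch. It assumes the 12 lines of EC.txt are exactly as posted; the shell variables input and result are names made up for this illustration, and LC_ALL=C is added only to pin [A-Z] and [a-z] to ASCII ranges. It prints every matching line followed by the count.

```shell
# Sample input copied from the question's EC.txt.
input='(1) Line one, this should count.
(2)Line two. Should also count.
3 should not count..
4 not
(5)Yes.
(6). nope
7 OHHH mann
8 This suck
(9)Oh ya? YOU SUCK.
10 Cheaa
(11) BOI.
(12) WoW MoM. Print mofo.'

# Print each matching line, then the total number of matches.
# LC_ALL=C keeps the bracket ranges ASCII-only.
result=$(printf '%s\n' "$input" |
  LC_ALL=C awk '/^\([0-9]+\).*([A-Z].*[a-z]|[a-z].*[A-Z]).*\.$/ { print; ++x } END { print x+0 }')

printf '%s\n' "$result"
```

With that input, the lines numbered 1, 2, 5, 9 and 12 should be printed, followed by the count.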

As for why your approach didn't work:

Your initial attempt, [A-Z][A-z] *[a-z][a-z]*, only matches lines whose first [ASCII] letter is uppercase; lines whose first letter is lowercase aren't matched.
Your later attempt, [A-Za-z]+, uses a single character set, any of whose characters can match, so it also matches lines containing only uppercase or only lowercase letters, which is why line (11) BOI. also matches.
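
The difference is easy to see on the one problem line. The sketch below is an illustration only: the variable names line, loose and strict are invented here, and the first pattern is your later attempt with the redundant ? after .* dropped, which should not change what it matches.

```shell
line='(11) BOI.'

# Single character class: any run of letters qualifies,
# so all-uppercase text passes.
loose=$(printf '%s\n' "$line" |
  LC_ALL=C awk '/^\([0-9]+\).*[A-Za-z]+\.$/ { n++ } END { print n+0 }')

# Alternation: needs one uppercase and one lowercase letter, in either order.
strict=$(printf '%s\n' "$line" |
  LC_ALL=C awk '/^\([0-9]+\).*([A-Z].*[a-z]|[a-z].*[A-Z]).*\.$/ { n++ } END { print n+0 }')

echo "loose=$loose strict=$strict"
```

The character-class version counts the line and the alternation version does not, which accounts for the count of 6 instead of 5.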

Ed Morton · Answer

I don't know whether this is the expected output, since you didn't include that in your question, but I just coded what you described: count the number of lines that begin with a decimal number in parentheses, contain a mix of both upper- and lowercase letters, and end with a period. I added the print so you can see what it matches, so take a look and see if it does what you want:

$ cat tst.awk
/^\([0-9]+\)/ && /[[:upper:]]/ && /[[:lower:]]/ && /\.$/ { print; cnt++ }
END { print cnt+0 }

$ awk -f tst.awk file
(1) Line one, this should count.
(2)Line two. Should also count.
(5)Yes.
(9)Oh ya? YOU SUCK.
(12) WoW MoM. Print mofo.
5

Don't get stuck thinking that the condition part of an awk statement has to be a single regexp, as it would be in sed or grep. It doesn't: it can be a compound condition that combines regexp segments with ands and ors, if that makes your code simpler and clearer, as it does in this case IMHO.
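
Compound conditions also combine with negation. As a rough sketch (the variables input and rejected are names chosen for this example, and the sample data is assumed to be the question's EC.txt), the following lists the lines that start with a parenthesized number but fail one of the other tests, which can help when debugging a count that looks wrong:

```shell
# Sample input copied from the question's EC.txt.
input='(1) Line one, this should count.
(2)Line two. Should also count.
3 should not count..
4 not
(5)Yes.
(6). nope
7 OHHH mann
8 This suck
(9)Oh ya? YOU SUCK.
10 Cheaa
(11) BOI.
(12) WoW MoM. Print mofo.'

# A pattern with no action prints the line. Here: the numeric prefix is
# present, but at least one of the remaining three conditions fails.
rejected=$(printf '%s\n' "$input" |
  awk '/^\([0-9]+\)/ && !(/[[:upper:]]/ && /[[:lower:]]/ && /\.$/)')

printf '%s\n' "$rejected"
```

On the sample data this should print (6). nope and (11) BOI., the two parenthesized lines that the counting script skips.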

Tags: regex, bash, awk

ChrisFocker
