Using grep, I am trying to match lines which are comprised of two characters, one followed repeated followed by the other, but only match when the number of first character occurances is equal to the occurrences of the second character.
As an example, imagine that I can only match two characters like '0' and '1'. Now imagine that if there are n '0' characters then there must be n '1' characters following directly afterward. For example:
would all match. But:
wouldn't match.
I've been playing around with capture groups and trawling through perldoc for more info on grep -P but haven't found any leads to solve my problem - with grep at least.
How could I make a grep command to match strings given these constraints?
EDIT:
To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).
GNU grep supports three regular expression syntaxes, Basic, Extended, and Perl-compatible. In its simplest form, when no regular expression type is given, grep interpret search patterns as basic regular expressions. To interpret the pattern as an extended regular expression, use the -E ( or --extended-regexp ) option.
To match a character that is special to grep –E, put a backslash ( \ ) in front of the character. It is usually simpler to use grep –F when you don't need special pattern matching.
Basic and extended regular expressions are two variations on the syntax of the specified pattern. Basic Regular Expression (BRE) syntax is the default in sed (and similarly in grep ). Use the POSIX-specified -E option ( -r , --regexp-extended ) to enable Extended Regular Expression (ERE) syntax.
See EDIT below for update to clarifications
Here's a Perl one-liner instead of grep
perl -wne'print if /^((.)\g{-1}+)((.)\g{-1}+)$/ and length $1 == length $3' file
The length comparison of matches is clearly done outside of regex; I don't see that it can be done inside nicely†, and I don't see anything wrong with using code which isn't regex :)
This doesn't match with a single character (ab
), what wouldn't really make sense and what seems excluded from the question. The anchors (^
and $
) make it so that it can match only strings with two characters, what seems to be specified.
That \g{-1}
is a relative backreference. It matches the same subpattern that was captured last, which is what we need instead of a simple backreference (\g1
).
This is need because \g1
refers to the first capture, the set of parens started first (leftmost), which is the capture of the whole pattern. (We can use \g2
but it is bad practice to count them off.)
This can be made nicer by using named references, but then it would be more elaborate as well.
EDIT Following the clarifications, whereby it must be 0
s first then the same number of 1
s, and 0
-repetitions count (so an empty line), as well as 1
-repetition of course (so 01
). This simplifies matters greatly, for just
perl -wne'print if /^(0*)(1*)$/ and length $1 == length $2' file
The 0
and 1
can be made into variables which can be supplied as external arguments, if desired (so it can be any grammar, a
and b
etc).
It prints as expected on the example input from the question, so on input file
0011 000111 00000000001111111111 01 011 1100 110001
it prints
0011 000111 00000000001111111111 01
(the last empty line in output being the empty line in the middle, after which no more lines match)
† That is, without employing tricky features that run code inside regex, which would make it far more complex. If you still wish to play with that see it in perlre and in perlretut.
Or, this can also be done using recursion in regex, with similar (or little lesser?) complexity.
This awk
one line should do the job:
cat file
0011
000111
00000000001111111111
011
1100
11000
awk '/^0*1*$/ && gsub(/0/, "&") == gsub(/1/, "&")' file
0011
000111
00000000001111111111
Or if you want to print numbers that may have 1
s followed by 0
s then use:
# awk command
awk '/^(0*1*|1*0*)$/ && gsub(/0/, "&") == gsub(/1/, "&")' file
0011
000111
00000000001111111111
1100
gsub
function returns number of replacements.
Since you've used grep
tag, here is one gnu grep
command with -P
(PCRE recursive) regex:
grep -P '^(0(?1)?1|1(?1)?0)?$' file
0011
000111
00000000001111111111
1100
grep RegEx Demo
With your shown samples only, in case you are ok with awk
you could try following.
awk 'match($0,/^0+/){num1=RLENGTH;match($0,/1+/);if(num1==RLENGTH){print}}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/^0+/){ ##Using match function to match starting zeroes here.
num1=RLENGTH ##Creating num1 here with rlength.
match($0,/1+/) ##Matching all ones now.
if(num1==RLENGTH){ print } ##Checking condition if num1 is equal to current length then print the line.
}
' Input_file ##mentioning Input_file name here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With