
Regex question: Match sequence only n times on a random place

Tags: regex, grep
I have a regex question, take for example:

  1. ...AAABZBZBCCCDDD...
  2. ...BZBZBDDDBZBZBCCC...

I am looking for a regular expression that matches lines in which BZBZB occurs exactly n times. So, if I wanted to match the sequence only once, I should get only the first line as output.

The string occurs at random places in the text, and the regex should be compatible with grep or egrep.

Thanks in advance.

asked Jan 07 '11 21:01 by 3sdmx



2 Answers

grep '\(.*BZBZB\)\{5\}' will match lines containing BZBZB 5 times, but it will also match any line in which the string appears more than 5 times, because grep checks whether any substring of a line matches. Because grep has no way to negate a string within its regular expressions (only individual characters), this cannot be done with a single command unless, for example, you knew that the characters used in the string to be matched were not used elsewhere.

However, you can do this in two grep commands:

grep '\(.*BZBZB\)\{5\}' temp.txt | grep -v '\(.*BZBZB\)\{6\}'

will return lines in which BZBZB appears exactly 5 times. (Basically, it's doing a positive check for 5 or more times and then a negative check for six or more times.)
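The same two-step filter can be written for any n. The following is a minimal sketch, not part of the original answer: match_exactly is a hypothetical helper name, and it assumes n is at least 1 and that the pattern is a valid basic regular expression. Occurrences are counted without overlap, so BZBZBZB counts as one.

```shell
# Hypothetical helper: read lines on stdin and print those that contain
# exactly n non-overlapping occurrences of the pattern.
#   $1 = pattern (basic regular expression), $2 = n (assumed >= 1)
match_exactly() {
    pat=$1
    n=$2
    # Positive check: at least n occurrences.
    # Negative check: drop lines with at least n+1 occurrences.
    grep "\(.*$pat\)\{$n\}" | grep -v "\(.*$pat\)\{$((n + 1))\}"
}

# The two sample lines from the question; only the first has BZBZB once.
printf '%s\n' 'AAABZBZBCCCDDD' 'BZBZBDDDBZBZBCCC' | match_exactly BZBZB 1
```

With the question's sample lines and n=1 this prints only AAABZBZBCCCDDD; with n=2 it would print only the second line.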

answered Oct 13 '22 03:10 by Keith Irwin


From the grep man page:

   -m NUM, --max-count=NUM
    Stop  reading  a file after NUM matching lines.  If the input is
    standard input from a regular file, and NUM matching  lines  are
    output,  grep  ensures  that the standard input is positioned to
    just after the last matching line before exiting, regardless  of
    the  presence of trailing context lines.  This enables a calling
    process to resume a search.  When grep stops after NUM  matching
    lines,  it  outputs  any trailing context lines.  When the -c or
    --count option is also  used,  grep  does  not  output  a  count
    greater  than NUM.  When the -v or --invert-match option is also
    used, grep stops after outputting NUM non-matching lines.

So we need two grep expressions:

grep -e "BZ" -o
grep -e "BZ" -m n

The first one prints every instance of "BZ" in its input, each on its own line and without the surrounding text. The second one reads those lines and stops once n of them have matched. Note that this caps the output at the first n occurrences; it does not select input lines that contain exactly n occurrences.

$ echo "ABZABZABX" | grep -e "BZ" -o | grep -e "BZ" -m 1
BZ
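Because -o prints one match per line, the same pipeline can also count how many times a string occurs, which is closer to what the question asks. This is only a sketch: count_matches is a hypothetical helper name, and it counts non-overlapping matches in a single string.

```shell
# Hypothetical helper: count non-overlapping occurrences of a pattern
# in a string.  $1 = string to search, $2 = pattern
count_matches() {
    # -o emits each match on its own line; -c counts the matching lines.
    printf '%s\n' "$1" | grep -o -e "$2" | grep -c -e "$2"
}

count_matches "ABZABZABX" "BZ"   # prints 2
```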

Hopefully that is what you needed.

answered Oct 13 '22 03:10 by mklauber