
Whole-word matching with regex.h

Tags: c++, regex

I want a C++ regex that matches "bananas" or "pajamas" but not "bananas2" or "bananaspajamas" or "banana" or basically anything besides those exact two words. So I did this:

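#include <regex.h>
#include <stdio.h>

int main()
{
  regex_t rexp;

  /* "\\b" in the C string literal reaches regcomp() as \b */
  int rv = regcomp(&rexp, "\\bbananas\\b|\\bpajamas\\b", REG_EXTENDED | REG_NOSUB);
  if (rv != 0) {
    printf("Abandon hope, all ye who enter here\n");
    return 1;
  }
  /* With REG_NOSUB no match offsets are reported, so none are requested */
  int diditmatch = regexec(&rexp, "bananas", 0, NULL, 0);
  /* regexec() returns 0 on a match and REG_NOMATCH otherwise */
  printf("%d %d\n", diditmatch, REG_NOMATCH);
  regfree(&rexp);
  return 0;
}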

and it printed 1 1 as if there wasn't a match. What happened? I also tried \bbananas\b|\bpajamas\b for my regex and that failed too.

I already asked about std::regex in "Whole-word matching using regex", but std::regex is awful and slow, so I'm trying regex.h.

asked Jun 29 '15 09:06 by Inquisitive Idiot


1 Answer

The POSIX standard specifies neither word-boundary syntax nor look-behind and look-ahead syntax (which could be used to emulate a word boundary) for either BRE or ERE. Therefore, it is not possible to write a regex using word-boundary syntax that works across all POSIX-compliant platforms. On a platform whose regcomp() does not implement \b, a pattern like the one in the question simply fails to match.
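Although POSIX has no boundary syntax, a boundary can be approximated using only standard ERE features: anchors and bracket expressions. The following is a minimal sketch of that idea, not a drop-in replacement for \b. The function names contains_whole_word and is_exact_word are hypothetical, and "word character" is assumed to mean alphanumerics plus underscore.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` contains "bananas" or "pajamas"
   as a whole word, 0 if it does not, and -1 if the pattern fails to compile. */
int contains_whole_word(const char *text)
{
    /* Portable POSIX ERE. The left boundary is spelled out as
       "start of string or a non-word character", the right boundary as
       "a non-word character or end of string". */
    static const char pattern[] =
        "(^|[^[:alnum:]_])(bananas|pajamas)([^[:alnum:]_]|$)";
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rv = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rv == 0 ? 1 : 0;
}

/* Hypothetical helper: returns 1 only if the whole string is exactly
   "bananas" or "pajamas", 0 otherwise, and -1 on a compile error. */
int is_exact_word(const char *text)
{
    regex_t re;

    /* ^ and $ are standard ERE anchors, so no word boundary is needed. */
    if (regcomp(&re, "^(bananas|pajamas)$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rv = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rv == 0 ? 1 : 0;
}
```

If the goal is really "these two exact strings and nothing else", as the question suggests, the anchored form in is_exact_word is likely all that is needed. Note that the emulated boundary in contains_whole_word consumes the neighbouring character, so it is only suitable for yes/no tests like this one. With match offsets enabled, the reported match would include that extra character.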

For a portable solution with true word-boundary support, consider using PCRE, or Boost.Regex if you plan to code in C++.

Otherwise, you are stuck with a non-portable solution. If you can accept that restriction, there are several alternatives:

  • If you link against the GNU C library, it extends the syntax with word boundaries, among other things: \b (word boundary), \B (non-word boundary), \< (start of word), \> (end of word).
  • Some systems extend the BRE and ERE grammar with [[:<:]] (start of word) and [[:>:]] (end of word).
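Because these extensions are platform-specific, one option is to probe for them at run time before relying on them. The sketch below assumes the \b extension described in the first bullet. The helper names ere_matches, word_boundary_supported and has_word_gnu are hypothetical, and the probe only checks two sample strings, so treat it as a sanity check, not a guarantee.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the ERE `pattern` matches `text`,
   0 if it does not, and -1 if the pattern fails to compile. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rv = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rv == 0 ? 1 : 0;
}

/* Probe: does this platform's regcomp() treat \b as a word boundary?
   A standalone "x" must match, an "x" embedded in a word must not. */
int word_boundary_supported(void)
{
    return ere_matches("\\bx\\b", "a x b") == 1
        && ere_matches("\\bx\\b", "axb") == 0;
}

/* Non-portable: relies on the \b extension (e.g. the GNU C library). */
int has_word_gnu(const char *text)
{
    return ere_matches("\\b(bananas|pajamas)\\b", text);
}
```

A caller could check word_boundary_supported() once at startup and fall back to a different pattern or library when it returns 0.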
answered Nov 15 '22 11:11 by nhahtdh