For example, in this text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc eu tellus vel nunc pretium lacinia. Proin sed lorem. Cras sed ipsum. Nunc a libero quis risus sollicitudin imperdiet.
I want to match the word after 'ipsum'.
To run a “whole words only” search using a regular expression, simply place the word between two word boundaries, as we did with ‹ \bcat\b ›. The first ‹ \b › requires the ‹ c › to occur at the very start of the string, or after a nonword character.
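As a rough illustration of the "whole words only" idea, here is a minimal Java sketch. The class name WholeWordSearch, the method findWholeWordCat, and the sample sentence are made up for this example and are not from the quoted text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: finds "cat" only where it stands alone as a word.
final class WholeWordSearch {
    // Returns the start index of every whole-word occurrence of "cat".
    static List<Integer> findWholeWordCat(String text) {
        Matcher m = Pattern.compile("\\bcat\\b").matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "concatenate" and "cats" contain "cat" but are skipped,
        // because a word character sits next to it on at least one side.
        System.out.println(findWholeWordCat("cat concatenate cats, the cat.")); // prints [0, 26]
    }
}
```

Without the two \b anchors, the same search would also hit the "cat" inside "concatenate" and "cats".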
The metacharacter “^” matches the beginning of a string, i.e. the position before its first character, not the character itself. For example, the expression “^\d” matches a string or line that starts with a digit, and the expression “^[a-z]” matches a string or line that starts with a lowercase letter.
'^' matches the start of the string in most regex implementations; in multiline mode it also matches at the start of each line.
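A small Java sketch of the caret, assuming the two expressions quoted above. The class and method names are invented for illustration, and Pattern.MULTILINE is the Java flag that makes ^ match at line starts as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the ^ anchor.
final class CaretAnchorDemo {
    // True if the string begins with a digit.
    static boolean startsWithDigit(String s) {
        return Pattern.compile("^\\d").matcher(s).find();
    }

    // True if the string begins with a lowercase ASCII letter.
    static boolean startsWithLowercase(String s) {
        return Pattern.compile("^[a-z]").matcher(s).find();
    }

    // Counts lines that begin with a digit; MULTILINE lets ^ match after each line break.
    static int countLinesStartingWithDigit(String text) {
        Matcher m = Pattern.compile("^\\d", Pattern.MULTILINE).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(startsWithDigit("42 apples"));      // prints true
        System.out.println(startsWithDigit("apples 42"));      // prints false
        System.out.println(startsWithLowercase("Apple"));      // prints false
        System.out.println(countLinesStartingWithDigit("1 one\ntwo\n3 three")); // prints 2
    }
}
```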
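The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length. There are three different positions that qualify as word boundaries: before the first character in the string, if the first character is a word character; after the last character in the string, if the last character is a word character; and between two characters in the string, where one is a word character and the other is not.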
This sounds like a job for lookbehinds, though you should be aware that not all regex flavors support them. In your example:
(?<=\bipsum\s)(\w+)
This will match any sequence of word characters (letters, digits, or underscores) that follows "ipsum" as a whole word followed by a single whitespace character. Because it does not match "ipsum" itself, you don't need to worry about reinserting it in, e.g., replacements.
As I said, though, some flavors don't support lookbehind at all (JavaScript, for example, lacked it before ES2018). Many others (most, in fact) only support "fixed width" lookbehinds, so you could use this example, but not one with repetition operators inside the lookbehind. (In other words, (?<=\b\w+\s+)(\w+) wouldn't work.)
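To make the lookbehind concrete, here is a minimal Java sketch that runs the fixed-width pattern over the question's text and also uses it in a replacement. The class LookbehindDemo and its method names are hypothetical; Java is only one of the flavors that accept this pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the fixed-width lookbehind.
final class LookbehindDemo {
    static final String TEXT = "Lorem ipsum dolor sit amet, consectetur "
            + "adipiscing elit. Nunc eu tellus vel nunc pretium "
            + "lacinia. Proin sed lorem. Cras sed ipsum. Nunc "
            + "a libero quis risus sollicitudin imperdiet.";

    // The lookbehind asserts that "ipsum" plus one whitespace character precedes the word.
    static final Pattern AFTER_IPSUM = Pattern.compile("(?<=\\bipsum\\s)(\\w+)");

    // Collects every word that directly follows "ipsum" and a single whitespace character.
    static List<String> wordsAfterIpsum(String text) {
        Matcher m = AFTER_IPSUM.matcher(text);
        List<String> words = new ArrayList<>();
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    // Replaces only the following word; "ipsum" is not part of the match, so it stays put.
    static String replaceWordAfterIpsum(String text, String replacement) {
        return AFTER_IPSUM.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(wordsAfterIpsum(TEXT)); // prints [dolor]
        System.out.println(replaceWordAfterIpsum("Lorem ipsum dolor sit amet", "WORD"));
        // prints Lorem ipsum WORD sit amet
    }
}
```

Only "dolor" is found here: the second "ipsum" is followed by a period before the space, so the single \s in the lookbehind does not line up with it.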
Some of the other responders have suggested using a regex that doesn't depend on lookbehinds, but I think a complete, working example is needed to get the point across. The idea is that you match the whole sequence ("ipsum" plus the next word) in the normal way, then use a capturing group to isolate the part that interests you. For example:
String s = "Lorem ipsum dolor sit amet, consectetur "
    + "adipiscing elit. Nunc eu tellus vel nunc pretium "
    + "lacinia. Proin sed lorem. Cras sed ipsum. Nunc "
    + "a libero quis risus sollicitudin imperdiet.";
Pattern p = Pattern.compile("ipsum\\W+(\\w+)");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group(1));
}
Note that this prints both "dolor" and "Nunc". To do that with the lookbehind version, you would have to do something hackish like:
Pattern p = Pattern.compile("(?<=ipsum\\W{1,2})(\\w+)");
That's in Java, which requires the lookbehind to have an obvious maximum length. Some flavors don't have even that much flexibility, and of course, some don't support lookbehinds at all.
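For comparison, this sketch runs the capturing-group pattern and the bounded-lookbehind pattern side by side on the same text. The class BoundedLookbehindDemo and the helper collect are names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two approaches.
final class BoundedLookbehindDemo {
    static final String TEXT = "Lorem ipsum dolor sit amet, consectetur "
            + "adipiscing elit. Nunc eu tellus vel nunc pretium "
            + "lacinia. Proin sed lorem. Cras sed ipsum. Nunc "
            + "a libero quis risus sollicitudin imperdiet.";

    // Collects the requested group from every match of the regex in TEXT.
    static List<String> collect(String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(TEXT);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(group));
        }
        return found;
    }

    // Matches "ipsum" plus the next word, then isolates the word with group 1.
    static List<String> viaCapturingGroup() {
        return collect("ipsum\\W+(\\w+)", 1);
    }

    // The whole match (group 0) is already just the word; \W{1,2} keeps the lookbehind bounded.
    static List<String> viaBoundedLookbehind() {
        return collect("(?<=ipsum\\W{1,2})(\\w+)", 0);
    }

    public static void main(String[] args) {
        System.out.println(viaCapturingGroup());    // prints [dolor, Nunc]
        System.out.println(viaBoundedLookbehind()); // prints [dolor, Nunc]
    }
}
```

The {1,2} bound is enough for this text because at most two nonword characters (". ") separate "ipsum" from the next word; longer gaps would need a larger bound.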
However, the biggest problem people seem to be having in their examples is not with lookbehinds, but with word boundaries. Both David Kemp and ck seem to expect \b to match the space character following the 'm', but it doesn't; it matches the position (or boundary) between the 'm' and the space.
It's a common mistake, one I've even seen repeated in a few books and tutorials, but the word-boundary construct, \b, never matches any characters. It's a zero-width assertion, like lookarounds and anchors (^, $, \z, etc.), and what it matches is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.
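A quick way to see that \b consumes nothing is to list where it matches. The following is a minimal Java sketch; the class WordBoundaryDemo and its methods are hypothetical names, and the input "ipsum dolor" is just a fragment of the question's text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing that \b matches positions, not characters.
final class WordBoundaryDemo {
    // Returns every index at which \b matches; each match is empty (start == end).
    static List<Integer> boundaryPositions(String text) {
        Matcher m = Pattern.compile("\\b").matcher(text);
        List<Integer> positions = new ArrayList<>();
        while (m.find()) {
            if (m.start() != m.end()) {
                throw new IllegalStateException("\\b unexpectedly consumed characters");
            }
            positions.add(m.start());
        }
        return positions;
    }

    // True if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Boundaries sit before 'i', after 'm', before 'd', and after 'r'.
        System.out.println(boundaryPositions("ipsum dolor"));      // prints [0, 5, 6, 11]
        // \b does not consume the space, so \w+ cannot start right after it.
        System.out.println(found("ipsum\\b\\w+", "ipsum dolor"));    // prints false
        // The space has to be matched explicitly.
        System.out.println(found("ipsum\\b\\s\\w+", "ipsum dolor")); // prints true
    }
}
```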