
Regex to return the word before the match

Tags: c#, regex

I've been trying to extract the word before the match. For example, I have the following sentence:

"Allatoona was a town located in extreme southeastern Bartow County, Georgia."

I want to extract the word before "County", which in this sentence is "Bartow".

I've tried the following regex to extract that word:

\w\sCounty,

What I get back is "w County", when what I want is just the word "Bartow".

Any assistance would be greatly appreciated. Thanks!

Andy Evans asked Jun 18 '17 04:06




2 Answers

You can use this regex with a lookahead to find word before County:

\w+(?=\s+County)

(?=\s+County) is a positive lookahead: it asserts that one or more whitespace characters followed by the word County come right after the current match, without including them in the match.


If you want to avoid lookahead then you can use a capture group:

(\w+)\s+County

and extract captured group #1 from match result.
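As a rough illustration, the two patterns above could be used from C# as follows. This is only a sketch: the class name WordBeforeCounty and its two helper methods are made up for this example, and the input is the sentence from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class WordBeforeCounty
{
    // Lookahead version: the match itself is only the word,
    // because the lookahead does not consume " County".
    public static string WithLookahead(string input)
    {
        Match m = Regex.Match(input, @"\w+(?=\s+County)");
        return m.Value; // empty string when nothing matches
    }

    // Capture-group version: the whole match includes " County",
    // so the word is read from group #1 instead.
    public static string WithGroup(string input)
    {
        Match m = Regex.Match(input, @"(\w+)\s+County");
        return m.Groups[1].Value; // empty string when nothing matches
    }

    public static void Main()
    {
        string s = "Allatoona was a town located in extreme southeastern Bartow County, Georgia.";
        Console.WriteLine(WithLookahead(s)); // Bartow
        Console.WriteLine(WithGroup(s));     // Bartow
    }
}
```

Both approaches return the same word here; the lookahead keeps the match value clean, while the capture group also works in regex flavors that lack lookaround.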

anubhava answered Oct 05 '22 09:10


Your \w\sCounty, regex returns w County because \w matches a single character that is either a letter, digit, or _. It does not match a whole word.

To match one or more characters, add the + quantifier; to capture the part you want to extract, wrap it in a capturing group, (...).

So you can fix your pattern simply by replacing \w with (\w+) and then, after getting a match, reading Match.Groups[1].Value.
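To make the difference concrete, here is a small sketch that runs the pattern from the question next to the corrected one on the same sentence. The class and method names (SingleCharVersusWord, OriginalPattern, FixedPattern) are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class SingleCharVersusWord
{
    private const string Sentence =
        "Allatoona was a town located in extreme southeastern Bartow County, Georgia.";

    // Pattern from the question: \w matches exactly one word character,
    // so only the last letter of the preceding word ends up in the match.
    public static string OriginalPattern(string input)
    {
        return Regex.Match(input, @"\w\sCounty,").Value;
    }

    // Corrected pattern: \w+ matches the whole word and the parentheses
    // capture it as group #1.
    public static string FixedPattern(string input)
    {
        Match m = Regex.Match(input, @"(\w+)\sCounty,");
        return m.Groups[1].Value;
    }

    public static void Main()
    {
        Console.WriteLine(OriginalPattern(Sentence)); // w County,
        Console.WriteLine(FixedPattern(Sentence));    // Bartow
    }
}
```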

However, if the county name contains a non-word symbol, like a hyphen, \w+ won't match it. A \S+ matching 1 or more non-whitespace symbols might turn out a better option in that case.

See a C# demo:

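using System;
using System.Text.RegularExpressions;

var s = "Allatoona was a town located in extreme southeastern Bartow County, Georgia.";
// Capture the run of non-whitespace characters right before "County".
var m = Regex.Match(s, @"(\S+)\s+County");
if (m.Success)
{
    Console.WriteLine(m.Groups[1].Value); // prints "Bartow"
}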
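The hyphen caveat can be checked with a county name that actually contains one. The sketch below is an assumption-laden illustration: the sample sentence about Miami-Dade County and the names HyphenatedCounty and Capture are not from the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class comparing \w+ and \S+ on a hyphenated name.
public static class HyphenatedCounty
{
    // Returns group #1 of the given pattern, or an empty string if there is no match.
    public static string Capture(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }

    public static void Main()
    {
        // Sample input chosen for this sketch.
        string s = "Hialeah is a city in Miami-Dade County, Florida.";

        // \w does not match '-', so only the part after the hyphen is captured.
        Console.WriteLine(Capture(s, @"(\w+)\s+County")); // Dade

        // \S matches any non-whitespace character, hyphen included.
        Console.WriteLine(Capture(s, @"(\S+)\s+County")); // Miami-Dade
    }
}
```

Note that \S+ also picks up any punctuation attached to the word (quotes, parentheses), so which pattern is the better fit depends on the input text.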


Wiktor Stribiżew answered Oct 05 '22 08:10