How to properly match word separators in C# without matching additional characters

Apologies for the newb question, but C# isn't my first language.

I am attempting to build an index list of all the separators between words, in a given piece of content, accounting for punctuation. I was hoping to use Regex \b (word 'boundary') but it's matching on all sorts of stuff I wasn't expecting. Here's the method I wrote:

internal static IList<int> GetBreakIndexesInContent(string content)
{
    IList<int> indices = new List<int>();
    if (content != null) 
    {
        foreach (Match match in Regex.Matches(content, @"\b"))
        {
            Console.WriteLine("INDEX:[" + match.Index + "]   CHAR:[" + content[match.Index] + "]   UNICODE:[" + (int)content[match.Index] + "]");
            indices.Add(match.Index);
        }
    }
    return indices;
}

Given the following 100-character string:

"Lorem ipsum dolor sit amet, tritani quaestio suscipiantur mea ea, duo et impedit facilisi evertitur."

I am expecting my method to produce a list of 14 elements, where the first index is position 5, the second position 11, and so on (ignoring the commas at positions 26 and 64, and the period at 99). Instead, this is the output I am getting:

//COUNT: [30]
INDEX:[0]   CHAR:[L]   UNICODE:[76]
INDEX:[5]   CHAR:[ ]   UNICODE:[32]
INDEX:[6]   CHAR:[i]   UNICODE:[105]
INDEX:[11]   CHAR:[ ]   UNICODE:[32]
INDEX:[12]   CHAR:[d]   UNICODE:[100]
INDEX:[17]   CHAR:[ ]   UNICODE:[32]
INDEX:[18]   CHAR:[s]   UNICODE:[115]
INDEX:[21]   CHAR:[ ]   UNICODE:[32]
INDEX:[22]   CHAR:[a]   UNICODE:[97]
INDEX:[26]   CHAR:[,]   UNICODE:[44]
INDEX:[28]   CHAR:[t]   UNICODE:[116]
INDEX:[35]   CHAR:[ ]   UNICODE:[32]
INDEX:[36]   CHAR:[q]   UNICODE:[113]
INDEX:[44]   CHAR:[ ]   UNICODE:[32]
INDEX:[45]   CHAR:[s]   UNICODE:[115]
INDEX:[57]   CHAR:[ ]   UNICODE:[32]
INDEX:[58]   CHAR:[m]   UNICODE:[109]
INDEX:[61]   CHAR:[ ]   UNICODE:[32]
INDEX:[62]   CHAR:[e]   UNICODE:[101]
INDEX:[64]   CHAR:[,]   UNICODE:[44]
INDEX:[66]   CHAR:[d]   UNICODE:[100]
INDEX:[69]   CHAR:[ ]   UNICODE:[32]
INDEX:[70]   CHAR:[e]   UNICODE:[101]
INDEX:[72]   CHAR:[ ]   UNICODE:[32]
INDEX:[73]   CHAR:[i]   UNICODE:[105]
INDEX:[80]   CHAR:[ ]   UNICODE:[32]
INDEX:[81]   CHAR:[f]   UNICODE:[102]
INDEX:[89]   CHAR:[ ]   UNICODE:[32]
INDEX:[90]   CHAR:[e]   UNICODE:[101]
INDEX:[99]   CHAR:[.]   UNICODE:[46]

The reason I am not simply matching on " ", or filtering for ASCII 32 afterwards, is that this needs to be sensitive to foreign languages that don't necessarily use whitespace between all words. I also don't want to unintentionally capture multiple spaces as individual "separators".

I was really hoping \b would be a nice standard catch-all for true word separation, but it seems to not be the case. I could "roll my own", but I was hoping I could spare myself the trouble of re-inventing the wheel, if C# already has some sort of facility for handling this problem.

Any help would be appreciated, of course.

Thanks, Greg.

asked Dec 04 '25 13:12 by Greg Gauthier


2 Answers

If the definition of a word character in regular expressions (\w) meets your needs (for which, read on), you can match non-word characters (i.e., the interstitial stuff between words) by using its inverse character class, \W. The solution could be as simple as

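// Needs: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;

// Runs of word characters (\w): the words themselves.
private static readonly Regex rxWord = new Regex( @"\w+" ) ;
private static IEnumerable<string> ParseWords( string s )
{
  return rxWord.Matches(s).Cast<Match>().Select( m => m.Value ) ;
}

// Runs of non-word characters (\W): the stuff between the words.
private static readonly Regex rxNonWord = new Regex( @"\W+" ) ;
private static IEnumerable<string> ParseNonWords( string s )
{
  return rxNonWord.Matches(s).Cast<Match>().Select( m => m.Value ) ;
}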

But from what you say you're trying to do, it might be easier to compose your own character class of word separators from the Unicode categories that the CLR supports.
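As a rough sketch of that idea, the class below builds a word class directly from Unicode categories and reports where each run of non-word characters between two words begins. The names SeparatorFinder and GetSeparatorIndexes, and the choice of categories (letters, nonspacing marks, decimal digits), are assumptions made for illustration; adjust them to whatever "word" means in your domain.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SeparatorFinder
{
    // Assumed definition of a "word" character: any letter, nonspacing mark, or decimal digit.
    private const string Word = @"[\p{L}\p{Mn}\p{Nd}]";
    private const string NonWord = @"[^\p{L}\p{Mn}\p{Nd}]";

    // One or more non-word characters with a word character on each side.
    private static readonly Regex rxSeparator =
        new Regex("(?<=" + Word + ")" + NonWord + "+(?=" + Word + ")");

    // Returns the start index of every separator run that sits between two words.
    public static IList<int> GetSeparatorIndexes(string content)
    {
        if (content == null) return new List<int>();
        return rxSeparator.Matches(content).Cast<Match>().Select(m => m.Index).ToList();
    }
}
```

On the sample sentence from the question this should yield 14 indexes, beginning 5, 11, 17, 21, 26. A comma followed by a space counts as a single separator that starts at the comma, and the trailing period is skipped because no word follows it. It does nothing for languages that write words with no separator at all; a character class cannot find a boundary that is not marked in the text.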

Further, using regular expression "word" and "non-word" classes (\w and \W) and the boundary between them (\b) probably won't work, since in regex-speak, a "word" is not necessarily what you think it is. The character class \w started out life as the set of characters allowed in C-language identifiers ([A-Za-z0-9_]). Very useful if you're a C programmer using regular expressions to grep through source code for symbols. Not so good for rummaging through arbitrary text for words.

The current definition of \w in CLR regular expressions is that it matches any character contained in any of these Unicode categories:

  • Ll (letter, lower-case)
  • Lu (letter, upper-case)
  • Lt (letter, title-case)
  • Lo (letter, other)
  • Lm (letter, modifier)
  • Mn (mark, nonspacing)
  • Nd (number, decimal digit)
  • Pc (punctuation, connector). This category includes 10 characters. The one most commonly encountered, at least in English, is _ (0x005F), a.k.a. underscore or LOW LINE.

All of which is to say that \w is the lazy way of writing [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Mn}\p{Nd}\p{Pc}].

The non-word character class \W is the inverse of this. It is the exact equivalent of [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Mn}\p{Nd}\p{Pc}].

The zero-width anchor \b doesn't "match" anything: like its sisters ^ and $, \b anchors the match to a particular place. In the case of \b, that place is the boundary between a word (\w) and a non-word (\W) character. \b has a cousin, \B that matches the inverse: it anchors the match at the boundary between two word (\w) or two non-word (\W) characters.
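To see the two anchors side by side, here is a small sketch that lists every position at which a zero-width pattern matches. BoundaryDemo and Positions are made-up names for this example; only Regex.Matches and Match.Index come from the framework.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Returns every position in the input at which the (zero-width) pattern matches.
    public static int[] Positions(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Index).ToArray();
    }
}

// For the six-character input "ab, cd":
//   BoundaryDemo.Positions("ab, cd", @"\b")  ->  0, 2, 4, 6
//   BoundaryDemo.Positions("ab, cd", @"\B")  ->  1, 3, 5
```

Between them the two anchors cover all seven positions of the string exactly once. Positions 0 and 6 count as boundaries because the start and end of the string are treated as non-word for this purpose.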

So...

You first need to come up with a definition of "word" that fits your problem domain. This is harder than it seems: for instance, is "twenty-three" one or two words? How about "ex-wife"? Or how about a compound word like "abstract expressionism", something that, depending on context, is either one or two words (you'll find "abstract", "expressionism" and "abstract expressionism" as individual entries in the dictionary).

If you can define a character class that meets that definition, all is well and good. To match the interstitial stuff between your words, all you have to do is define its inverse character class.
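For instance, if you decided that hyphens and apostrophes belong inside words, the pair of classes might look like the sketch below. The class CustomWords and the particular set of characters are assumptions chosen to illustrate the "define a class, then invert it" step, not a recommended definition of a word.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class CustomWords
{
    // Assumed word class: letters, nonspacing marks, decimal digits, apostrophes and hyphens.
    private static readonly Regex rxWord = new Regex(@"[\p{L}\p{Mn}\p{Nd}'\-]+");

    // Its inverse: runs of every character outside that class.
    private static readonly Regex rxSeparator = new Regex(@"[^\p{L}\p{Mn}\p{Nd}'\-]+");

    public static string[] Words(string s)
    {
        return rxWord.Matches(s).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static string[] Separators(string s)
    {
        return rxSeparator.Matches(s).Cast<Match>().Select(m => m.Value).ToArray();
    }
}
```

With this class, "My ex-wife owes me twenty-three dollars." splits into six words (My, ex-wife, owes, me, twenty-three, dollars), where \w+ would give eight. The trade-off is that a free-standing hyphen, as in "a - b", would now be reported as a word, which is the kind of case that pushes you toward lookaround assertions.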

If a simple character class won't do, you'll need to use various look-ahead/look-behind assertions to match what you want.

answered Dec 07 '25 07:12 by Nicholas Carey


I didn't mean to type such a long comment. I guess I might as well move it to an answer.

\b matches all boundaries between word and non-word characters, i.e. between \w and \W, including between the beginning of the string and your first letter, between letters and spaces (on both sides of the spaces), and so on.

You may need to combine your expression with lookaround assertions to achieve what you want.

For example,

\b(?<=[a-zA-Z])

uses a positive lookbehind assertion to ensure you match only the word boundaries that follow a letter. However, this would consider spaces delimiters, which I'm not sure you want to do, in which case,

\b(?<=[a-zA-Z])(?!\s)

adds an additional condition: a negative lookahead assertion that ensures you match only the word boundaries not followed by a whitespace character.
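Here is a quick way to check what the two patterns select. The wrapper class LookaroundDemo and its method names are invented for this example; the patterns are the ones given above.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Word boundaries that come directly after an ASCII letter, i.e. the end of each word.
    public static int[] AfterLetter(string s)
    {
        return Regex.Matches(s, @"\b(?<=[a-zA-Z])").Cast<Match>().Select(m => m.Index).ToArray();
    }

    // The same boundaries, minus those immediately followed by whitespace.
    public static int[] AfterLetterNotBeforeSpace(string s)
    {
        return Regex.Matches(s, @"\b(?<=[a-zA-Z])(?!\s)").Cast<Match>().Select(m => m.Index).ToArray();
    }
}
```

Run against the sample sentence from the question, AfterLetter should return 15 positions (one at the end of each word, starting 5, 11, 17, 21, 26), and AfterLetterNotBeforeSpace should return only 26, 64 and 99, the positions of the two commas and the final period.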

answered Dec 07 '25 09:12 by slackwing