
How do I split a phrase into words using Regex in C#?

Tags: c#, regex

I am trying to split a sentence/phrase into words using Regex.

var phrase = "This isn't a test.";
var words = Regex.Split(phrase, @"\W+").ToList();

words contains "This", "isn", "t", "a", "test"

Obviously it's treating the apostrophe as a separator and splitting on it. Can I change this behavior? It also needs to be multilingual, supporting a variety of languages (Spanish, French, Russian, Korean, etc.).

I need to pass the words into a spellchecker, specifically NHunspell.

return (from word in words let correct = _engine[langId].Spell(word) where !correct select word).ToList();
Dean asked Apr 20 '12 02:04




4 Answers

If you want to split into words for spell checking purposes, this is a good solution:

new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*")

Pass this regex to Regex.Split. It uses Unicode category syntax, so it works for many languages (though not for most Asian languages), and it won't break words on apostrophes or hyphens.
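A minimal sketch of how this pattern might be wired up, assuming the goal is a list of words for a spellchecker. The names WordSplitter, SplitWords and TrimNonLetters are hypothetical, not from the answer. The trimming step is an addition: the pattern only matches around a Unicode separator, so punctuation at the very start or end of the input (such as the final period in "test.") would otherwise stay attached to the word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordSplitter
{
    // Pattern from the answer: a Unicode separator (\p{Z}) together with
    // any non-letter characters directly before or after it.
    private static readonly Regex Separator =
        new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*");

    // Hypothetical helper: split, trim stray punctuation, drop empty tokens.
    public static List<string> SplitWords(string phrase)
    {
        return Separator.Split(phrase)
            .Select(TrimNonLetters)
            .Where(word => word.Length > 0)
            .ToList();
    }

    // Removes leading and trailing characters that are not letters,
    // leaving inner apostrophes and hyphens untouched.
    private static string TrimNonLetters(string token)
    {
        int start = 0;
        int end = token.Length;
        while (start < end && !char.IsLetter(token[start])) start++;
        while (end > start && !char.IsLetter(token[end - 1])) end--;
        return token.Substring(start, end - start);
    }
}
```

With the phrase from the question, WordSplitter.SplitWords("This isn't a test.") yields "This", "isn't", "a", "test".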

Fran Casadome answered Nov 10 '22 23:11


Because many languages use complex rules to string words together into phrases and sentences, you can't rely on a simple regular expression to extract every word from a piece of text. Even for a language as 'simple' as English you'll run into a number of corner cases, such as:

  • How to handle contractions like you're and isn't, where two words are combined and some characters are replaced with an apostrophe.
  • How to handle abbreviations such as Mr., Mrs. and i.e.
  • Compound words joined with '-'.
  • hyphenated words at the end of a sentence.
  • Names like O'Brian and O'Connel.
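To make these corner cases concrete, here is a small sketch that applies the naive \W+ split from the question to a phrase containing an abbreviation, a name with an apostrophe, a contraction and a hyphenated word. The class and method names are hypothetical and only serve the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveSplitDemo
{
    // Splits on every run of non-word characters, exactly like the
    // Regex.Split(phrase, @"\W+") call in the question.
    public static string[] SplitOnNonWordChars(string text)
    {
        return Regex.Split(text, @"\W+");
    }
}
```

NaiveSplitDemo.SplitOnNonWordChars("Mr. O'Brian isn't well-known") returns "Mr", "O", "Brian", "isn", "t", "well", "known": the abbreviation loses its period, and the name, the contraction and the compound word are each cut in two.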

Chinese and Japanese (among others) are notoriously hard to parse this way, as these languages do not put spaces between words.

You might want to read up on text segmentation. If segmentation is important to you, invest in a spell checker that can parse a whole text, or in a text segmentation engine that can split your sentences into words according to the rules of the language.

I couldn't find a .NET-based multilingual segmentation engine with a quick Google search, though. Sorry.

jessehouwing answered Nov 11 '22 00:11


Use Split().

words = phrase.Split(' ');

To split on punctuation as well:

words = phrase.Split(new char[] {' ', ',', '.', ':', ';', '!', '?', '\t'});
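One thing to watch for with a list of delimiter characters: adjacent delimiters (such as a comma followed by a space) and a delimiter at the end of the phrase produce empty strings in the result. A possible variation, sketched here with the hypothetical names DelimiterSplitter and SplitWords, passes StringSplitOptions.RemoveEmptyEntries to drop them.

```csharp
using System;

public static class DelimiterSplitter
{
    // Same delimiter set as in the answer above.
    private static readonly char[] Delimiters =
        { ' ', ',', '.', ':', ';', '!', '?', '\t' };

    // Splits on any delimiter and discards the empty entries that
    // adjacent or trailing delimiters would otherwise leave behind.
    public static string[] SplitWords(string phrase)
    {
        return phrase.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

DelimiterSplitter.SplitWords("This isn't a test.") returns "This", "isn't", "a", "test". The apostrophe is not in the delimiter list, so contractions stay intact.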
Jack answered Nov 11 '22 00:11


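What do you want to split on? Spaces? Punctuation? You have to decide what the stop characters are. A simple regex that treats whitespace and a few punctuation characters as stop characters would be "[^.?!\s]+". Note that this pattern matches the words themselves rather than the separators, so use it with Regex.Matches instead of Regex.Split: each match is a run of characters that contains no period, question mark, exclamation mark, or whitespace.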
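A minimal sketch of using that pattern, assuming the words are collected with Regex.Matches. The names StopCharTokenizer and ExtractWords are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopCharTokenizer
{
    // Each match is a maximal run of characters that are not
    // '.', '?', '!' or whitespace.
    private static readonly Regex Word = new Regex(@"[^.?!\s]+");

    public static List<string> ExtractWords(string phrase)
    {
        // Cast<Match>() keeps this working on older frameworks where
        // MatchCollection only implements the non-generic IEnumerable.
        return Word.Matches(phrase)
            .Cast<Match>()
            .Select(match => match.Value)
            .ToList();
    }
}
```

StopCharTokenizer.ExtractWords("This isn't a test.") returns "This", "isn't", "a", "test". Other punctuation, such as commas, is not in the stop set and would stay attached to the neighboring word unless it is added to the character class.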

Jim Mischel answered Nov 10 '22 22:11