How do I split a phrase into words using Regex in C#?

Tags:

regex

I am trying to split a sentence/phrase into words using Regex.

var phrase = "This isn't a test.";
var words = Regex.Split(phrase, @"\W+").ToList();

words contains "This", "isn", "t", "a", "test"

Obviously it's picking up the apostrophe and splitting on that. Can I change this behavior? It also needs to be multilingual, supporting a variety of languages (Spanish, French, Russian, Korean, etc.).

I need to pass the words to a spellchecker, specifically NHunspell.

return (from word in words let correct = _engine[langId].Spell(word) where !correct select word).ToList();

asked Apr 20 '12 02:04

Dean

4 Answers

If you want to split into words for spell checking purposes, this is a good solution:

new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*")

Basically, you can pass this regex to Regex.Split. It uses Unicode character classes, so it works in many languages (though not most Asian ones). And it won't break words with apostrophes or hyphens.
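As a rough sketch of how this pattern might be used with Regex.Split: the class name PhraseSplitter and the helper TrimPunctuation below are hypothetical names, not part of the answer. The trimming step is an added assumption, because the pattern only consumes punctuation that sits next to a separator, so a period at the very end of the input would otherwise stay attached to the last word. Note also that \p{Z} covers Unicode space and line separators but not tab or newline characters.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhraseSplitter
{
    // Pattern from the answer: one Unicode separator (\p{Z}) together with
    // any non-letter characters on either side of it.
    private static readonly Regex Separator =
        new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*");

    public static string[] SplitWords(string phrase)
    {
        return Separator.Split(phrase)
            .Select(TrimPunctuation)           // assumption: edge punctuation is not part of a word
            .Where(word => word.Length > 0)    // drop empty entries
            .ToArray();
    }

    // Hypothetical helper: strips punctuation from both ends of a token,
    // leaving inner apostrophes and hyphens untouched.
    private static string TrimPunctuation(string word)
    {
        int start = 0;
        int end = word.Length;
        while (start < end && char.IsPunctuation(word[start])) start++;
        while (end > start && char.IsPunctuation(word[end - 1])) end--;
        return word.Substring(start, end - start);
    }
}
```

With the phrase from the question, PhraseSplitter.SplitWords("This isn't a test.") should return "This", "isn't", "a", "test".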

answered Nov 10 '22 23:11

Fran Casadome

Because many languages use complex rules to string words together into phrases and sentences, you can't rely on a simple regular expression to get all the words from a piece of text. Even for a language as 'simple' as English, you'll run into a number of corner cases, such as:

How to handle contractions like you're and isn't, where two words are combined and some of the characters are replaced with an apostrophe.
How to handle abbreviations such as Mr., Mrs., and i.e.
How to handle compound words joined with '-'.
How to handle hyphenated words at the end of a sentence.
How to handle names like O'Brian and O'Connel.

Chinese and Japanese (among others) are notoriously hard to parse this way, as these languages do not use spaces between words, only between sentences.
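To illustrate the problem, here is a minimal sketch that applies the pattern from the question to text without spaces. The class name UnsegmentedTextDemo and the sample sentence are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnsegmentedTextDemo
{
    // Splits on runs of non-word characters, exactly as in the question.
    public static string[] SplitOnNonWordChars(string text)
    {
        return Regex.Split(text, @"\W+");
    }
}
```

Calling UnsegmentedTextDemo.SplitOnNonWordChars("これはテストです") returns a single element containing the whole sentence, because every character is a letter and there is nothing for \W+ to match. A spell checker would then receive the sentence as one "word".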

You might want to read up on text segmentation and, if segmentation is important to you, invest in a spell checker that can parse a whole text, or in a text segmentation engine that can split your sentences into words according to the rules of the language.

I couldn't find a .NET-based multilingual segmentation engine with a quick Google search, though. Sorry.

answered Nov 11 '22 00:11

jessehouwing

Use Split().

words = phrase.Split(' ');

Without punctuation.

words = phrase.Split(new char[] {' ', ',', '.', ':', ';', '!', '?', '\t'});
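One thing to watch for with this approach: adjacent separators (such as a period at the end of the phrase, or a comma followed by a space) produce empty strings in the result. A possible variation, sketched here with the hypothetical class name SimpleSplitter, passes StringSplitOptions.RemoveEmptyEntries to drop them.

```csharp
using System;

public static class SimpleSplitter
{
    // Same separator list as above; the apostrophe is deliberately absent,
    // so contractions such as "isn't" stay in one piece.
    private static readonly char[] Separators =
        { ' ', ',', '.', ':', ';', '!', '?', '\t' };

    public static string[] SplitWords(string phrase)
    {
        // RemoveEmptyEntries discards the empty strings produced by
        // consecutive or trailing separators.
        return phrase.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For the phrase from the question, SimpleSplitter.SplitWords("This isn't a test.") should give "This", "isn't", "a", "test". The separator list is ASCII-only, so punctuation from other scripts would need to be added by hand.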

answered Nov 11 '22 00:11

Jack

What do you want to split on? Spaces? Punctuation? You have to decide what the stop characters are. A simple regex that treats whitespace and a few punctuation characters as stops would be "[^.?!\s]+". Used with Regex.Matches, it matches each run of characters between periods, question marks, exclamation marks, and whitespace. The equivalent pattern for Regex.Split is "[.?!\s]+".
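Since "[^.?!\s]+" describes the characters to keep rather than the characters to split on, one way to try it is with Regex.Matches. This is only a sketch, and the class name StopCharTokenizer is a hypothetical one.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopCharTokenizer
{
    public static string[] Tokenize(string text)
    {
        // Each match is a maximal run of characters that are not
        // a period, question mark, exclamation mark, or whitespace.
        return Regex.Matches(text, @"[^.?!\s]+")
            .Cast<Match>()
            .Select(match => match.Value)
            .ToArray();
    }
}
```

StopCharTokenizer.Tokenize("This isn't a test.") should return "This", "isn't", "a", "test". Other punctuation, such as commas, is not in the stop set and would stay attached to the neighboring word unless you add it to the character class.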

answered Nov 10 '22 22:11

Jim Mischel
