How to split text into words?
Example text:
'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'
The words in that line are:
A string can be split into substrings using the split(param) method. This method is part of the string object. The parameter is optional, but you can split on a specific string or character. Given a sentence, the string can be split into words.
Text Splitting is a simple but powerful way to extract fragments of your data when Working with dimension rule results. It is suitable for 'delimited' text data, such as comma-delimited values (CSV), URLs, and other data where several text fragments are separated by a character or string (delimiter) [see below].
Split text on whitespace, then trim punctuation.
var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"; var punctuation = text.Where(Char.IsPunctuation).Distinct().ToArray(); var words = text.Split().Select(x => x.Trim(punctuation));
Agrees exactly with example.
First, Remove all special characeters:
var fixedInput = Regex.Replace(input, "[^a-zA-Z0-9% ._]", string.Empty); // This regex doesn't support apostrophe so the extension method is better
Then split it:
var split = fixedInput.Split(' ');
For a simpler C# solution for removing special characters (that you can easily change), add this extension method (I added a support for an apostrophe):
public static string RemoveSpecialCharacters(this string str) { var sb = new StringBuilder(); foreach (char c in str) { if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '\'' || c == ' ') { sb.Append(c); } } return sb.ToString(); }
Then use it like so:
var words = input.RemoveSpecialCharacters().Split(' ');
You'll be surprised to know that this extension method is very efficient (surely much more efficient then the Regex) so I'll suggest you use it ;)
Update
I agree that this is an English only approach but to make it Unicode compatible all you have to do is replace:
(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
With:
char.IsLetter(c)
Which supports Unicode, .Net Also offers you char.IsSymbol
and char.IsLetterOrDigit
for the variety of cases
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With