
How to split text into words?

Tags:

c#

.net


Example text:

'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'

The words in that line are:

  1. Oh
  2. you
  3. can't
  4. help
  5. that
  6. said
  7. the
  8. Cat
  9. we're
  10. all
  11. mad
  12. here
  13. I'm
  14. mad
  15. You're
  16. mad
Colonel Panic asked May 24 '13 00:05




2 Answers

Split text on whitespace, then trim punctuation.

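var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
// Every distinct punctuation character that occurs in the text.
var punctuation = text.Where(Char.IsPunctuation).Distinct().ToArray();
// Split on whitespace, then trim punctuation from both ends of each token.
var words = text.Split().Select(x => x.Trim(punctuation));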

The result agrees exactly with the example.
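As a minimal, self-contained sketch of the split-then-trim idea above: the class name `WordSplitter` and the method `SplitIntoWords` are hypothetical names chosen for illustration, not part of the answer. It also assumes you want to skip empty tokens, which the one-liner does not do when the text contains consecutive spaces.

```csharp
using System;
using System.Linq;

public static class WordSplitter
{
    // Hypothetical helper: splits on whitespace, then trims punctuation
    // from both ends of every token.
    public static string[] SplitIntoWords(string text)
    {
        // Every distinct punctuation character that occurs in the text.
        char[] punctuation = text.Where(c => char.IsPunctuation(c)).Distinct().ToArray();

        return text
            // A null separator array means "split on whitespace".
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            // Trim only touches the ends, so the apostrophe in "can't" survives.
            .Select(word => word.Trim(punctuation))
            // Drop tokens that consisted of punctuation only.
            .Where(word => word.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
        foreach (var word in SplitIntoWords(text))
        {
            Console.WriteLine(word);
        }
    }
}
```

Because the apostrophe counts as punctuation for `char.IsPunctuation`, the quotation marks around the speech are trimmed together with the commas, colons and full stops.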

Colonel Panic answered Sep 23 '22 05:09


First, remove all special characters:

// This regex doesn't support apostrophes, so the extension method below is better.
var fixedInput = Regex.Replace(input, "[^a-zA-Z0-9% ._]", string.Empty);

Then split it:

var split = fixedInput.Split(' '); 

For a simpler C# solution for removing special characters (one that you can easily change), add this extension method (I added support for the apostrophe):

public static string RemoveSpecialCharacters(this string str) {
   var sb = new StringBuilder();
   foreach (char c in str) {
      if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '\'' || c == ' ') {
         sb.Append(c);
      }
   }
   return sb.ToString();
}

Then use it like so:

var words = input.RemoveSpecialCharacters().Split(' '); 

You may be surprised to learn that this extension method is very efficient (surely much more efficient than the Regex), so I suggest you use it ;)
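One thing to watch for: because the extension method keeps every apostrophe, the single quotation marks around the speech survive the filter, so the split produces tokens such as 'Oh and mad'. The sketch below repeats the extension method and adds a hypothetical `ToWords` helper (not part of the answer) that trims apostrophes from the ends of each token and skips empty entries. Treat it as one possible adjustment, assuming the goal is to match the word list in the question.

```csharp
using System;
using System.Linq;
using System.Text;

public static class StringExtensions
{
    // Same filter as the extension method above: keeps ASCII letters,
    // digits, apostrophes and spaces, and drops everything else.
    public static string RemoveSpecialCharacters(this string str)
    {
        var sb = new StringBuilder();
        foreach (char c in str)
        {
            if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
                (c >= 'a' && c <= 'z') || c == '\'' || c == ' ')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Hypothetical helper: filter, split on spaces, then trim the
    // apostrophes that were really quotation marks.
    public static string[] ToWords(this string input)
    {
        return input.RemoveSpecialCharacters()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(word => word.Trim('\''))
            .Where(word => word.Length > 0)
            .ToArray();
    }
}

public static class Program
{
    public static void Main()
    {
        var input = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
        Console.WriteLine(string.Join(", ", input.ToWords()));
    }
}
```

Trimming only the ends leaves the apostrophes inside can't, we're, I'm and You're untouched.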

Update

I agree that this is an English-only approach, but to make it Unicode-compatible, all you have to do is replace:

(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') 

With:

char.IsLetter(c) 

char.IsLetter supports Unicode. .NET also offers char.IsSymbol and char.IsLetterOrDigit for other cases.
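A sketch of the Unicode-aware variant described in the update, under a hypothetical name (`RemoveSpecialCharactersUnicode`) so it can sit next to the original. It uses `char.IsLetterOrDigit`, which is slightly broader than the original check because it also accepts non-ASCII decimal digits. Note that these methods look at one UTF-16 `char` at a time, so characters outside the Basic Multilingual Plane (stored as surrogate pairs) and standalone combining marks are dropped by this filter.

```csharp
using System;
using System.Text;

public static class UnicodeStringExtensions
{
    // Hypothetical variant of the answer's method: keeps any Unicode
    // letter or decimal digit, plus apostrophes and spaces.
    public static string RemoveSpecialCharactersUnicode(this string str)
    {
        var sb = new StringBuilder();
        foreach (char c in str)
        {
            if (char.IsLetterOrDigit(c) || c == '\'' || c == ' ')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}

public static class UnicodeDemo
{
    public static void Main()
    {
        // \u00C7 is Ç and \u00E9 is é; escapes avoid source-encoding issues.
        var input = "\u00C7a va, l'\u00E9t\u00E9!";
        Console.WriteLine(input.RemoveSpecialCharactersUnicode());
    }
}
```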

Adam Tal answered Sep 22 '22 05:09