Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split sentence into words but having trouble with the punctuations in C#

I have seen a few similar questions but I am trying to achieve this.

Given a string, str="The moon is our natural satellite, i.e. it rotates around the Earth!" I want to extract the words and store them in an array. The expected array elements would be this.

the 
moon 
is 
our 
natural 
satellite 
i.e. 
it  
rotates 
around 
the 
earth

I tried using String.split( ','\t','\r') but this does not work correctly. I also tried removing the ., and other punctuation marks but I would want a string like "i.e." to be parsed out too. What is the best way to achieve this? I also tried using regex.split to no avail.

string[] words = Regex.Split(line, @"\W+");

Would surely appreciate some nudges in the right direction.

like image 815
Richard N Avatar asked Sep 05 '11 18:09

Richard N


People also ask

How do you split a string in words and punctuation?

findall() method to split a string into words and punctuation, e.g. result = re. findall(r"[\w'\"]+|[,.!?] ", my_str) . The findall() method will split the string on whitespace characters and punctuation and will return a list of the matches.

How do you separate words in regex?

To split a string by a regular expression, pass a regex as a parameter to the split() method, e.g. str. split(/[,. \s]/) . The split method takes a string or regular expression and splits the string based on the provided separator, into an array of substrings.

What is b regex?

The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length. There are three different positions that qualify as word boundaries: Before the first character in the string, if the first character is a word character.

How do you split a string without punctuation in Python?

Use re. sub() , and str. split() to convert a string to a list of words without punctuation.


2 Answers

A regex solution.

(\b[^\s]+\b)

And if you really want to fix that last . on i.e. you could use this.

((\b[^\s]+\b)((?<=\.\w).)?)

Here's the code I'm using.

  var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
  var matches = Regex.Matches(input, @"((\b[^\s]+\b)((?<=\.\w).)?)");

  foreach(var match in matches)
  {
     Console.WriteLine(match);
  }

Results:

The
moon
is
our
natural
satellite
i.e.
it
rotates
around
the
Earth
like image 94
TheCodeKing Avatar answered Oct 06 '22 00:10

TheCodeKing


I suspect the solution you're looking for is much more complex than you think. You're looking for some form of actual language analysis, or at a minimum a dictionary, so that you can determine whether a period is part of a word or ends a sentence. Have you considered the fact that it may do both?

Consider adding a dictionary of allowed "words that contain punctuation." This may be the simplest way to solve your problem.

like image 35
Greg D Avatar answered Oct 05 '22 23:10

Greg D