I have seen a few similar questions but I am trying to achieve this.
Given a string, str="The moon is our natural satellite, i.e. it rotates around the Earth!" I want to extract the words and store them in an array. The expected array elements would be this.
the
moon
is
our
natural
satellite
i.e.
it
rotates
around
the
earth
I tried using String.split( ','\t','\r') but this does not work correctly. I also tried removing the ., and other punctuation marks but I would want a string like "i.e." to be parsed out too. What is the best way to achieve this? I also tried using regex.split to no avail.
string[] words = Regex.Split(line, @"\W+");
Would surely appreciate some nudges in the right direction.
findall() method to split a string into words and punctuation, e.g. result = re. findall(r"[\w'\"]+|[,.!?] ", my_str) . The findall() method will split the string on whitespace characters and punctuation and will return a list of the matches.
To split a string by a regular expression, pass a regex as a parameter to the split() method, e.g. str. split(/[,. \s]/) . The split method takes a string or regular expression and splits the string based on the provided separator, into an array of substrings.
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length. There are three different positions that qualify as word boundaries: Before the first character in the string, if the first character is a word character.
Use re. sub() , and str. split() to convert a string to a list of words without punctuation.
A regex solution.
(\b[^\s]+\b)
And if you really want to fix that last .
on i.e.
you could use this.
((\b[^\s]+\b)((?<=\.\w).)?)
Here's the code I'm using.
var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
var matches = Regex.Matches(input, @"((\b[^\s]+\b)((?<=\.\w).)?)");
foreach(var match in matches)
{
Console.WriteLine(match);
}
Results:
The moon is our natural satellite i.e. it rotates around the Earth
I suspect the solution you're looking for is much more complex than you think. You're looking for some form of actual language analysis, or at a minimum a dictionary, so that you can determine whether a period is part of a word or ends a sentence. Have you considered the fact that it may do both?
Consider adding a dictionary of allowed "words that contain punctuation." This may be the simplest way to solve your problem.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With