How to get all words of a string in C#?

Tags:

string

c#

I have a paragraph in a single string and I'd like to get all the words in that paragraph.

My problem is that I don't want the punctuation marks that words end with, such as (',', '.', '\'', '"', ';', ':', '!', '?'), or whitespace characters such as \n and \t.

I also don't want suffixes such as 's and 'm; for example, world's should return only world.

For example, given the input: he said. "My dog's bone, toy, are missing!"

the list should be: he said my dog bone toy are missing

Joseph Lafuente asked Feb 11 '11 15:02




1 Answer

Expanding on Shan's answer, I would consider something like this as a starting point:

MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");

Why include the ' character? Because this will prevent words like "we're" from being split into two words. After capturing it, you can manually strip out the suffix yourself (whereas otherwise, you couldn't recognize that re is not a word and ignore it).
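To make the difference concrete, here is a small sketch that runs both patterns over the same input. The class name ApostropheDemo and the helper Tokens are made up for this illustration and are not part of the answer's code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class ApostropheDemo
{
    // Hypothetical helper: returns every non-empty match of the pattern.
    public static string[] Tokens(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .Where(v => v.Length > 0)
                    .ToArray();
    }

    public static void Main()
    {
        // Without the apostrophe, the contraction is split in two.
        Console.WriteLine(string.Join("|", Tokens("we're here", @"\b\w+\b")));      // we|re|here
        // With it, "we're" stays in one piece, so its suffix can be trimmed later.
        Console.WriteLine(string.Join("|", Tokens("we're here", @"\b[\w']*\b")));   // we're|here
    }
}
```

In the first case nothing marks "re" as a leftover suffix rather than a real word; in the second, the apostrophe is still there to cut at.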

So:

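using System.Linq;
using System.Text.RegularExpressions;

// Returns the words in input, with apostrophe suffixes such as 's or 're removed.
static string[] GetWords(string input)
{
    MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");

    // The pattern can also match empty strings at word boundaries, so skip those.
    var words = from m in matches.Cast<Match>()
                where !string.IsNullOrEmpty(m.Value)
                select TrimSuffix(m.Value);

    return words.ToArray();
}

// Cuts the word off at its first apostrophe, if it has one.
static string TrimSuffix(string word)
{
    int apostropheLocation = word.IndexOf('\'');
    if (apostropheLocation != -1)
    {
        word = word.Substring(0, apostropheLocation);
    }

    return word;
}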

Example input:

he said. "My dog's bone, toy, are missing!" What're you doing tonight, by the way?

Example output:

[he, said, My, dog, bone, toy, are, missing, What, you, doing, tonight, by, the, way]

One limitation of this approach is that it will not handle acronyms well; e.g., "Y.M.C.A." would be treated as four words. I think that could also be handled by including . as a character to match in a word and then stripping it out if it's a full stop afterwards (i.e., by checking that it's the only period in the word as well as the last character).
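A rough sketch of that idea follows; I haven't tested it beyond simple sentences. The names AcronymAwareWords and TrimFullStop are made up for the illustration. One assumption of mine that the paragraph above doesn't spell out: the trailing \b has to be dropped from the pattern, because otherwise a final period is never captured and there is nothing to inspect afterwards.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class AcronymAwareWords
{
    // Hypothetical variant of GetWords: '.' may appear inside a word, and the
    // pattern has no trailing \b so that a final period is captured too.
    public static string[] GetWords(string input)
    {
        return Regex.Matches(input, @"\b[\w'.]+")
                    .Cast<Match>()
                    .Select(m => TrimFullStop(TrimSuffix(m.Value)))
                    .Where(w => w.Length > 0)
                    .ToArray();
    }

    // Same behavior as the TrimSuffix above: cut at the first apostrophe.
    static string TrimSuffix(string word)
    {
        int apostropheLocation = word.IndexOf('\'');
        return apostropheLocation == -1 ? word : word.Substring(0, apostropheLocation);
    }

    // Treats a period as a full stop only when it is the sole period in the
    // word and also its last character; "Y.M.C.A." is therefore left alone.
    static string TrimFullStop(string word)
    {
        int firstPeriod = word.IndexOf('.');
        bool isFullStop = firstPeriod != -1 && firstPeriod == word.Length - 1;
        return isFullStop ? word.Substring(0, firstPeriod) : word;
    }
}
```

With this, "The Y.M.C.A. is open. Come in!" should give The, Y.M.C.A., is, open, Come, in. It still has gaps: an ellipsis ("wait...") keeps its dots, and two sentences run together without a space ("end.Next") come back as one word.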

Dan Tao answered Oct 24 '22 04:10
