Split text into sentences even when Mr. or Mrs. appears in the text

Tags: c#, split

I have a problem: I want to split a text into sentences using the full stop (.).

For instance:

Mr. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.

If I split the above text, I get 3 sentences:

1. Mr.

2. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character.

3. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.


I want Mr. to stay attached to the sentence that follows it, so that the text splits into two sentences, not three:

1. Mr. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character.

2. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.

Kindly help me. I appreciate the instant feedback from the community.

Thanks.

Rais Hussain asked Dec 14 '25 12:12


1 Answer

If you are looking for a way to avoid splitting sentences after an abbreviation (like a.m.), that's a difficult natural language problem.

If you just want to split sentences without worrying about Mr. or Mrs., and you have a character that is unlikely to appear in the text (such as *), here's a simple way:

  1. Replace all instances of Mr. and Mrs. with Mr* and Mrs*.
  2. Split the text on the full stop (the code below splits on a full stop followed by a space).
  3. In the resulting array, replace all instances of Mr* and Mrs* with Mr. and Mrs.

Here's a version that uses NUL as a sentinel character, as it's pretty much impossible for it to show up in text unintentionally:

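using System;
using System.Collections.Generic;
using System.Linq;

static IEnumerable<string> Splitter(string sentences)
{
    // NUL stands in for the period in "Mr." and "Mrs." while splitting.
    char sentinel = '\0';
    return sentences.Replace("Mr.", "Mr" + sentinel)
        .Replace("Mrs.", "Mrs" + sentinel)
        // Split on ". " (period plus space); the separator itself is dropped.
        .Split(new[] { ". " }, StringSplitOptions.None)
        // Restore the periods in each resulting sentence.
        .Select(s => s.Replace("Mr" + sentinel, "Mr.")
                        .Replace("Mrs" + sentinel, "Mrs."));
}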

If you're the paranoid sort of person who thinks any particular character is liable to show up in your text, feel free to use a GUID for the sentinel.
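As a rough sketch of the GUID idea, something like the following could work. The class name SentenceSplitter, the method name Split, and the Titles array are made up for illustration; the array is only a starting point and would need extending for other abbreviations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SentenceSplitter
{
    // Hypothetical list of titles to protect; extend it for your own text.
    private static readonly string[] Titles = { "Mr.", "Mrs." };

    public static List<string> Split(string text)
    {
        // A fresh GUID in "N" format (32 hex digits, no punctuation)
        // stands in for the period of each protected title.
        string sentinel = Guid.NewGuid().ToString("N");

        foreach (string title in Titles)
        {
            // "Mr." becomes "Mr" + sentinel, "Mrs." becomes "Mrs" + sentinel.
            text = text.Replace(title, title.TrimEnd('.') + sentinel);
        }

        return text
            // Split on period plus space; the separator is not kept.
            .Split(new[] { ". " }, StringSplitOptions.RemoveEmptyEntries)
            // Put the protected periods back.
            .Select(s => s.Replace(sentinel, "."))
            .ToList();
    }
}
```

Calling SentenceSplitter.Split on the text from the question should give two items, the first starting with "Mr. Bean". Note that, as in the NUL version, the ". " separator is consumed, so every sentence except the last loses its final full stop.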

Gabe answered Dec 17 '25 17:12