
Trim too long words from sentences in C#?

Tags:

c#

I have C# strings that contain sentences. Sometimes these sentences are fine, and sometimes they are just user-generated random characters. What I would like to do is trim overly long words inside these sentences. For example, given the following string:

var stringWithLongWords = "Here's a text with tooooooooooooo long words";

I would like to run this through a filter:

var trimmed = TrimLongWords(stringWithLongWords, 6);

And to get an output where every word can contain only up to 6 characters:

"Here's a text with tooooo long words"

Any ideas how this could be done with good performance? Is there anything in .NET which could handle this automatically?

I'm currently using the following code:

    private static string TrimLongWords(string original, int maxCount)
    {
        return string.Join(" ", original.Split(' ').Select(x => x.Substring(0, x.Length > maxCount ? maxCount : x.Length)));
    }

This works in theory, but it produces bad output if a long word ends with a separator other than a space. For example, with maxCount = 10:

This is sweeeeeeeeeeeeeeeet! And something more.

Ends up looking like this:

This is sweeeeeeee And something more.

Update:

OK, the comments were so good that I realized this may have too many "what ifs". Perhaps it would be better to ignore the separators. Instead, if a word gets trimmed, it could be shown with three dots. Here are some examples with words trimmed to a maximum of 5 characters:

Apocalypse now! -> Apoca... now!

Apocalypse! -> Apoca...

!Example! -> !Exam...

This is sweeeeeeeeeeeeeeeet! And something more. -> This is sweee... And somet... more.
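Read literally, these examples treat every space-separated run of characters as a word, punctuation included. A minimal sketch of that reading follows; the class name TokenTrimmer is an assumption made for illustration, and splitting on a single space character is an assumption carried over from the code above.

```csharp
using System;
using System.Linq;

public static class TokenTrimmer
{
    // Hypothetical helper: cuts every space-separated token that is longer
    // than maxCount down to maxCount characters and appends "...".
    // Punctuation counts toward the token length, as in the examples above.
    public static string TrimLongWords(string original, int maxCount)
    {
        return string.Join(" ", original.Split(' ').Select(word =>
            word.Length > maxCount
                ? word.Substring(0, maxCount) + "..."
                : word));
    }
}
```

Calling TokenTrimmer.TrimLongWords("Apocalypse now!", 5) gives "Apoca... now!", and "!Example!" becomes "!Exam...".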

Mikael Koskinen asked Jul 11 '13 11:07


5 Answers

EDIT: Since the requirements changed I'll stay in spirit with regular expressions:

Regex.Replace(original, string.Format(@"(\p{{L}}{{{0}}})\p{{L}}+", maxLength), "$1...");

Output with maxLength = 6:

Here's a text with tooooo... long words
This is sweeee...! And someth... more.
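The doubled braces in the format string are string.Format escapes, so for maxLength = 6 the pattern expands to (\p{L}{6})\p{L}+. As a rough sketch, the one-liner could be wrapped like this; the class name LetterRunTrimmer is an assumption for illustration only.

```csharp
using System.Text.RegularExpressions;

public static class LetterRunTrimmer
{
    // Hypothetical wrapper around the Regex.Replace call shown above.
    public static string TrimLongWords(string original, int maxLength)
    {
        // "{{" and "}}" are literal braces for string.Format; "{0}" is the
        // placeholder. For maxLength = 6 this yields (\p{L}{6})\p{L}+
        string pattern = string.Format(@"(\p{{L}}{{{0}}})\p{{L}}+", maxLength);

        // Keep the first maxLength letters (group 1) and replace the
        // remaining letters of the run with three dots.
        return Regex.Replace(original, pattern, "$1...");
    }
}
```

Only runs of letters are shortened, so punctuation that follows a trimmed word (the "!" after "sweeee...") stays in place.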

Old answer below, because I liked the approach, even though it's a little ... messy :-).


I hacked together a little regex replacement to do that. It's in PowerShell for now (for prototyping; I'll convert to C# afterwards):

'Here''s a text with tooooooooooooo long words','This is sweeeeeeeeeeeeeeeet! And something more.' |
  % {
    [Regex]::Replace($_, '(\w*?)(\w)\2{2,}(\w*)',
      {
        $m = $args[0]
        if ($m.Value.Length -gt 6) {
          $l = 6 - $m.Groups[1].Length - $m.Groups[3].Length
          $m.Groups[1].Value + $m.Groups[2].Value * $l + $m.Groups[3].Value
        } else {
          $m.Value  # return matches that already fit unchanged
        }
      })
  }

Output is:

Here's a text with tooooo long words
This is sweeet! And something more.

This finds runs of characters (\w for now; that should be changed to something more sensible) that follow the pattern (something)(a character repeated three or more times)(something else). For the replacement it uses a function that checks whether the match is longer than the desired maximum length. If so, it calculates how long the repeated part can be while still fitting in the total length, and cuts only the repeated part down to that length.

It's messy. It will fail to truncate words that are otherwise very long (e.g. »something« in the second test sentence) and the set of characters that constitute words needs to be changed as well. Consider this maybe a starting point if you want to go that route, but not a finished solution.

C# Code:

public static string TrimLongWords(this string original, int maxCount)
{
    return Regex.Replace(original, @"(\w*?)(\w)\2{2,}(\w*)",
        delegate(Match m) {
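            // Groups[0] is the whole match; the captures start at index 1.
            var first = m.Groups[1].Value;
            var rep = m.Groups[2].Value;
            var last = m.Groups[3].Value;
            if (m.Value.Length > maxCount) {
                // Clamp at zero so a long prefix/suffix cannot give a negative count.
                var l = Math.Max(0, maxCount - first.Length - last.Length);
                return first + new string(rep[0], l) + last;
            }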
            return m.Value;
        });
}

A nicer option for the character class would probably be something like \p{L}, depending on your needs.
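To illustrate the difference: in .NET, \w also matches digits and the underscore, while \p{L} matches letters only. The helper below is a hypothetical sketch written just for this comparison, not part of the answer's code.

```csharp
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // Hypothetical helper: true when the input consists solely of
    // characters matched by the given character class.
    public static bool IsAll(string input, string characterClass)
    {
        return Regex.IsMatch(input, "^" + characterClass + "+$");
    }
}
```

With this helper, IsAll("word_2", @"\w") is true but IsAll("word_2", @"\p{L}") is false, while a purely alphabetic string such as "Überraschung" passes both checks.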

Joey answered Oct 23 '22 17:10


I'd recommend using a StringBuilder together with loops:

public string TrimLongWords(string input, int maxWordLength)
{
    StringBuilder sb = new StringBuilder(input.Length);
    int currentWordLength = 0;
    bool stopTripleDot = false;
    foreach (char c in input)
    {
        bool isLetter = char.IsLetter(c);
        if (currentWordLength < maxWordLength || !isLetter)
        {
            sb.Append(c);
            stopTripleDot = false;
            if (isLetter)
                currentWordLength++;
            else
                currentWordLength = 0;
        }
        else if (!stopTripleDot)
        {
            sb.Append("...");
            stopTripleDot = true;
        }
    }
    return sb.ToString();
}

This should be faster than the Regex or LINQ approaches, since it makes a single pass over the input.
Expected results for maxWordLength == 6:

"UltraLongWord"           -> "UltraL..."
"This-is-not-a-long-word" -> "This-is-not-a-long-word"

And the edge-case maxWordLength == 0 would result in:

"Please don't trim me!!!" -> "... ...'... ... ...!!!" // poor, poor string...

[This answer has been updated to accommodate the "..." as requested in the question]

(I just realised that replacing the trimmed substrings with "..." has introduced quite a few bugs, and fixing them has rendered my code a bit bulky, sorry)

Nolonar answered Oct 23 '22 17:10


Try this:

private static string TrimLongWords(string original, int maxCount)
{
   return string.Join(" ", 
   original.Split(' ')
   .Select(x => { 
     var r = Regex.Replace(x, @"\W", ""); 
     return r.Substring(0, r.Length > maxCount ? maxCount : r.Length) + Regex.Replace(x, @"\w", ""); 
   }));
}

Then TrimLongWords("This is sweeeeeeeeeeeeeeeet! And something more.", 5) becomes "This is sweee! And somet more."

dav_i answered Oct 23 '22 17:10


You could use regex to find those repetitions:


string test = "This is sweeeeeeeeeeeeeeeet! And sooooooomething more.";
string result = Regex.Replace(test, @"(\w)\1+", delegate(Match match)
{
    string v = match.ToString();
    return v[0].ToString();
});

The result would be:


This is swet! And something more.

And maybe you could check the manipulated words with a spellchecker service: http://wiki.webspellchecker.net/doku.php?id=installationandconfiguration:web_service

cansik answered Oct 23 '22 18:10


Try this:

class Program
{
    static void Main(string[] args)
    {
        var stringWithLongWords = "Here's a text with tooooooooooooo long words";
        var trimmed = TrimLongWords(stringWithLongWords, 6);
    }

    private static string TrimLongWords(string stringWithLongWords, int p)
    {
        // Match only words longer than p characters, then keep their first p characters.
        return Regex.Replace(stringWithLongWords, String.Format(@"\w{{{0},}}", p + 1), m =>
        {
            return m.Value.Substring(0, p) + "...";
        });
    }
}

Alex Filipovici answered Oct 23 '22 18:10