Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LINQ - Splitting up a string with maximum length, but not chopping words apart

Tags:

c#

linq

I have a simple LINQ Extension Method...

    public static IEnumerable<string> SplitOnLength(this string input, int length)
    {
        int index = 0;
        while (index < input.Length)
        {
            if (index + length < input.Length)
                yield return input.Substring(index, length);
            else
                yield return input.Substring(index);

            index += length;
        }
    }

This takes a string, and it chops it up into a collection of strings that do not exceed the given length.

This works well - however I'd like to go further. It chops words in half. I don't need it to understand anything complicated, I just want it to be able to chop a string off 'early' if cutting it at the length would be cutting in the middle of text (basically anything that isn't whitespace).

However I suck at LINQ, so I was wondering if anyone had an idea on how to go about this. I know what I am trying to do, but I'm not sure how to approach it.

So let's say I have the following text.

This is a sample block of text that I would pass through the string splitter.

I call this method SplitOnLength(6) I would get the following.

  • This i
  • s a sa
  • mple b
  • lock o
  • f text
  • that I
  • would
  • pass t
  • hrough
  • the s
  • tring
  • splitt
  • er.

I would rather it be smart enough to stop and look more like ..

  • This
  • is a
  • sample

// bad example, since the single word exceeds maximum length, but the length would be larger numbers in real scenarios, closer to 200.

Can anyone help me?

like image 525
Ciel Avatar asked Dec 29 '10 17:12

Ciel


3 Answers

Update 2

I noticed in Saeed's other answer that he skipped my suggestion since it threw an exception. This is an interesting topic. Whenever we as developers work on a solution to a problem, we need to consider the exceptional cases—the unexpected inputs, the invalid states, etc. I felt that since the requirement of the question was:

I just want it to be able to chop a string off 'early' if cutting it at the length would be cutting in the middle of text (basically anything that isn't whitespace).

...that, in the event that this is impossible (i.e., in order to return a substring of the specified length it is necessary to cut a word in half because the word is too long), it would be appropriate to throw an exception. But obviously this is a subjective matter. However, realizing that there is more than one way to skin this cat, I have updated my solution (both on pastebin and below) to take a WordPolicy enum rather than a simple bool. This enum has three values: None (equivalent to false from before), ThrowIfTooLong (equivalent to true from before), and CutIfTooLong (if the word must be cut, just cut it).

I also added benchmarks of some of the other answers**, included below. Note that this time around I tested multiple runs with different length parameters (5, 10, 25, 200, 500, 1000). For the shortest length (5), the results seem somewhat even, with Saeed's suggestion on top. As length grows larger, the performance of Saeed's suggestion grows worse and worse. Simon's and Jay's suggestions appear to be significantly more scalable in the case of large inputs.

Also keep in mind that the OP has explicitly said that the value for length would be "closer to 200" in a realistic scenario; so using 200 as an input is not contrived. It is in fact a realistic case.

New Benchmarks

Enter a split size: 5

Results from longer input:
Saeed (original): 2.8886 ms
Saeed: 3.2685 ms
Dan: 3.3163 ms
Simon: 7.4182 ms
Jay: 36.7348 ms

Results from shorter input:
Saeed (original): 0.031 ms
Saeed: 0.0357 ms
Dan: 0.0514 ms
Simon: 0.1047 ms
Jay: 0.2885 ms

Go again? y

Enter a split size: 10

Results from longer input:
Dan: 1.8798 ms
Saeed: 3.8205 ms
Saeed (original): 4.0899 ms
Simon: 4.9869 ms
Jay: 14.9627 ms

Results from shorter input:
Dan: 0.022 ms
Saeed (original): 0.0396 ms
Saeed: 0.0466 ms
Simon: 0.0483 ms
Jay: 0.2022 ms

Go again? y

Enter a split size: 25

Results from longer input:
Dan: 0.6713 ms
Simon: 2.7506 ms
Saeed: 4.5075 ms
Saeed (original): 4.7827 ms
Jay: 6.3477 ms

Results from shorter input:
Dan: 0.0131 ms
Simon: 0.0301 ms
Saeed (original): 0.0441 ms
Saeed: 0.0488 ms
Jay: 0.1176 ms

Go again? y

Enter a split size: 200

Results from longer input:
Dan: 0.1952 ms
Jay: 1.5764 ms
Simon: 1.8175 ms
Saeed (original): 6.8025 ms
Saeed: 7.5221 ms

Results from shorter input:
Dan: 0.0092 ms
Simon: 0.0206 ms
Saeed: 0.0581 ms
Saeed (original): 0.0586 ms
Jay: 0.0812 ms

Go again? y

Enter a split size: 500

Results from longer input:
Dan: 0.1463 ms
Jay: 1.2923 ms
Simon: 1.7326 ms
Saeed (original): 8.686 ms
Saeed: 9.1726 ms

Results from shorter input:
Dan: 0.0061 ms
Simon: 0.0192 ms
Saeed (original): 0.0664 ms
Saeed: 0.0748 ms
Jay: 0.0832 ms

Go again? y

Enter a split size: 1000

Results from longer input:
Dan: 0.1293 ms
Jay: 1.1683 ms
Simon: 1.7292 ms
Saeed: 11.7121 ms
Saeed (original): 11.8752 ms

Results from shorter input:
Dan: 0.0058 ms
Simon: 0.0187 ms
Saeed (original): 0.0765 ms
Jay: 0.0801 ms
Saeed: 0.084 ms

Go again? n

**Unfortunately, with the "large" input (the short story), I couldn't test Jay's original approach which was incredibly costly—an unbelievably deep call stack due to the recursion, plus an insane number of very large string allocations due to the string.Substring call on a huge string.


Update

I hope this doesn't come across as defensive, but I think there is some very misleading information being presented amidst the comments and some other answers here. In particular Saeed's answer, the accepted answer, has some definite efficiency shortcomings. This wouldn't be such a big deal except that Saeed has claimed in some comments that other answers (including this one) are less efficient.

Comparison of performance

First of all, numbers don't lie. Here is a sample program I wrote to test the method below as well as Saeed's method against example inputs, long and short.

And here are some sample results:

Results from longer input:
SplitOnLength: 0.8073 ms
SaeedsOriginalApproach: 4.724 ms
SaeedsApproach: 4.9095 ms

Results from shorter input:
SplitOnLength: 0.0156 ms
SaeedsOriginalApproach: 0.0522 ms
SaeedsApproach: 0.046 ms

SplitOnLength represents my method below, SaeedsOriginalApproach represents the first suggestion in Saeed's answer, and SaeedsApproach represents Saeed's updated answer using deferred execution. The test inputs were:

  • The entire text of Franz Kafka's 'In the Penal Colony' (a short story—long input)
  • An excerpt from the text of the OP's question (short input)
  • A length parameter of 25 (to accommodate some longer words)

Notice that SplitOnLength executed in a fraction of the time of the methods suggested in Saeed's answer. I will concede, however, that the length parameter is likely to have an effect on this. With a smaller length, I wouldn't be surprised to see the performance of SaeedsOriginalApproach and SaeedsApproach improve significantly*.

Explanation

Now, just a few remarks on what I think is going on here. Right off the bat, Saeed's answer starts with a call to string.Split. Now here's something interesting Microsoft has to say about this method:

The Split methods allocate memory for the returned array object and a String object for each array element. If your application requires optimal performance or if managing memory allocation is critical in your application, consider using the IndexOf or IndexOfAny method, and optionally the Compare method, to locate a substring within a string. [emphasis mine]

In other words, the string.Split method is not endorsed by Microsoft as an appropriate way to achieve optimal performance. The reason I highlight the last line is that Saeed seems to be specifically questioning the efficiency of answers that use string.Substring. But this is the most efficient way to solve this problem, period.

Then inside the method suggested by Saeed, we have code that looks like this:

ret2[index] += ' ' + item;

*This is the reason I believe that the performance of Saeed's approach degrades with higher and higher length parameters. As those of us who've been bitten by the performance of string concatenation before know, this is allocating a new string object on every append, which is wasteful. The problem only becomes worse as length gets longer. Notice the SplitOnLength method below does not suffer from this problem.

Another way of looking at this is simply by breaking down the different steps that need to take place. Below, let N be the length of the input string and K be the specified maximum length of a substring.

Saeed's answer

  1. First, string.Split(' '). This requires enumerating over the entire string once, and the allocation of a string object for every word in the string—which is likely to be more than we need (we're almost certainly going to concatenate some of them)—as well as an array to contain these objects. Let the number of strings in the array now be X.
  2. Then there is a second enumeration over the X results of #1. During this enumeration, there are 0–X string concatenations using +=, allocating between 0–X new string objects.

SplitOnLength (below)

  1. The string is likely never enumerated fully. There are at most N comparisons performed, but in a more realistic case the number is less than N. This is because on each iteration the algorithm optimistically goes straight to the next "best" index and only steps backwards if/as necessary to find the most recent whitespace.
  2. The only new string objects allocated are exactly those that are needed, using string.Substring.

string.Substring, by the way, is very fast. It can be because there is no "extra" work to do; the exact length of the substring is known in advance, so there is no waste created during allocation (as would happen with, e.g., the += operator).

Again, the reason I am adding this behemoth update to my answer is to point out the performance issues with Saeed's suggestion and to dispute his claims about the alleged efficiency problems in some of the other answers.


Original Answer

My instinctive approach was to start with what you had and augment it slightly, with the addition of a little bit of code that would:

  1. Check if index + length is followed by a whitespace; and, if not:
  2. Step backwards into the string until finding a whitespace.

Here's the modified SplitOnLength method:

enum WordPolicy
{
    None,
    ThrowIfTooLong,
    CutIfTooLong
}

public static IEnumerable<string> SplitOnLength(this string input, int length, WordPolicy wordPolicy)
{
    int index = 0;
    while (index < input.Length)
    {
        int stepsBackward = 0;

        if (index + length < input.Length)
        {
            if (wordPolicy != WordPolicy.None)
            {
                yield return GetBiggestAllowableSubstring(input, index, length, wordPolicy, out stepsBackward);
            }
            else
            {
                yield return input.Substring(index, length);
            }
        }
        else
        {
            yield return input.Substring(index);
        }

        index += (length - stepsBackward);
    }
}

static string GetBiggestAllowableSubstring(string input, int index, int length, WordPolicy wordPolicy, out int stepsBackward)
{
    stepsBackward = 0;

    int lastIndex = index + length - 1;

    if (!char.IsWhiteSpace(input[lastIndex + 1]))
    {
        int adjustedLastIndex = input.LastIndexOf(' ', lastIndex, length);
        stepsBackward = lastIndex - adjustedLastIndex;
        lastIndex = adjustedLastIndex;
    }

    if (lastIndex == -1)
    {
        if (wordPolicy == WordPolicy.ThrowIfTooLong)
        {
            throw new ArgumentOutOfRangeException("The input string contains at least one word greater in length than the specified length.");
        }
        else
        {
            stepsBackward = 0;
            lastIndex = index + length - 1;
        }
    }

    return input.Substring(index, lastIndex - index + 1);
}

This approach has the advantage of not doing more work than necessary; note the absence of any string.Split call and the fact that string.Substring is only called in the places where the result is actually being returned. In other words, no superfluous string objects are created using this method—only the ones you actually want.

like image 170
Dan Tao Avatar answered Nov 05 '22 05:11

Dan Tao


I'll solve this by for loop:

var ret1 = str.Split(' ');
var ret2 = new List<string>();
ret2.Add("");
int index = 0;
foreach (var item in ret1)
{
    if (item.Length + 1 + ret2[index].Length <= allowedLength)
    {
        ret2[index] += ' ' + item;
        if (ret2[index].Length >= allowedLength)
        {
            ret2.Add("");
            index++;
        }
    }
    else
    {
        ret2.Add(item);
        index++;
    }
}
return ret2;

first I thought about a Zip, but it's not good here.

and differed execution version with yield:

public static IEnumerable<string> SaeedsApproach(this string str, int allowedLength)
{
    var ret1 = str.Split(' ');
    string current = "";
    foreach (var item in ret1)
    {
        if (item.Length + 1 + current.Length <= allowedLength)
        {
            current += ' ' + item;
            if (current.Length >= allowedLength)
            {
                yield return current;
                current = "";
            }
        }
        else
        {
            yield return current;
            current = "";
        }
    }
}
like image 2
Saeed Amiri Avatar answered Nov 05 '22 07:11

Saeed Amiri


Try using String.Split(' ') to turn the single string into an array of individual words. Then, iterate through them, building the longest string that (with the spaces re-added) is less than the limit, append a newline and yield.

like image 2
KeithS Avatar answered Nov 05 '22 06:11

KeithS