I have a simple LINQ Extension Method...
public static IEnumerable<string> SplitOnLength(this string input, int length)
{
int index = 0;
while (index < input.Length)
{
if (index + length < input.Length)
yield return input.Substring(index, length);
else
yield return input.Substring(index);
index += length;
}
}
This takes a string, and it chops it up into a collection of strings that do not exceed the given length.
This works well - however I'd like to go further. It chops words in half. I don't need it to understand anything complicated, I just want it to be able to chop a string off 'early' if cutting it at the length
would be cutting in the middle of text (basically anything that isn't whitespace).
However I suck at LINQ, so I was wondering if anyone had an idea on how to go about this. I know what I am trying to do, but I'm not sure how to approach it.
So let's say I have the following text.
This is a sample block of text that I would pass through the string splitter.
I call this method SplitOnLength(6)
I would get the following.
I would rather it be smart enough to stop and look more like ..
// bad example, since the single word exceeds maximum length, but the length would be larger numbers in real scenarios, closer to 200.
Can anyone help me?
I noticed in Saeed's other answer that he skipped my suggestion since it threw an exception. This is an interesting topic. Whenever we as developers work on a solution to a problem, we need to consider the exceptional cases—the unexpected inputs, the invalid states, etc. I felt that since the requirement of the question was:
I just want it to be able to chop a string off 'early' if cutting it at the length would be cutting in the middle of text (basically anything that isn't whitespace).
...that, in the event that this is impossible (i.e., in order to return a substring of the specified length it is necessary to cut a word in half because the word is too long), it would be appropriate to throw an exception. But obviously this is a subjective matter. However, realizing that there is more than one way to skin this cat, I have updated my solution (both on pastebin and below) to take a WordPolicy
enum rather than a simple bool
. This enum has three values: None
(equivalent to false
from before), ThrowIfTooLong
(equivalent to true
from before), and CutIfTooLong
(if the word must be cut, just cut it).
I also added benchmarks of some of the other answers**, included below. Note that this time around I tested multiple runs with different length
parameters (5, 10, 25, 200, 500, 1000). For the shortest length
(5), the results seem somewhat even, with Saeed's suggestion on top. As length
grows larger, the performance of Saeed's suggestion grows worse and worse. Simon's and Jay's suggestions appear to be significantly more scalable in the case of large inputs.
Also keep in mind that the OP has explicitly said that the value for length
would be "closer to 200" in a realistic scenario; so using 200 as an input is not contrived. It is in fact a realistic case.
Enter a split size: 5 Results from longer input: Saeed (original): 2.8886 ms Saeed: 3.2685 ms Dan: 3.3163 ms Simon: 7.4182 ms Jay: 36.7348 ms Results from shorter input: Saeed (original): 0.031 ms Saeed: 0.0357 ms Dan: 0.0514 ms Simon: 0.1047 ms Jay: 0.2885 ms Go again? y Enter a split size: 10 Results from longer input: Dan: 1.8798 ms Saeed: 3.8205 ms Saeed (original): 4.0899 ms Simon: 4.9869 ms Jay: 14.9627 ms Results from shorter input: Dan: 0.022 ms Saeed (original): 0.0396 ms Saeed: 0.0466 ms Simon: 0.0483 ms Jay: 0.2022 ms Go again? y Enter a split size: 25 Results from longer input: Dan: 0.6713 ms Simon: 2.7506 ms Saeed: 4.5075 ms Saeed (original): 4.7827 ms Jay: 6.3477 ms Results from shorter input: Dan: 0.0131 ms Simon: 0.0301 ms Saeed (original): 0.0441 ms Saeed: 0.0488 ms Jay: 0.1176 ms Go again? y Enter a split size: 200 Results from longer input: Dan: 0.1952 ms Jay: 1.5764 ms Simon: 1.8175 ms Saeed (original): 6.8025 ms Saeed: 7.5221 ms Results from shorter input: Dan: 0.0092 ms Simon: 0.0206 ms Saeed: 0.0581 ms Saeed (original): 0.0586 ms Jay: 0.0812 ms Go again? y Enter a split size: 500 Results from longer input: Dan: 0.1463 ms Jay: 1.2923 ms Simon: 1.7326 ms Saeed (original): 8.686 ms Saeed: 9.1726 ms Results from shorter input: Dan: 0.0061 ms Simon: 0.0192 ms Saeed (original): 0.0664 ms Saeed: 0.0748 ms Jay: 0.0832 ms Go again? y Enter a split size: 1000 Results from longer input: Dan: 0.1293 ms Jay: 1.1683 ms Simon: 1.7292 ms Saeed: 11.7121 ms Saeed (original): 11.8752 ms Results from shorter input: Dan: 0.0058 ms Simon: 0.0187 ms Saeed (original): 0.0765 ms Jay: 0.0801 ms Saeed: 0.084 ms Go again? n
**Unfortunately, with the "large" input (the short story), I couldn't test Jay's original approach which was incredibly costly—an unbelievably deep call stack due to the recursion, plus an insane number of very large string allocations due to the string.Substring
call on a huge string.
I hope this doesn't come across as defensive, but I think there is some very misleading information being presented amidst the comments and some other answers here. In particular Saeed's answer, the accepted answer, has some definite efficiency shortcomings. This wouldn't be such a big deal except that Saeed has claimed in some comments that other answers (including this one) are less efficient.
First of all, numbers don't lie. Here is a sample program I wrote to test the method below as well as Saeed's method against example inputs, long and short.
And here are some sample results:
Results from longer input: SplitOnLength: 0.8073 ms SaeedsOriginalApproach: 4.724 ms SaeedsApproach: 4.9095 ms Results from shorter input: SplitOnLength: 0.0156 ms SaeedsOriginalApproach: 0.0522 ms SaeedsApproach: 0.046 ms
SplitOnLength represents my method below, SaeedsOriginalApproach represents the first suggestion in Saeed's answer, and SaeedsApproach represents Saeed's updated answer using deferred execution. The test inputs were:
length
parameter of 25 (to accommodate some longer words)Notice that SplitOnLength executed in a fraction of the time of the methods suggested in Saeed's answer. I will concede, however, that the length
parameter is likely to have an effect on this. With a smaller length
, I wouldn't be surprised to see the performance of SaeedsOriginalApproach and SaeedsApproach improve significantly*.
Now, just a few remarks on what I think is going on here. Right off the bat, Saeed's answer starts with a call to string.Split
. Now here's something interesting Microsoft has to say about this method:
The Split methods allocate memory for the returned array object and a String object for each array element. If your application requires optimal performance or if managing memory allocation is critical in your application, consider using the IndexOf or IndexOfAny method, and optionally the Compare method, to locate a substring within a string. [emphasis mine]
In other words, the string.Split
method is not endorsed by Microsoft as an appropriate way to achieve optimal performance. The reason I highlight the last line is that Saeed seems to be specifically questioning the efficiency of answers that use string.Substring
. But this is the most efficient way to solve this problem, period.
Then inside the method suggested by Saeed, we have code that looks like this:
ret2[index] += ' ' + item;
*This is the reason I believe that the performance of Saeed's approach degrades with higher and higher length parameters. As those of us who've been bitten by the performance of string concatenation before know, this is allocating a new string object on every append, which is wasteful. The problem only becomes worse as length
gets longer. Notice the SplitOnLength
method below does not suffer from this problem.
Another way of looking at this is simply by breaking down the different steps that need to take place. Below, let N be the length of the input string and K be the specified maximum length of a substring.
string.Split(' ')
. This requires enumerating over the entire string once, and the allocation of a string object for every word in the string—which is likely to be more than we need (we're almost certainly going to concatenate some of them)—as well as an array to contain these objects. Let the number of strings in the array now be X.+=
, allocating between 0–X new string objects.SplitOnLength
(below)string.Substring
.string.Substring
, by the way, is very fast. It can be because there is no "extra" work to do; the exact length of the substring is known in advance, so there is no waste created during allocation (as would happen with, e.g., the +=
operator).
Again, the reason I am adding this behemoth update to my answer is to point out the performance issues with Saeed's suggestion and to dispute his claims about the alleged efficiency problems in some of the other answers.
My instinctive approach was to start with what you had and augment it slightly, with the addition of a little bit of code that would:
index + length
is followed by a whitespace; and, if not:Here's the modified SplitOnLength
method:
enum WordPolicy
{
None,
ThrowIfTooLong,
CutIfTooLong
}
public static IEnumerable<string> SplitOnLength(this string input, int length, WordPolicy wordPolicy)
{
int index = 0;
while (index < input.Length)
{
int stepsBackward = 0;
if (index + length < input.Length)
{
if (wordPolicy != WordPolicy.None)
{
yield return GetBiggestAllowableSubstring(input, index, length, wordPolicy, out stepsBackward);
}
else
{
yield return input.Substring(index, length);
}
}
else
{
yield return input.Substring(index);
}
index += (length - stepsBackward);
}
}
static string GetBiggestAllowableSubstring(string input, int index, int length, WordPolicy wordPolicy, out int stepsBackward)
{
stepsBackward = 0;
int lastIndex = index + length - 1;
if (!char.IsWhiteSpace(input[lastIndex + 1]))
{
int adjustedLastIndex = input.LastIndexOf(' ', lastIndex, length);
stepsBackward = lastIndex - adjustedLastIndex;
lastIndex = adjustedLastIndex;
}
if (lastIndex == -1)
{
if (wordPolicy == WordPolicy.ThrowIfTooLong)
{
throw new ArgumentOutOfRangeException("The input string contains at least one word greater in length than the specified length.");
}
else
{
stepsBackward = 0;
lastIndex = index + length - 1;
}
}
return input.Substring(index, lastIndex - index + 1);
}
This approach has the advantage of not doing more work than necessary; note the absence of any string.Split
call and the fact that string.Substring
is only called in the places where the result is actually being returned. In other words, no superfluous string objects are created using this method—only the ones you actually want.
I'll solve this by for loop:
var ret1 = str.Split(' ');
var ret2 = new List<string>();
ret2.Add("");
int index = 0;
foreach (var item in ret1)
{
if (item.Length + 1 + ret2[index].Length <= allowedLength)
{
ret2[index] += ' ' + item;
if (ret2[index].Length >= allowedLength)
{
ret2.Add("");
index++;
}
}
else
{
ret2.Add(item);
index++;
}
}
return ret2;
first I thought about a Zip, but it's not good here.
and differed execution version with yield:
public static IEnumerable<string> SaeedsApproach(this string str, int allowedLength)
{
var ret1 = str.Split(' ');
string current = "";
foreach (var item in ret1)
{
if (item.Length + 1 + current.Length <= allowedLength)
{
current += ' ' + item;
if (current.Length >= allowedLength)
{
yield return current;
current = "";
}
}
else
{
yield return current;
current = "";
}
}
}
Try using String.Split(' ')
to turn the single string into an array of individual words. Then, iterate through them, building the longest string that (with the spaces re-added) is less than the limit, append a newline and yield.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With