
Fastest way to split a huge text into smaller chunks

Tags:

substring

c#

.net

I have used the below code to split the string, but it takes a lot of time.

using (StreamReader srSegmentData = new StreamReader(fileNamePath))
{
    string strSegmentData = "";
    string line = srSegmentData.ReadToEnd();
    int startPos = 0;

    ArrayList alSegments = new ArrayList();
    while (startPos < line.Length && (line.Length - startPos) >= segmentSize)
    {
        strSegmentData = strSegmentData + line.Substring(startPos, segmentSize) + Environment.NewLine;
        alSegments.Add(line.Substring(startPos, segmentSize) + Environment.NewLine);
        startPos = startPos + segmentSize;
    }
}

Please suggest an alternative way to split the string into smaller chunks of a fixed size.

351

asked Dec 23 '15 07:12

Shankar Anumula

1 Answer

First of all, you should define what you mean by chunk size. If you mean chunks with a fixed number of UTF-16 code units, then your current algorithm may be slow but it works. If that's not what you intend and you actually mean chunks with a fixed number of characters, then it's broken. I discussed a similar issue in this Code Review post: Split a string into chunks of the same length, so I will repeat only the relevant parts here.

You're partitioning over Char, but String is UTF-16 encoded, so you may produce broken strings in at least three cases:
1. One character is encoded with more than one code unit. The Unicode code point for that character is encoded as two UTF-16 code units (a surrogate pair, as for the Han character 𠀑, U+20011), and the two code units may end up in two different slices (both strings will then be invalid).
2. One character is composed of more than one code point. You're dealing with a character made of two or more separate Unicode code points (for example a flag emoji, which is a pair of regional indicator symbols).
3. One character has combining characters or modifiers. This is more common than you may think: for example, Unicode combining characters like U+0300 COMBINING GRAVE ACCENT, used to build à, and Unicode modifiers such as U+02BC MODIFIER LETTER APOSTROPHE.
The definitions of character for a programming language and for a human being are pretty different. For example, in Slovak dž is a single character, yet it's made of two or three Unicode code points, which in this case are also two or three UTF-16 code units, so "dž".Length > 1. More about this and other cultural issues in How can I perform a Unicode aware character by character comparison?.
Ligatures exist. Assuming one ligature is one code point (and also assuming it's encoded as one code unit), you will treat it as a single glyph even though it represents two characters. What to do in this case? In general, the definition of character may be pretty vague because it has a different meaning in each discipline where the word is used. You (probably) can't handle everything correctly, but you should set some constraints and document the code's behavior.
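The gap between code units and characters can be checked directly. This is a minimal sketch; the class and method names are made up for illustration, and it only relies on string.Length and StringInfo.LengthInTextElements:

```csharp
using System;
using System.Globalization;

// Hypothetical helper, used only to compare the two ways of counting.
public static class TextLengthDemo
{
    // Number of UTF-16 code units: this is what Length and Substring work with.
    public static int CodeUnits(string s) => s.Length;

    // Number of text elements as .NET's StringInfo defines them.
    public static int TextElements(string s) => new StringInfo(s).LengthInTextElements;
}
```

For "a\u0300" (a followed by COMBINING GRAVE ACCENT) CodeUnits returns 2 and TextElements returns 1. The same holds for "\U00020011" (𠀑), so a Substring(0, 1) on either string cuts a character in half.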

One proposed (and untested) implementation may be this:

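using System;
using System.Collections.Generic;
using System.Globalization;

// Extension methods must live in a static class; the class name is arbitrary.
public static class StringExtensions
{
    // Splits value into chunks of desiredLength text elements (not UTF-16 code units).
    public static IEnumerable<string> Split(this string value, int desiredLength)
    {
        var characters = StringInfo.GetTextElementEnumerator(value);
        while (characters.MoveNext())
            yield return String.Concat(Take(characters, desiredLength));
    }

    // Yields up to count text elements, starting from the enumerator's current one.
    private static IEnumerable<string> Take(TextElementEnumerator enumerator, int count)
    {
        for (int i = 0; i < count; ++i)
        {
            yield return (string)enumerator.Current;

            // Advance only within the chunk: Split's MoveNext() steps to the next chunk,
            // so advancing after the last element here would skip one text element.
            if (i < count - 1 && !enumerator.MoveNext())
                yield break;
        }
    }
}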

It's not optimized for speed (as you can see, I tried to keep the code short and clear by using enumerations) but, for big files, it still performs better than your implementation (see the next paragraph for the reason).

About your code, note that:

You're building a huge ArrayList (?!) to hold the result. Note also that this way the ArrayList is resized multiple times, even though its final size is known from the input size and the chunk size.
strSegmentData is rebuilt on every iteration. If you need to accumulate characters, you must use StringBuilder; otherwise each concatenation allocates a new string and copies the old value (it's slow and it also adds pressure on the garbage collector).
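As a rough sketch of those two points (this is not the code from the linked posts), the loop could use a pre-sized List<string> and a StringBuilder. The names are hypothetical, the split is by UTF-16 code units, so the Unicode caveats still apply, and unlike the loop in the question it keeps a trailing chunk shorter than segmentSize:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper class; names are illustrative only.
public static class FixedSizeSplitter
{
    // Splits text into chunks of segmentSize UTF-16 code units.
    public static List<string> SplitByCodeUnits(string text, int segmentSize)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (segmentSize <= 0) throw new ArgumentOutOfRangeException(nameof(segmentSize));

        // The number of chunks is known up front, so the list never has to grow.
        var segments = new List<string>((text.Length + segmentSize - 1) / segmentSize);

        for (int start = 0; start < text.Length; start += segmentSize)
            segments.Add(text.Substring(start, Math.Min(segmentSize, text.Length - start)));

        return segments;
    }

    // Accumulates the chunks, one per line, without re-allocating a string each time.
    public static string JoinWithNewLines(IEnumerable<string> segments)
    {
        var builder = new StringBuilder();
        foreach (string segment in segments)
            builder.AppendLine(segment); // appends segment + Environment.NewLine
        return builder.ToString();
    }
}
```

Each chunk is extracted with a single Substring call, instead of the two calls per iteration in the question.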

There are faster implementations (see the linked Code Review post, especially Heslacher's implementation, for a much faster version). If you do not need to handle Unicode correctly (you're sure you only handle US-ASCII characters), there is also a pretty readable implementation from Jon Skeet (note that, after profiling your code, you may still improve its performance for big files by pre-allocating an output list of the right size). I do not repeat their code here, so please refer to the linked posts.

In your specific case you do not need to read the entire huge file into memory: you can read and parse n characters at a time (don't worry too much about disk access, I/O is buffered). It will slightly degrade performance but it will greatly reduce memory usage. Alternatively, you can read line by line (taking care to handle chunks that cross line boundaries).
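One way to read n characters at a time is TextReader.ReadBlock, which fills a buffer until it is full or the input ends. The sketch below is an assumption about how that could look, not a tested solution for your file; ChunkReader and ReadChunks are hypothetical names, and chunks are still counted in UTF-16 code units:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper class; names are illustrative only.
public static class ChunkReader
{
    // Lazily yields chunks of chunkSize chars, so only one chunk is buffered at a time.
    public static IEnumerable<string> ReadChunks(TextReader reader, int chunkSize)
    {
        var buffer = new char[chunkSize];
        int read;

        // ReadBlock returns fewer than chunkSize chars only at the end of the input.
        while ((read = reader.ReadBlock(buffer, 0, chunkSize)) > 0)
            yield return new string(buffer, 0, read);
    }
}
```

With a file you would wrap the call in using (var reader = new StreamReader(fileNamePath)) and iterate ReadChunks(reader, segmentSize) inside that block, processing or writing each chunk as it arrives instead of collecting them all.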

115

answered Sep 20 '22 12:09

Adriano Repetti
