
How to split a huge file into words?

Tags: c#, .net, file-io

How can I read a very long string from a text file and then process it (split it into words)?

I tried the StreamReader.ReadLine() method, but I get an OutOfMemoryException. Apparently, my lines are extremely long. This is my code for reading the file:

using (var streamReader = File.OpenText(_filePath))
{
    int lineNumber = 1;
    string currentString = String.Empty;
    while ((currentString = streamReader.ReadLine()) != null)
    {
        ProcessString(currentString, lineNumber);
        Console.WriteLine("Line {0}", lineNumber);
        lineNumber++;
    }
}

And this is the code that splits a line into words:

var wordPattern = @"\w+";
var matchCollection = Regex.Matches(text, wordPattern);
var words = (from Match word in matchCollection
             select word.Value.ToLowerInvariant()).ToList();
Ihor Korotenko asked Jul 06 '15 21:07



1 Answer

You could read the file character by character, building up words as you go, and use yield return to make the enumeration deferred so you never have to hold the entire file (or an entire line) in memory:

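// Requires: using System.Collections.Generic; using System.IO; using System.Text;
private static IEnumerable<string> ReadWords(string filename)
{
    using (var reader = new StreamReader(filename))
    {
        var builder = new StringBuilder();

        while (!reader.EndOfStream)
        {
            char c = (char)reader.Read();

            // Mimics the regex \w class - almost.
            if (char.IsLetterOrDigit(c) || c == '_')
            {
                builder.Append(c);
            }
            else
            {
                // A separator ends the current word, if there is one.
                if (builder.Length > 0)
                {
                    yield return builder.ToString();
                    builder.Clear();
                }
            }
        }

        // Yield the last word, but only if the file did not end on a separator;
        // otherwise an empty string would be returned.
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }
}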

The code reads the file one character at a time. When it encounters a non-word character, it yields the word built up so far, if there is one, so a run of consecutive separators does not produce empty strings. A StringBuilder accumulates the characters of the current word.
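Because the sequence is deferred, the caller can process words one at a time without ever collecting them into a list. The following is only a sketch of how that might look: the class name WordCountSketch and the CountWords helper are made up for illustration, and ReadWords takes a TextReader here (an assumption, not the signature shown above) so that the same code can be tried on an in-memory string.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WordCountSketch
{
    // Same character-by-character approach as ReadWords above, but reading
    // from any TextReader (assumption made for this sketch).
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        var builder = new StringBuilder();
        int next;

        // TextReader.Read() returns -1 at the end of the input.
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;

            if (char.IsLetterOrDigit(c) || c == '_')
            {
                builder.Append(c);
            }
            else if (builder.Length > 0)
            {
                yield return builder.ToString();
                builder.Clear();
            }
        }

        // Emit the trailing word only if there is one.
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }

    // Hypothetical helper: counts lower-cased words as they stream by.
    // Only the dictionary of distinct words is kept in memory.
    public static Dictionary<string, int> CountWords(TextReader reader)
    {
        var counts = new Dictionary<string, int>();

        foreach (var word in ReadWords(reader))
        {
            string key = word.ToLowerInvariant();
            int current;
            counts.TryGetValue(key, out current); // current stays 0 if the key is missing
            counts[key] = current + 1;
        }

        return counts;
    }
}
```

For a real file you would pass a StreamReader, for example: using (var reader = new StreamReader(_filePath)) { var counts = WordCountSketch.CountWords(reader); }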

Char.IsLetterOrDigit() accepts most of the characters that the regex word class \w matches, but \w also matches underscores (amongst others), which is why the underscore is checked separately. If your input contains more characters you wish to treat as part of a word, you'll have to alter the if() condition.
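One way to avoid editing the if() every time is to pass the "is this a word character?" test in as a delegate. This is a rough sketch rather than part of the original answer: the class name CustomWordReader, the isWordChar parameter, and the use of a TextReader instead of a file name are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CustomWordReader
{
    // Splits the input into words, where isWordChar (hypothetical parameter)
    // decides which characters belong to a word.
    public static IEnumerable<string> ReadWords(TextReader reader, Func<char, bool> isWordChar)
    {
        var builder = new StringBuilder();
        int next;

        // TextReader.Read() returns -1 at the end of the input.
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;

            if (isWordChar(c))
            {
                builder.Append(c);
            }
            else if (builder.Length > 0)
            {
                yield return builder.ToString();
                builder.Clear();
            }
        }

        // Emit the trailing word only if there is one.
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }
}
```

For example, to keep apostrophes inside words you could call CustomWordReader.ReadWords(reader, c => char.IsLetterOrDigit(c) || c == '_' || c == '\''), which would keep "don't" as a single word instead of splitting it into "don" and "t".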

CodeCaster answered Oct 25 '22 02:10
