
Performance of Writing to File C#

Overview of My Situation:

My task is to read strings from a file, reformat them into a more useful format, and write the result to an output file.

Here is an example of what has to be done. Example input line:

ANO=2010;CPF=17834368168;YEARS=2010;2009;2008;2007;2006 <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2010</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2009</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2008</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2007</ANO><SITUACAODECLARACAO>Sua declaração consta como Pedido de Regularização(PR), na base de dados da Secretaria da Receita Federal do Brasil</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2006</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>

Each line of the input file carries two important pieces of information: the CPF, which is the document number I will use, and the XML (which represents the result of a query for that document against a database).

What I Must Achieve:

In this old format, each document has one XML block containing the query results for all the years (2006 to 2010). After reformatting, each input line is converted into 5 output lines:

CPF=17834368168;YEARS=2010; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2010</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2009; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2009</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2008; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2008</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2007; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2007</ANO><SITUACAODECLARACAO>Sua declaração consta como Pedido de Regularização(PR), na base de dados da Secretaria da Receita Federal do Brasil</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2006; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2006</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>

One line per year, each containing that year's information for the document. So the output files are five times longer than the input files.

Performance Issue:

Each file has 400,000 lines, and I have 133 files to process.

At the moment, here is the flow of my app :

  1. Open a file
  2. Read a line
  3. Parse it to the new format
  4. Write the line to the output file
  5. Go to 2 until there are no lines left
  6. Go to 1 until there are no files left

Each input file is about 700MB, and it is taking forever to read the files and write the converted versions. A 400KB file takes ~30 seconds to process.
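
For reference, here is a minimal sketch of the kind of loop described above (a hypothetical reconstruction; ConvertLine stands in for the actual reformatting logic, which is not shown):

using System.Collections.Generic;
using System.IO;

class Converter
{
    static void ProcessFile(string inputPath, string outputPath)
    {
        // Both streams are closed and disposed by the using blocks.
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)          // 2. read a line
            {
                foreach (string converted in ConvertLine(line)) // 3. parse to the new format
                {
                    writer.WriteLine(converted);                // 4. write to the output file
                }
            }
        }
    }

    // Placeholder for the real conversion (one input line -> five output lines).
    static IEnumerable<string> ConvertLine(string line)
    {
        yield return line; // actual parsing omitted
    }
}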

Extra Information:

My machine runs an Intel i5 processor with 8GB of RAM.

I am not instantiating tons of objects, to avoid memory bloat, and I'm using a using statement when opening the input file.

What can I do to make it run faster?

asked Feb 24 '12 by Marcello Grechi Lins


2 Answers

I don't know what your code looks like, but here's an example which on my box (admittedly with an SSD and an i7, but...) processes a 400K file in about 50ms.

I haven't even thought about optimizing it - I've written it in the cleanest way I could. (Note that it's all lazily evaluated; File.ReadLines and File.WriteAllLines take care of opening and closing the files.)

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Test
{
    public static void Main()
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        var lines = from line in File.ReadLines("input.txt")
                    let cpf = ParseCpf(line)
                    let xml = ParseXml(line)
                    from year in ParseYears(line)
                    select cpf + year + xml;

        File.WriteAllLines("output.txt", lines);
        stopwatch.Stop();
        Console.WriteLine("Completed in {0}ms", stopwatch.ElapsedMilliseconds);
    }

    // Returns the CPF, in the form "CPF=xxxxxx;"
    static string ParseCpf(string line)
    {
        int start = line.IndexOf("CPF=");
        int end = line.IndexOf(";", start);
        // TODO: Validation
        return line.Substring(start, end + 1 - start);
    }

    // Returns a sequence of year values, in the form "YEARS=2010;"
    static IEnumerable<string> ParseYears(string line)
    {
        // First year.
        int start = line.IndexOf("YEARS=") + 6;
        int end = line.IndexOf(" ", start);
        // TODO: Validation
        string years = line.Substring(start, end - start);
        foreach (string year in years.Split(';'))
        {
            yield return "YEARS=" + year + ";";
        }
    }

    // Returns all the XML from the leading space onwards
    static string ParseXml(string line)
    {
        int start = line.IndexOf(" <?xml");
        // TODO: Validation
        return line.Substring(start);
    }
}
answered Nov 18 '22 by Jon Skeet


This looks like a fine candidate for pipelining.

The basic idea is to have 3 concurrent Tasks, one for each "stage" in the pipeline, communicating with each other through queues (BlockingCollection):

  1. The first task reads the input file line-by-line and puts read lines into the queue.
  2. The second task gets the lines from the queue, formats them and puts the result in another queue.
  3. The third task gets the formatted results from the second queue and writes them to the resulting file.

Ideally, task 1 should not wait for task 2 to finish before moving on to the next file.

You can even go berserk and put each individual file's pipeline into a separate parallel task, but that would thrash your HDD's head so badly it would probably hurt more than help. For an SSD, on the other hand, this might actually be justified; in any case, measure before making a decision.
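
For illustration, per-file parallelism along those lines might look like this (my own sketch, not part of the original answer; the file pattern and the convertFile delegate are assumptions):

using System;
using System.IO;
using System.Threading.Tasks;

class ParallelFiles
{
    // Converts several files concurrently. On an HDD, a degree of parallelism
    // above 1-2 will likely just thrash the head; on an SSD, higher values may
    // pay off. Measure before committing to a number.
    static void ConvertAll(string inputDir, string outputDir,
                           Action<string, string> convertFile)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
        Parallel.ForEach(Directory.EnumerateFiles(inputDir, "*.txt"), options, inputPath =>
        {
            string outputPath = Path.Combine(outputDir, Path.GetFileName(inputPath));
            convertFile(inputPath, outputPath);
        });
    }
}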

--- EDIT ---

Using Jon Skeet's single-threaded implementation as a base, here is what a pipelined version would look like (working example):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class Test {

    struct Queue2Element {
        public string CPF;
        public List<string> Years;
        public string XML;
    }

    public static void Main() {

        Stopwatch stopwatch = Stopwatch.StartNew();

        var queue1 = new BlockingCollection<string>();
        var task1 = new Task(
            () => {
                foreach (var line in File.ReadLines("input.txt"))
                    queue1.Add(line);
                queue1.CompleteAdding();
            }
        );

        var queue2 = new BlockingCollection<Queue2Element>();
        var task2 = new Task(
            () => {
                foreach (var line in queue1.GetConsumingEnumerable())
                    queue2.Add(
                        new Queue2Element {
                            CPF = ParseCpf(line),
                            XML = ParseXml(line),
                            Years = ParseYears(line).ToList()
                        }
                    );
                queue2.CompleteAdding();
            }
        );

        var task3 = new Task(
            () => {
                var lines = 
                    from element in queue2.GetConsumingEnumerable()
                    from year in element.Years
                    select element.CPF + year + element.XML;
                File.WriteAllLines("output.txt", lines);
            }
        );

        task1.Start();
        task2.Start();
        task3.Start();
        Task.WaitAll(task1, task2, task3);

        stopwatch.Stop();
        Console.WriteLine("Completed in {0}ms", stopwatch.ElapsedMilliseconds);

    }

    // Returns the CPF, in the form "CPF=xxxxxx;"
    static string ParseCpf(string line) {
        int start = line.IndexOf("CPF=");
        int end = line.IndexOf(";", start);
        // TODO: Validation
        return line.Substring(start, end + 1 - start);
    }

    // Returns a sequence of year values, in the form "YEARS=2010;"
    static IEnumerable<string> ParseYears(string line) {
        // First year.
        int start = line.IndexOf("YEARS=") + 6;
        int end = line.IndexOf(" ", start);
        // TODO: Validation
        string years = line.Substring(start, end - start);
        foreach (string year in years.Split(';')) {
            yield return "YEARS=" + year + ";";
        }
    }

    // Returns all the XML from the leading space onwards
    static string ParseXml(string line) {
        int start = line.IndexOf(" <?xml");
        // TODO: Validation
        return line.Substring(start);
    }

}

As it turns out, the parallel version above is only slightly faster than the serial version. Apparently, the task is more I/O-bound than anything else, so pipelining does not help much. If you increase the amount of processing (e.g. add robust validation), this might change the situation in favor of parallelism, but for now you are probably best off concentrating on serial improvements (as Jon Skeet himself noted, the code is not as fast as it can be).
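
One serial improvement worth trying (my own suggestion; the 1MB buffer size is an assumption to be tuned): give the output stream a much larger buffer, so the converted lines reach the disk in fewer, larger sequential writes:

using System.Collections.Generic;
using System.IO;
using System.Text;

class BufferedOutput
{
    static void WriteConverted(string outputPath, IEnumerable<string> lines)
    {
        // 1MB buffers instead of the small defaults mean fewer, larger writes.
        // ISO-8859-1 matches the encoding declared in the XML payload.
        using (var stream = new FileStream(outputPath, FileMode.Create, FileAccess.Write,
                                           FileShare.None, 1 << 20))
        using (var writer = new StreamWriter(stream, Encoding.GetEncoding("ISO-8859-1"), 1 << 20))
        {
            foreach (string line in lines)
                writer.WriteLine(line);
        }
    }
}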

(Also, I tested with cached files - I wonder if there is a way to clear the Windows file cache and see whether a hardware I/O queue of depth 2 would let the hard disk optimize head movements compared to the I/O depth of 1 in the serial version.)

answered Nov 18 '22 by Branko Dimitrijevic