Overview of My Situation:
My task is to read strings from a file, reformat them into a more useful format, and then write the result to an output file.
Here is an example of what has to be done. Example of a file line:
ANO=2010;CPF=17834368168;YEARS=2010;2009;2008;2007;2006 <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2010</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2009</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2008</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2007</ANO><SITUACAODECLARACAO>Sua declaração consta como Pedido de Regularização(PR), na base de dados da Secretaria da Receita Federal do Brasil</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2006</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
Each line of this input file carries two important pieces of information: the CPF, which is the document number I will use, and the XML (which represents the result of a query for that document against a database).
What I Must Achieve:
Each document, in this old format, has an XML blob containing the query results for all the years (2006 to 2010). After reformatting, each input line is converted into 5 output lines:
CPF=17834368168;YEARS=2010; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2010</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2009; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2009</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2008; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2008</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2007; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2007</ANO><SITUACAODECLARACAO>Sua declaração consta como Pedido de Regularização(PR), na base de dados da Secretaria da Receita Federal do Brasil</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2006; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2006</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
One line per year, containing that year's information about the document. So basically, the output files are roughly five times longer than the input files.
Performance Issue:
Each file has 400,000 lines, and I have 133 files to process.
At the moment, the flow of my app is simply: read a line, reformat it, write it out. Each input file is about 700 MB, and it is taking forever to read the files and write the converted versions: a 400 KB sample file takes ~30 seconds to process.
Extra Information:
My machine runs on an Intel i5 processor with 8 GB of RAM.
I am not instantiating tons of objects (to avoid memory leaks), and I'm wrapping the input file in a using statement.
What can I do to make it run faster?
I don't know what your code looks like, but here's an example which, on my box (admittedly with an SSD and an i7, but...), processes a 400 KB file in about 50ms.
I haven't even thought about optimizing it - I've written it in the cleanest way I could. (Note that it's all lazily evaluated; File.ReadLines and File.WriteAllLines take care of opening and closing the files.)
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Test
{
    public static void Main()
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        var lines = from line in File.ReadLines("input.txt")
                    let cpf = ParseCpf(line)
                    let xml = ParseXml(line)
                    from year in ParseYears(line)
                    select cpf + year + xml;
        File.WriteAllLines("output.txt", lines);
        stopwatch.Stop();
        Console.WriteLine("Completed in {0}ms", stopwatch.ElapsedMilliseconds);
    }

    // Returns the CPF, in the form "CPF=xxxxxx;"
    static string ParseCpf(string line)
    {
        int start = line.IndexOf("CPF=");
        int end = line.IndexOf(";", start);
        // TODO: Validation
        return line.Substring(start, end + 1 - start);
    }

    // Returns a sequence of year values, in the form "YEARS=2010;"
    static IEnumerable<string> ParseYears(string line)
    {
        // First year.
        int start = line.IndexOf("YEARS=") + 6;
        int end = line.IndexOf(" ", start);
        // TODO: Validation
        string years = line.Substring(start, end - start);
        foreach (string year in years.Split(';'))
        {
            yield return "YEARS=" + year + ";";
        }
    }

    // Returns all the XML from the leading space onwards
    static string ParseXml(string line)
    {
        int start = line.IndexOf(" <?xml");
        // TODO: Validation
        return line.Substring(start);
    }
}
This looks like a fine candidate for pipelining.
The basic idea is to have 3 concurrent tasks, one for each "stage" in the pipeline - read, parse, and write - communicating with each other through queues (BlockingCollection).
Ideally, task 1 should not wait for task 2 to finish before moving on to the next file.
You can even go berserk and put each individual file's pipeline into a separate parallel task, but that would thrash your HDD's head so badly it would probably hurt more than help. For an SSD, on the other hand, this might actually be justified - in any case, measure before making a decision.
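For reference, here is a minimal sketch of that per-file approach. The ConvertAll helper and the directory/delegate names are my own placeholders (the convertFile delegate would wrap whatever single-file conversion you settle on), and the low MaxDegreeOfParallelism is a deliberate starting point for a spinning disk - tune it by measurement:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class ParallelFiles
{
    // Process each input file on its own task, capped at a small degree of
    // parallelism so a single spinning disk isn't thrashed by competing reads.
    public static void ConvertAll(string[] inputFiles, string outputDir,
                                  int maxParallel, Action<string, string> convertFile)
    {
        Parallel.ForEach(
            inputFiles,
            new ParallelOptions { MaxDegreeOfParallelism = maxParallel },
            path => convertFile(path, Path.Combine(outputDir, Path.GetFileName(path))));
    }
}
```

With 133 files to get through, even a parallelism of 2 may be worth measuring against the strictly serial loop.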
Using Jon Skeet's single-threaded implementation as a base, here is what a pipelined version looks like (working example):
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class Test {
    struct Queue2Element {
        public string CPF;
        public List<string> Years;
        public string XML;
    }

    public static void Main() {
        Stopwatch stopwatch = Stopwatch.StartNew();

        // Stage 1: read lines from the input file.
        var queue1 = new BlockingCollection<string>();
        var task1 = new Task(
            () => {
                foreach (var line in File.ReadLines("input.txt"))
                    queue1.Add(line);
                queue1.CompleteAdding();
            }
        );

        // Stage 2: parse each line into its components.
        var queue2 = new BlockingCollection<Queue2Element>();
        var task2 = new Task(
            () => {
                foreach (var line in queue1.GetConsumingEnumerable())
                    queue2.Add(
                        new Queue2Element {
                            CPF = ParseCpf(line),
                            XML = ParseXml(line),
                            Years = ParseYears(line).ToList()
                        }
                    );
                queue2.CompleteAdding();
            }
        );

        // Stage 3: write the reformatted lines to the output file.
        var task3 = new Task(
            () => {
                var lines =
                    from element in queue2.GetConsumingEnumerable()
                    from year in element.Years
                    select element.CPF + year + element.XML;
                File.WriteAllLines("output.txt", lines);
            }
        );

        task1.Start();
        task2.Start();
        task3.Start();
        Task.WaitAll(task1, task2, task3);

        stopwatch.Stop();
        Console.WriteLine("Completed in {0}ms", stopwatch.ElapsedMilliseconds);
    }

    // Returns the CPF, in the form "CPF=xxxxxx;"
    static string ParseCpf(string line) {
        int start = line.IndexOf("CPF=");
        int end = line.IndexOf(";", start);
        // TODO: Validation
        return line.Substring(start, end + 1 - start);
    }

    // Returns a sequence of year values, in the form "YEARS=2010;"
    static IEnumerable<string> ParseYears(string line) {
        // First year.
        int start = line.IndexOf("YEARS=") + 6;
        int end = line.IndexOf(" ", start);
        // TODO: Validation
        string years = line.Substring(start, end - start);
        foreach (string year in years.Split(';')) {
            yield return "YEARS=" + year + ";";
        }
    }

    // Returns all the XML from the leading space onwards
    static string ParseXml(string line) {
        int start = line.IndexOf(" <?xml");
        // TODO: Validation
        return line.Substring(start);
    }
}
As it turns out, the parallel version above is only slightly faster than the serial version. Apparently the task is more I/O-bound than anything else, so pipelining does not help much. If you increase the amount of processing (e.g. add robust validation) this might change the situation in favor of parallelism, but for now you are probably best off concentrating on serial improvements (as Jon Skeet himself noted, the code is not as fast as it can be).
(Also, I tested with cached files - I wonder whether there is a way to clear the Windows file cache and see if a hardware I/O queue depth of 2 would let the hard disk optimize head movements, compared to the serial version's queue depth of 1.)
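On the serial-improvement front, one thing worth verifying is that both the reader and the writer use generous buffers - writing each of the roughly 2 million output lines per file through a small default buffer is a classic slowdown. A minimal sketch of one file's conversion with explicit buffer sizes; the reformat delegate is a stand-in for the real per-line parsing, and the 1 MB buffers and ISO-8859-1 encoding (taken from the XML declaration in the sample line) are assumptions to measure against:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class BufferedConvert
{
    // Convert one file using explicitly buffered streams. The reformat
    // delegate maps one input line to its (five) output lines.
    public static void Convert(string inputPath, string outputPath,
                               Func<string, IEnumerable<string>> reformat)
    {
        var encoding = Encoding.GetEncoding("ISO-8859-1"); // matches the XML declaration
        using (var reader = new StreamReader(inputPath, encoding, false, 1 << 20))
        using (var writer = new StreamWriter(outputPath, false, encoding, 1 << 20))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                foreach (var outputLine in reformat(line))
                    writer.WriteLine(outputLine);
            }
        }
    }
}
```

The same effect can be had by wrapping the FileStream in a BufferedStream; either way, the point is that only occasional large writes should reach the disk.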