I have a .txt file (over a million rows, around 1 GB) and a list of strings. I am trying to remove from the file every row that exists in the list of strings and write the result to a new file, but it is taking a very long time.
using (StreamReader reader = new StreamReader(_inputFileName))
{
    using (StreamWriter writer = new StreamWriter(_outputFileName))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (!_lstLineToRemove.Contains(line))
                writer.WriteLine(line);
        }
    }
}
How can I enhance the performance of my code?
You may get some speedup by using PLINQ to do the work in parallel. Switching from a List<string> to a HashSet<string> will also greatly speed up the Contains() check, since a hash-set lookup is O(1) while a list scan is O(n). HashSet<string> is thread-safe for read-only operations.
private HashSet<string> _hshLineToRemove;

void ProcessFiles()
{
    // File.ReadLines streams the file lazily, so the whole 1 GB is never held in memory.
    var inputLines = File.ReadLines(_inputFileName);
    var filteredInputLines = inputLines.AsParallel().AsOrdered().Where(line => !_hshLineToRemove.Contains(line));
    // File.WriteAllLines consumes the sequence as it is produced.
    File.WriteAllLines(_outputFileName, filteredInputLines);
}
If it does not matter whether the output file is in the same order as the input file, you can remove the .AsOrdered() call and gain some additional speed.
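For example, the only change from the version above is dropping .AsOrdered():

// Unordered variant: output line order may differ from the input.
var filteredInputLines = inputLines.AsParallel().Where(line => !_hshLineToRemove.Contains(line));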
Beyond this you are really just I/O bound; the only way to make it any faster is to run it on faster drives.
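If you do stay I/O bound, one small thing worth measuring is giving the reader and writer larger buffers so fewer OS calls are made. Here is a hedged sketch of your original loop using the StreamReader/StreamWriter constructor overloads that accept a buffer size; the 1 MB figure is an arbitrary assumption, so measure on your own hardware:

// Sketch only: same filtering loop as the question, but with explicit
// 1 MB buffers. _inputFileName, _outputFileName and _hshLineToRemove
// are the fields already used above; Encoding is from System.Text.
using (var reader = new StreamReader(_inputFileName, Encoding.UTF8, true, 1 << 20))
using (var writer = new StreamWriter(_outputFileName, false, Encoding.UTF8, 1 << 20))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (!_hshLineToRemove.Contains(line))
            writer.WriteLine(line);
    }
}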