I have a 60GB csv file I need to make some modifications to. The customer wants some changes to the files data, but I don't want to regenerate the data in that file because it took 4 days to do.
How can I read the file, line by line (not loading it all into memory!), and make edits to those lines as I go, replacing certain values etc.?
You can do open("data. csv", "rw") , this allows you to read and write at the same time. So will this help me modify the data?
How do you write a line by line in csv? Use write() to write into a CSV file write(str) to write to file with str as the desired data. Each line should be separated by \n to write line by line.
The process would be something like this:
StreamWriter
to a temporary file.StreamReader
to the target file.Note regarding Steps 2 and 3.1: If you are confident in the structure of your file and it is simple enough, you can do all this out of the box as described (I'll include a sample in a moment). However, there are factors in a CSV file that may need attention (such as recognizing when a delimiter is being used literally in a column value). You can drudge through this yourself, or try an existing solution.
Basic example just using StreamReader
and StreamWriter
:
var sourcePath = @"C:\data.csv";
var delimiter = ",";
var firstLineContainsHeaders = true;
var tempPath = Path.GetTempFileName();
var lineNumber = 0;
var splitExpression = new Regex(@"(" + delimiter + @")(?=(?:[^""]|""[^""]*"")*$)");
using (var writer = new StreamWriter(tempPath))
using (var reader = new StreamReader(sourcePath))
{
string line = null;
string[] headers = null;
if (firstLineContainsHeaders)
{
line = reader.ReadLine();
lineNumber++;
if (string.IsNullOrEmpty(line)) return; // file is empty;
headers = splitExpression.Split(line).Where(s => s != delimiter).ToArray();
writer.WriteLine(line); // write the original header to the temp file.
}
while ((line = reader.ReadLine()) != null)
{
lineNumber++;
var columns = splitExpression.Split(line).Where(s => s != delimiter).ToArray();
// if there are no headers, do a simple sanity check to make sure you always have the same number of columns in a line
if (headers == null) headers = new string[columns.Length];
if (columns.Length != headers.Length) throw new InvalidOperationException(string.Format("Line {0} is missing one or more columns.", lineNumber));
// TODO: search and replace in columns
// example: replace 'v' in the first column with '\/': if (columns[0].Contains("v")) columns[0] = columns[0].Replace("v", @"\/");
writer.WriteLine(string.Join(delimiter, columns));
}
}
File.Delete(sourcePath);
File.Move(tempPath, sourcePath);
memory-mapped files is a new feature in .NET Framework 4 which can be used to edit large files. read here http://msdn.microsoft.com/en-us/library/dd997372.aspx or google Memory-mapped files
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With