CsvHelper - Reading Stream Asynchronously

I have a service that takes an input Stream containing CSV data that needs to be bulk-inserted into a database, and my application is using async/await wherever possible.

The process is: parse the stream using CsvHelper's CsvParser, add each row to a DataTable, then use SqlBulkCopy to copy the DataTable to the database.

The data could be any size, so I'd like to avoid reading the whole thing into memory at once. I'll have all of that data in the DataTable by the end anyway, so reading the stream up front would essentially leave me with 2 copies in memory.

I would like to do all of this as asynchronously as possible, but CsvHelper doesn't have any async methods so I've come up with the following workaround:

using (var inputStreamReader = new StreamReader(inputStream))
{
    while (!inputStreamReader.EndOfStream)
    {
        // Read line from the input stream
        string line = await inputStreamReader.ReadLineAsync();

        using (var memoryStream = new MemoryStream())
        using (var streamWriter = new StreamWriter(memoryStream))
        using (var memoryStreamReader = new StreamReader(memoryStream))
        using (var csvParser = new CsvParser(memoryStreamReader))
        {
            await streamWriter.WriteLineAsync(line);
            await streamWriter.FlushAsync();

            memoryStream.Position = 0;

            // Loop through all the rows (should only be one as we only read a single line...)
            while (true)
            {
                var row = csvParser.Read();

                // No more rows to process
                if (row == null)
                {
                    break;
                }

                // Add row to DataTable
            }
        }
    }
}

Are there any issues with this solution? Is it even necessary? I've seen that the CsvHelper devs specifically did not add async functionality (https://github.com/JoshClose/CsvHelper/issues/202) but I don't really follow the reasoning behind not doing so.

EDIT: I've just realised that this solution isn't going to work for instances where a column contains a line break anyway :( Guess I'll just have to copy the whole input Stream to a MemoryStream or something

EDIT2: Some more information.

This is in an async method in a library where I am trying to do async all the way down. It'll likely be consumed by an MVC controller (if I just wanted to offload it from a UI thread I would just Task.Run() it). Mostly the method will be waiting on external sources such as a database / DFS, and I would like for the thread to be freed while it is.

CsvParser.Read() is going to block even if what's blocking is reading the Stream (e.g. if the data I'm attempting to read resides on a server on the other side of the world), whereas if CsvHelper were to implement an async method that uses TextReader.ReadAsync(), then I wouldn't be blocked waiting for my data to arrive from Dubai. As far as I can tell, I'm not asking for an async wrapper around a synchronous method.

EDIT3: Update from far in the future! Async functionality was actually added to CsvHelper back in 2017. I hope someone at the company I was working for has upgraded to the newer version since then!
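
For anyone landing here on a newer version, here is a minimal sketch of how the same job might look with the async API. It assumes a recent CsvHelper release in which CsvReader takes a CultureInfo and exposes ReadAsync(). ReadToDataTableAsync is a name made up for illustration, the first row is assumed to be a header, and every column is treated as a string.

```csharp
using System.Data;
using System.Globalization;
using System.IO;
using System.Threading.Tasks;
using CsvHelper;

public static class CsvAsyncSketch
{
    // Hypothetical helper: fills a DataTable from a CSV stream,
    // awaiting the underlying reads instead of blocking on them.
    public static async Task<DataTable> ReadToDataTableAsync(Stream inputStream)
    {
        var table = new DataTable();

        using (var reader = new StreamReader(inputStream))
        using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
        {
            // Assumption: the first row holds the column names
            if (!await csv.ReadAsync())
            {
                return table; // empty input
            }
            csv.ReadHeader();

            foreach (var name in csv.HeaderRecord)
            {
                table.Columns.Add(name, typeof(string));
            }

            // The parser deals with quoted fields, including embedded line breaks
            while (await csv.ReadAsync())
            {
                var row = table.NewRow();
                for (var i = 0; i < table.Columns.Count; i++)
                {
                    row[i] = csv.GetField(i);
                }
                table.Rows.Add(row);
            }
        }

        return table;
    }
}
```

Because the parser reads whole records rather than single lines, this also avoids the line-break problem from my first edit. It still builds the entire DataTable in memory, so it does not address the size concern by itself.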

Lykaios asked May 06 '16 02:05


2 Answers

Eric Lippert explained the usefulness of async-await using the metaphor of cooking a meal in a restaurant. According to his explanation, it is not useful to do something asynchronously if your thread has nothing else to do.

Also, be aware that while your thread is doing something, it cannot do anything else. Only while your thread is waiting for something can it do other things. One of the things you wait for in your process is the reading of a file. While the thread is reading the file line by line, it has to wait several times for lines to be read. During this waiting it could do other things, like parsing the CSV data it has already read and sending the parsed data to your destination.

Parsing the data is not a process where your thread has to wait for some other process to finish, as it does when reading a file or sending data to a database. That's why there is no async version of the parsing process. A normal async-await wouldn't help keep your thread busy: during parsing there is nothing to await, so your thread wouldn't have time to do anything else.

You could of course convert the parsing process to an awaitable task using Task.Run(() => ParseReadData(...)) and await this task, but in the analogy of Eric Lippert's restaurant this would be defrosting a cook to do the job while you sit behind the counter doing nothing.

However, if your thread has something meaningful to do, while the read CSV-data is being parsed, like responding to user input, then it might be useful to start the parsing in a separate task.
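
As a rough sketch of that last case, the snippet below offloads a CPU-bound parse to the thread pool with Task.Run so the calling thread stays free. ParseReadData here is a stand-in invented for illustration (a naive comma split with no quoting support), not CsvHelper's parser, and ParseInBackgroundAsync is likewise a hypothetical name.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class ParseOffloadSketch
{
    // Hypothetical CPU-bound parser: naive comma split, no support for quoted fields
    public static string[][] ParseReadData(string csvText) =>
        csvText.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
               .Select(line => line.TrimEnd('\r').Split(','))
               .ToArray();

    // Runs the parse on a thread-pool thread; the caller can await the result
    // later and keep responding to user input in the meantime.
    public static Task<string[][]> ParseInBackgroundAsync(string csvText) =>
        Task.Run(() => ParseReadData(csvText));
}
```

A UI event handler could call `var rows = await ParseOffloadSketch.ParseInBackgroundAsync(text);` and remain responsive while the parse runs. If the caller has nothing else to do, this only adds overhead.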

If your complete read, parse, and update-database process doesn't need interaction with the user, but you need your thread to be free to do other things while it runs, consider putting the complete process in a separate task and starting that task without awaiting it. In that case you only use your interface thread to start the other task, and your interface thread is free to do other things. Starting this new task is a relatively small cost in comparison to the total time of your process.

Once again: if your thread has nothing else to do, let this thread do the processing; don't start other tasks to do it.

Harald Coppoolse answered Sep 19 '22 13:09


Here is a good article on exposing async wrappers for synchronous methods, and why CsvHelper didn't do it: http://blogs.msdn.com/b/pfxteam/archive/2012/03/24/10287244.aspx

If you don't want to block the UI thread, run the processing on a background thread.

CsvHelper pulls in a buffer of data. The size of the buffer is a setting that you can change if you like. If your server is on the other side of the world, it'll buffer some data, then read it. More than likely, it'll take several reads before the buffer is used.

CsvHelper also yields records, so if you don't actually get a row, nothing is read. If you only read a couple rows, only that much of the file is read (actually the buffer size).

If you're worried about memory, there are a couple simple options.

  1. Buffer the data. You can bulk copy in 100 or 1000 rows at a time instead of the whole file. Just keep doing that until the file is done.
  2. Use a FileStream. If you need to read the whole file at once for some reason, use a FileStream instead and write the whole thing to disk. It will be slower, but you won't be using a bunch of memory.
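
The first option could be sketched roughly as follows. ProcessInBatchesAsync and its flushAsync callback are hypothetical names used only for illustration. In a real pipeline the callback might copy the batch into a DataTable and hand it to SqlBulkCopy.WriteToServerAsync, but that part is left out here so the snippet stays self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BatchSketch
{
    // Hypothetical helper: hands rows to flushAsync in groups of batchSize,
    // so only one batch is held in memory at a time.
    // Returns the number of batches flushed.
    public static async Task<int> ProcessInBatchesAsync<T>(
        IEnumerable<T> rows,
        int batchSize,
        Func<IReadOnlyList<T>, Task> flushAsync)
    {
        if (batchSize <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(batchSize));
        }

        var batch = new List<T>(batchSize);
        var flushed = 0;

        foreach (var row in rows)
        {
            batch.Add(row);
            if (batch.Count == batchSize)
            {
                // The list is reused after the flush, so the callback must not keep a reference to it
                await flushAsync(batch);
                flushed++;
                batch.Clear();
            }
        }

        // Flush whatever is left over at the end of the input
        if (batch.Count > 0)
        {
            await flushAsync(batch);
            flushed++;
        }

        return flushed;
    }
}
```

Since records are yielded lazily, passing the record enumeration as `rows` means the file is only read as far as the current batch needs.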

Josh Close answered Sep 19 '22 13:09