 

Parse very large CSV files with C++

My goal is to parse large CSV files with C++ in a Qt project, in an OS X environment. (When I say CSV I also mean TSV and other variants, 1 GB ~ 5 GB in size.)

It seems like a simple task, but things get complicated as file sizes grow. I don't want to write my own parser because of the many edge cases involved in parsing CSV files.

I have found various CSV processing libraries to handle this job, but parsing a 1 GB file takes about 90~120 seconds on my machine, which is not acceptable. I am not doing anything with the data right now; I just parse and discard it for testing purposes.

cccsvparser is one of the libraries I have tried. But the only library that was fast enough was fast-cpp-csv-parser, which gives acceptable results: 15 seconds on my machine. However, it works only when the file structure is known.

Example using fast-cpp-csv-parser:

#include "csv.h"

int main(){
    io::CSVReader<3> in("ram.csv");
    in.read_header(io::ignore_extra_column, "vendor", "size", "speed");
    std::string vendor; int size; double speed;
    while(in.read_row(vendor, size, speed)){
    // do stuff with the data
    }
}

As you can see, I cannot load arbitrary files, and I must specifically define variables to match my file structure. I'm not aware of any method that allows me to create those variables dynamically at runtime.

The other approach I have tried is to read the CSV file line by line with the fast-cpp-csv-parser LineReader class, which is really fast (about 7 seconds to read the whole file), and then parse each line with the cccsvparser library, which can process strings. But this takes about 40 seconds in total; it is an improvement compared to the first attempts, but still unacceptable.
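For reference, a minimal sketch of that hybrid approach. Only the LineReader part comes from fast-cpp-csv-parser; parseLine() below is a naive stand-in for a real string-based parser such as cccsvparser and does not handle quoting or escaping:

#include "csv.h"        // fast-cpp-csv-parser
#include <sstream>
#include <string>
#include <vector>

// Naive placeholder for a real string-based CSV parser (e.g. cccsvparser)
std::vector<std::string> parseLine(const std::string &line)
{
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ','))
        fields.push_back(field);
    return fields;
}

int main(){
    io::LineReader in("ram.csv");
    // next_line() returns a zero-terminated line without the newline, or nullptr at EOF
    while(char *line = in.next_line()){
        std::vector<std::string> fields = parseLine(line);
        // do stuff with the fields, then discard them
    }
}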

I have seen various Stack Overflow questions related to CSV file parsing, but none of them takes large file processing into account.

I also spent a lot of time googling for a solution to this problem, and I really miss the freedom that package managers like npm or pip offer when searching for out-of-the-box solutions.

I would appreciate any suggestion about how to handle this problem.

Edit:

When using @fbucek's approach, processing time was reduced to 25 seconds, which is a great improvement.

Can we optimize this even more?

asked Dec 10 '14 by Alexander


2 Answers

I am assuming you are using only one thread.

Multithreading can speed up your processing.

Your best result so far is 40 seconds. Let's start from that.

I have assumed that first you read, then you process (about 7 seconds to read the whole file):

7 seconds for reading, 33 seconds for processing.

First of all, you can divide your file into chunks, let's say 50 MB each. That means you can start processing after reading just 50 MB of the file; you do not need to wait until the whole file is read. That is about 0.35 seconds for reading (now it is 0.35 + 33 seconds for processing, roughly 34 seconds).

When you use multithreading, you can process multiple chunks at a time. That can theoretically speed up processing by up to the number of your cores. Let's say you have 4 cores: that is 33 / 4 = 8.25 seconds.

I think with 4 cores you can get processing down to about 9 seconds in total.

Look at QThreadPool and QRunnable, or QtConcurrent. I would prefer QThreadPool.
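For completeness, a rough sketch of the QtConcurrent variant. ProcessedData, processChunk and the pre-split chunks are illustrative placeholders, not part of any library:

#include <QtConcurrent>   // QT += concurrent in the .pro file
#include <QByteArray>
#include <QList>
#include <QFuture>

// Hypothetical result type produced for each chunk
struct ProcessedData { /* parsed rows, counters, ... */ };

// Hypothetical: parse one chunk consisting of complete lines
ProcessedData processChunk(const QByteArray &chunk)
{
    ProcessedData result;
    // ... parse the lines in 'chunk' and fill 'result' ...
    return result;
}

void processFile(const QList<QByteArray> &chunks)
{
    // mapped() runs processChunk on the global thread pool, one chunk per task
    QFuture<ProcessedData> results = QtConcurrent::mapped(chunks, processChunk);
    results.waitForFinished();
    // results.results() then holds one ProcessedData per chunk, in input order
}

QtConcurrent::mapped() schedules the calls on the global thread pool, so the effect is similar to the QThreadPool approach described below, just with less manual control.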

Divide task into parts:

  1. First, try to loop over the file and divide it into chunks, doing nothing else with the data.
  2. Then create a "ChunkProcessor" class which can process one such chunk.
  3. Make "ChunkProcessor" a subclass of QRunnable and execute your processing in the reimplemented run() function.
  4. When you have chunks, you have a class which can process them, and that class is QThreadPool compatible, so you can pass it into QThreadPool (see the example below).

It could look like this:

loop over file {
  whenever a chunk is ready {
     ChunkProcessor *chunkprocessor = new ChunkProcessor(chunk);
     // connect before start() so the finished() signal cannot be missed;
     // ChunkProcessor must also inherit QObject to be able to emit signals
     connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
     QThreadPool::globalInstance()->start(chunkprocessor);
  }
}

You can use std::shared_ptr to pass the processed data around, so that you do not need QMutex or anything else, and you avoid serialization problems caused by multiple threads accessing the same resource.

Note: in order to use a custom type in a signal, you have to register it before use:

qRegisterMetaType<std::shared_ptr<ProcessedData>>("std::shared_ptr<ProcessedData>");

Edit: (based on the discussion, my answer was not clear about this) It does not matter what disk you use or how fast it is. Reading is a single-threaded operation. This solution was suggested only because it took 7 seconds to read, and again, it does not matter what disk it is; 7 seconds is what counts. The only purpose is to start processing as soon as possible and not to wait until reading is finished.

You can use:

QByteArray data = file.readAll();

Or you can use the main idea directly (I do not know why it takes 7 seconds to read, or what is behind it):

 QFile file("in.txt");
 if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
   return;

 QByteArray* data = new QByteArray;    
 int count = 0;
 while (!file.atEnd()) {
   ++count;
   data->append(file.readLine());
   if ( count > 10000 ) {
     ChunkProcessor *chunkprocessor = new ChunkProcessor(data);
     QThreadPool::globalInstance()->start(chunkprocessor);
     connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
     data = new QByteArray; 
     count = 0;
   }
 }

One file, one thread, read almost as fast as reading line by line without interruption. What you do with the data is another problem, but it has nothing to do with I/O; it is already in memory. The only concern would be a 5 GB file and the amount of RAM on the machine.

It is a very simple solution: all you need is to subclass QRunnable, reimplement the run() function, emit a signal when it is finished, pass the processed data using a shared pointer, and in the main thread join that data into one structure or whatever you need. A simple, thread-safe solution.
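A bare-bones sketch of such a ChunkProcessor, assuming a hypothetical ProcessedData type and leaving the actual parsing out:

#include <QObject>
#include <QRunnable>
#include <QByteArray>
#include <memory>

struct ProcessedData { /* whatever you extract from one chunk */ };

// Inherits QObject to be able to emit signals and QRunnable so QThreadPool can run it
class ChunkProcessor : public QObject, public QRunnable
{
    Q_OBJECT
public:
    explicit ChunkProcessor(QByteArray *data) : m_data(data) {}
    ~ChunkProcessor() { delete m_data; }   // takes ownership of the chunk

    void run() override {
        auto result = std::make_shared<ProcessedData>();
        // ... parse the lines in *m_data and fill *result ...
        emit finished(result);             // delivered to the main thread via a queued connection
    }

signals:
    void finished(std::shared_ptr<ProcessedData> result);

private:
    QByteArray *m_data;
};

In the main thread you connect finished() to readingFinished() before calling start(), and merge each ProcessedData into your final structure there; as noted above, the shared_ptr type has to be registered with qRegisterMetaType for the queued connection to work.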

answered Oct 14 '22 by fbucek


I would propose a multi-threaded design with a slight variation: one thread is dedicated to reading the file in chunks of a predefined (configurable) size and keeps feeding data to a set of processing threads (more than one, based on the number of CPU cores). Let us say that the configuration looks like this:

chunk size = 50 MB
Disk Thread = 1
Process Threads = 5

  1. Create a class for reading data from the file. This class holds a data structure used to communicate with the process threads. For example, this structure would contain the starting offset and ending offset of the read buffer for each process thread. For reading file data, the reader class holds two buffers, each of chunk size (50 MB in this case).
  2. Create a process class which holds (shared) pointers to the read buffers and the offsets data structure.
  3. Now create a driver (probably the main thread) which creates all the threads, waits for their completion and handles the signals.
  4. The reader thread is invoked with the reader class, reads 50 MB of data and, based on the number of threads, creates the offsets data structure. In this case t1 handles 0 - 10 MB, t2 handles 10 - 20 MB and so on. Once ready, it notifies the processor threads. It then immediately reads the next chunk from disk and waits for the completion notification from the processor threads.
  5. The processor threads, on notification, read data from the buffer and process it. Once done, each notifies the reader thread about completion and waits for the next chunk.
  6. This continues until the whole file is read and processed. Then the reader thread notifies the main thread about completion, which sends PROCESS_COMPLETION, upon which all threads exit, or the main thread chooses to process the next file in the queue.

Note that the byte offsets are used here for easy explanation; mapping offsets to line delimiters needs to be handled programmatically. A sketch of the implied data structures follows.
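A rough sketch of the shared structures this design implies; all names are illustrative, not taken from an existing library:

#include <QByteArray>
#include <QSemaphore>
#include <vector>
#include <cstddef>

// One slice of a read buffer, assigned to a single processor thread.
// begin/end are adjusted so every slice starts and ends on a line boundary.
struct Slice {
    std::size_t begin = 0;   // starting offset into the buffer
    std::size_t end   = 0;   // one past the last byte of the slice
};

// Double-buffered chunk: while the processor threads work on one chunk,
// the disk thread fills the other one.
struct ReadChunk {
    QByteArray buffer;         // up to chunk size bytes (50 MB here)
    std::vector<Slice> slices; // one entry per processor thread
    QSemaphore ready;          // released once per processor when the slices are valid
    QSemaphore done;           // released by each processor when its slice is finished
};

The reader thread fills buffer, computes the slices, releases ready once per processor thread, and only reuses the buffer after acquiring done the same number of times, alternating between the two ReadChunk instances.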

answered Oct 14 '22 by bsr