Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speed up processing from CSV file

I have a project and I have to improve its performance. I have a big Mysql database constructed from a huge CSV file (100 millions rows). Insert time is not a problem, but response time of a request is much important and sometimes a query with 2 join takes around 20 hours...

In a goal to reduce this response time, I tried to migrate my database to Cassandra but without success : my data model is not appropried to Cassandra concepts. Then I would like to try another way to improve performance : Parallel Virutal File System. Instead insert data in Mysql database and send then send a query, I tried to read the whole csv file with multi-threading and did my computing. But the result was not good : 2m20s for only 1 000 000 rows.

For the moment, my computing is very simple : in C++ with the MPI-IO API, I just count the number of differents pair value from 2 columns. To perfom that computing, I use a hashmap where each key is a pair value from csv file. At the end, I return the hashmap size. Here a small code :

 MPI::Init(argc,argv); 
 cout << " INFO init done" << endl;
 int myrank = MPI::COMM_WORLD.Get_rank(); 
 int numprocs = MPI::COMM_WORLD.Get_size(); 
 get_filename(path_name, myrank);
 cout << " INFO open file : " << path_name << endl;
 MPI::File thefile = MPI::File::Open(MPI::COMM_WORLD, path_name.c_str(), 
                  MPI::MODE_RDONLY, 
                  MPI::INFO_NULL); 
 MPI::Offset offset = 101;
 MPI::Offset limit = thefile.Get_size();
 cout << " INFO go computing" << endl;
 do {
   thefile.Read_at(offset, buf, bufsize, MPI_CHAR, status);
   temp.assign(buf);
   Tokenize(temp,tokens,"\n");
   line.assign(tokens.at(0));
   tokens.clear();

   Tokenize(line,tokens,"\t");
   nidt_count(tokens);
   tokens.clear();
   offset += (line.size() + 1);
 }while(offset < limit);
 count = status.Get_count(MPI_INT);
 cout << "process " << myrank << " reads " << nidt_hash.size() << " nidt" << endl; 

I work on a server with 4 cores, 8GB of ram. My data is on a NAS mounted in NFS or Samba on my server. I would be able to add 2 or 3 server for processing, but for the moment I just tried on a small file (1 Million rows) on one server to measure performance.

Finally my questions are :

  • Is it a good way to think to change to PVFS kind for my issue? I would like to say that I will process with more complicated query like : select all rows with a specific date(range hours), and specific pair value from specific columns.
  • Do you know others things that could help me to improve processing from a csv file? I am thinking to use Hadoop, Pytables or FasterCSV.

Here it is a sample of my data composed by 2 csv file :

The biggest one (100 Millions rows) is composed as follow :

ID        DATE             NUM_1        NUM_2     NB_UNITE TYPUNIT CODE_1 CODE_2

0  2007-05-13 15:37:48  33671624244  33698802900    547      s       0      17
0  2007-05-13 15:52:22  33671624244  33672211799      5      s       0      17 
....

The second one is more simple and small (90 000), it is like a dictionary where from a code_1 and code_2 I get a value named CODEVAL:

CODE_1 CODE_2 CODEVAL

  0       17     VS
  0       34     SS

As you expected, usually I create 2 tables one for each file and a typical query is :

Select CODEVAL, hour(date) AS HEURE, COUNT(*) AS NBSMSSOR 
From Tables_1 Join CODEVAL using(CODE_1,CODE_2) 
Where CODEVAL='SS'

Sorry for the presentation, I do not know how to make a array.


Here it is a sample of my data composed by 2 csv file :

  • the biggest one (100 Millions rows) is composed as follow :

    ID DATE NUM_1 NUM_2 NB_UNITE TYPUNIT CODE_1 CODE_2

    0 2007-05-13 15:37:48 33671624244 33698802900 547 s 0 17
    0 2007-05-13 15:52:22 33671624244 33672211799 5 s 0 17 ....

  • the second one is more simple and small (90 000), it's like a dictionary where from a code_1 and code_2 I get a value named CODEVAL:

    CODE_1 CODE_2 CODEVAL

    0 17 VS

    0 34 SS

As you expected, usually I create 2 tables one for each file and a typical query is :

  • Select CODEVAL, hour(date) AS HEURE, COUNT(*) AS NBSMSSOR From Tables_1 Join CODEVAL using(CODE_1,CODE_2) Where CODEVAL='SS'

Sorry for the presentation, I do not know how to make a array.

like image 457
C. Oran Avatar asked May 22 '11 19:05

C. Oran


1 Answers

It looks to me like you're I/O bound. It doesn't help that your data is over a network. I suspect that if you just add more machines then your performance will go DOWN because of the extra contention. Remember that there's still just one spindle and just one HD head reading your data. For the MPI solution I'd suggest making multiple copies of the data and putting them on the servers themselves.

For MySQL, I hear what you're saying. I found MySQL to be very inefficient with joins. I looks to me like it does full-table scans when it could get away without them. I remember MySQL taking over a minute on a query that Oracle would take less than a second. Maybe try PostgreSQL? I'm not sure if it's any better. Another approach could be to have the db sort the data for you so that you can then do the scan without a hashmap.

Unless your records are ginormous, 100M records shouldn't be that bad.

like image 171
Adam Avatar answered Oct 20 '22 04:10

Adam