I have a project and I have to improve its performance. I have a big Mysql database constructed from a huge CSV file (100 millions rows). Insert time is not a problem, but response time of a request is much important and sometimes a query with 2 join takes around 20 hours...
In a goal to reduce this response time, I tried to migrate my database to Cassandra but without success : my data model is not appropried to Cassandra concepts. Then I would like to try another way to improve performance : Parallel Virutal File System. Instead insert data in Mysql database and send then send a query, I tried to read the whole csv file with multi-threading and did my computing. But the result was not good : 2m20s for only 1 000 000 rows.
For the moment, my computing is very simple : in C++ with the MPI-IO API, I just count the number of differents pair value from 2 columns. To perfom that computing, I use a hashmap where each key is a pair value from csv file. At the end, I return the hashmap size. Here a small code :
MPI::Init(argc,argv);
cout << " INFO init done" << endl;
int myrank = MPI::COMM_WORLD.Get_rank();
int numprocs = MPI::COMM_WORLD.Get_size();
get_filename(path_name, myrank);
cout << " INFO open file : " << path_name << endl;
MPI::File thefile = MPI::File::Open(MPI::COMM_WORLD, path_name.c_str(),
MPI::MODE_RDONLY,
MPI::INFO_NULL);
MPI::Offset offset = 101;
MPI::Offset limit = thefile.Get_size();
cout << " INFO go computing" << endl;
do {
thefile.Read_at(offset, buf, bufsize, MPI_CHAR, status);
temp.assign(buf);
Tokenize(temp,tokens,"\n");
line.assign(tokens.at(0));
tokens.clear();
Tokenize(line,tokens,"\t");
nidt_count(tokens);
tokens.clear();
offset += (line.size() + 1);
}while(offset < limit);
count = status.Get_count(MPI_INT);
cout << "process " << myrank << " reads " << nidt_hash.size() << " nidt" << endl;
I work on a server with 4 cores, 8GB of ram. My data is on a NAS mounted in NFS or Samba on my server. I would be able to add 2 or 3 server for processing, but for the moment I just tried on a small file (1 Million rows) on one server to measure performance.
Finally my questions are :
Here it is a sample of my data composed by 2 csv file :
The biggest one (100 Millions rows) is composed as follow :
ID DATE NUM_1 NUM_2 NB_UNITE TYPUNIT CODE_1 CODE_2
0 2007-05-13 15:37:48 33671624244 33698802900 547 s 0 17
0 2007-05-13 15:52:22 33671624244 33672211799 5 s 0 17
....
The second one is more simple and small (90 000), it is like a dictionary where from a code_1
and code_2
I get a value named CODEVAL:
CODE_1 CODE_2 CODEVAL
0 17 VS
0 34 SS
As you expected, usually I create 2 tables one for each file and a typical query is :
Select CODEVAL, hour(date) AS HEURE, COUNT(*) AS NBSMSSOR
From Tables_1 Join CODEVAL using(CODE_1,CODE_2)
Where CODEVAL='SS'
Sorry for the presentation, I do not know how to make a array.
Here it is a sample of my data composed by 2 csv file :
the biggest one (100 Millions rows) is composed as follow :
ID DATE NUM_1 NUM_2 NB_UNITE TYPUNIT CODE_1 CODE_2
0 2007-05-13 15:37:48 33671624244 33698802900 547 s 0 17
0 2007-05-13 15:52:22 33671624244 33672211799 5 s 0 17 ....
the second one is more simple and small (90 000), it's like a dictionary where from a code_1 and code_2 I get a value named CODEVAL:
CODE_1 CODE_2 CODEVAL
0 17 VS
0 34 SS
As you expected, usually I create 2 tables one for each file and a typical query is :
Sorry for the presentation, I do not know how to make a array.
It looks to me like you're I/O bound. It doesn't help that your data is over a network. I suspect that if you just add more machines then your performance will go DOWN because of the extra contention. Remember that there's still just one spindle and just one HD head reading your data. For the MPI solution I'd suggest making multiple copies of the data and putting them on the servers themselves.
For MySQL, I hear what you're saying. I found MySQL to be very inefficient with joins. I looks to me like it does full-table scans when it could get away without them. I remember MySQL taking over a minute on a query that Oracle would take less than a second. Maybe try PostgreSQL? I'm not sure if it's any better. Another approach could be to have the db sort the data for you so that you can then do the scan without a hashmap.
Unless your records are ginormous, 100M records shouldn't be that bad.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With