MySQL insert, faster in PHP than C++, is this expected?

Recently I was tasked with running a few speed checks to tell whether it is faster to use PHP (php/php-cli) or C++ to insert a certain number of rows into a database.

Before we start, let me tell you a few details so everything is clear:

  • The PHP part runs through Apache, requested directly in the browser.
  • The drive the tests run on is an SSD; I guess things would be slower on regular drives. The machine itself is nothing special, six years old or so.
  • All inserts are done via prepared statements: mysqli in PHP and mysqlcppconn (the MySQL C++ connector provided by Oracle) in C++.
  • All inserts are done entry by entry. I know they can be stacked, but well, we're testing here.
  • Times are measured via microtime in PHP and via a timing header in C++.
  • The code itself is not equivalent, of course. More on that later.
  • All text is UTF-8. There's Russian, Chinese, Arabic, Spanish, English and all kinds of crazy stuff in there. The MySQL table uses utf8mb4.
  • The C++ numbers come from using an std::vector, compiled with g++ at -O2 (vectors outperformed maps, unordered_maps and std::arrays).
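As an aside, "stacking" the inserts mentioned above usually means a multi-row INSERT so that N rows travel to the server in one statement. This is only an illustrative sketch of building such a statement (the table name and placeholder layout are made up here, not taken from the actual test code):

```cpp
#include <string>

// Build a multi-row INSERT with one (?,?,...) placeholder group per row,
// so N rows can be sent to the server in a single prepared statement.
std::string build_batch_insert(const std::string& table, int columns, int rows)
{
    std::string group = "(";
    for (int c = 0; c < columns; ++c)
        group += (c ? ",?" : "?");
    group += ")";

    std::string sql = "INSERT INTO " + table + " VALUES ";
    for (int r = 0; r < rows; ++r)
        sql += (r ? "," : "") + group;
    return sql;
}
```

For example, `build_batch_insert("t", 3, 2)` yields `INSERT INTO t VALUES (?,?,?),(?,?,?)`.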

So, this is the process:

  • Connect to the database.
  • Open a text file with N lines.
  • Read a line of the file.
  • Split the line on a separator character.
  • Use certain parts of the split line to get insert values (say, the 0th, 1st and 3rd indexes).
  • Send these parts to the prepared statement to insert them.
  • Repeat until the file is completely read.

Both programs work exactly as expected. Here are the resulting numbers:

php:

  • 5000 entries: 1.42 - 1.27 sec.
  • 20000 entries: 5.53 - 6.18 sec.
  • 50000 entries: 14.43 - 15.69 sec.

c++:

  • 5000 entries: 1.78 - 1.81 sec.
  • 20000 entries: 7.19 - 7.22 sec.
  • 50000 entries: 18.52 - 18.84 sec.

PHP outperforms C++ as the line count increases... At first I suspected the line-splitting function: in PHP the splitting is done with "explode", while the C++ algorithm is as naive as it comes. The container is passed by reference and its contents are changed in place; it is traversed only once. I made sure the container reserve()s all necessary space up front (remember, I finally chose vectors), and its size is fixed. The container is created in main and then passed by reference through the code. It is never emptied or resized: only its contents change.

// Splits p_string on p_delimiter, writing each field into p_result.
// Precondition: p_result already holds at least as many elements as
// there are fields (the container must be pre-sized, not just
// reserve()d, since we assign through its iterators).
template<typename container>
void explode(const std::string& p_string, const char p_delimiter, container& p_result)
{
    auto it=p_result.begin();
    std::string::const_iterator beg=p_string.begin(), end=p_string.end();
    std::string temp;

    while(beg < end)
    {
        if(*beg==p_delimiter)
        {
            *it=temp;
            ++it;
            temp.clear();
        }
        else
        {
            temp+=*beg;
        }

        ++beg;
    }

    *it=temp; // the last field has no trailing delimiter
}
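Since the profile below points at the splitter, one variant worth trying (an assumption on my part, not something benchmarked here) replaces the per-character append with std::string::find plus substr, so each field is copied in a single operation. Note it differs from the original in that it clears and grows the output container instead of assigning into a pre-sized one:

```cpp
#include <string>
#include <vector>

// Split using find()/substr(): each field is extracted in one copy
// instead of being built up character by character.
void explode_find(const std::string& s, char delim, std::vector<std::string>& out)
{
    out.clear();
    std::string::size_type start = 0, pos;
    while ((pos = s.find(delim, start)) != std::string::npos)
    {
        out.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
    out.push_back(s.substr(start)); // last field (no trailing delimiter)
}
```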

As said before, the task performed is equivalent, but the code generating it is not. The C++ code has the usual try-catch blocks controlling the MySQL interactions. As for the rest, the main loop runs until EOF is reached, and every iteration checks whether the insertion failed (both in C++ and PHP).

I have seen C++ greatly outperform PHP when working with files and their contents, so I expected the same to apply here. I suspect the splitting algorithm, but maybe the database connector is just slower (still, even with database interaction disabled, PHP processed faster), or my code is subpar...

As far as profiling goes, gprof spat this out about the c++ code:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 60.00      0.03     0.03    50000   600.00   600.00  void anc_str::explotar_cadena<std::vector<std::string, std::allocator<std::string> > >(std::string const&, char, std::vector<std::string, std::allocator<std::string> >&)
 40.00      0.05     0.02                             insertar(sql::PreparedStatement*, std::string const&, std::vector<std::string, std::allocator<std::string> >&)
  0.00      0.05     0.00        1     0.00     0.00  _GLOBAL__sub_I__ZN7anc_str21obtener_linea_archivoERSt14basic_ifstreamIcSt11char_traitsIcEE

Where "explotar_cadena" is "explode" and "insertar" is "split this line and set up the prepared statement". As you can see, 60% of the time is spent there (not surprising... it runs 50000 times and does this crazy splitting thing). "obtener_linea_archivo" is just "please, dump the next line into the string".

Without mysql interaction (just load the file, read the lines and split them) I get these measurements:

php

  • 5000 entries: 0.019 - 0.036 sec.
  • 20000 entries: 0.09 - 0.10 sec.
  • 50000 entries: 0.14 - 0.17 sec.

c++

  • 5000 entries: 0.07 - 0.10 sec.
  • 20000 entries: 0.25 - 0.26 sec.
  • 50000 entries: 0.49 - 0.55 sec.

Okay, both sets of times are good and hardly noticeable in real-life terms. Still, I am surprised... So the question is: am I supposed to expect this? Anyone with prior experience willing to lend a hand?

Thanks in advance.

Edit: Here is a quick link to a stripped-down version containing the input files, C++ code and PHP code [ http://www.datafilehost.com/d/d31034d6 ]. Note that there is no SQL interaction: only file opening, string splitting and time measuring. Please forgive the butchered code and half-Spanish comments and variable names, as this was done in a hurry. Also, note the gprof results above: I am no expert, but I think we're trying to find a better way of splitting the string.

Asked Nov 10 '22 by The Marlboro Man



1 Answer

Some of it might have to do with the driver/interface used in each language. For example, with PHP/MySQL you will probably find that mysqli is faster than the old mysql extension, which in turn is faster than PDO, because the libraries are progressively more abstract (or less maintained). You might try profiling the queries themselves on the database server to see whether there is any difference in execution time. Then again, there may be more going on, as other commenters have noted.

Answered Nov 14 '22 by Ixalmida
