Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Should I use a JOIN function or run several queries in a loop structure?

I have this 2 mysql tables: TableA and TableB

TableA
* ColumnAId
* ColumnA1
* ColumnA2
TableB
* ColumnBId
* ColumnAId
* ColumnB1
* ColumnB2

In PHP, I wanted to have this multidimensional array format

$array = array(
    array(
        'ColumnAId' => value,
        'ColumnA1' => value,
        'ColumnA2' => value,
        'TableB' => array(
            array(
                'ColumnBId' => value,
                'ColumnAId' => value,
                'ColumnB1' => value,
                'ColumnB2' => value
            )
        )
    )
);

so that I can loop it in this way

foreach($array as $i => $TableA) {
    echo 'ColumnAId' . $TableA['ColumnAId'];
    echo 'ColumnA1' . $TableA['ColumnA1'];
    echo 'ColumnA2' . $TableA['ColumnA2'];
    echo 'TableB\'s';
    foreach($value['TableB'] as $j => $TableB) {
        echo $TableB['...']...
        echo $TableB['...']...
    }
}

My problem is that, what is the best way or the proper way of querying MySQL database so that I can achieve this goal?

Solution1 --- The one I'm using

$array = array();
$rs = mysqli_query("SELECT * FROM TableA", $con);
while ($row = mysqli_fetch_assoc($rs)) {
    $rs2 = mysqli_query("SELECT * FROM Table2 WHERE ColumnAId=" . $row['ColumnAId'], $con);
    // $array = result in array
    $row['TableB'] = $array2;
}

I'm doubting my code cause its always querying the database.

Solution2

$rs = mysqli_query("SELECT * FROM TableA JOIN TableB ON TableA.ColumnAId=TableB.ColumnAId");
while ($row = mysqli_fet...) {
    // Code
}

The second solution only query once, but if I have thousand of rows in TableA and thousand of rows in TableB for each TableB.ColumnAId (1 TableA.ColumnAId = 1000 TableB.ColumnAId), thus this solution2 takes much time than the solution1?

like image 549
iSa Avatar asked Aug 16 '13 13:08

iSa


People also ask

Is join faster than two queries?

I won't leave you in suspense, between Joins and Subqueries, joins tend to execute faster. In fact, query retrieval time using joins will almost always outperform one that employs a subquery. The reason is that joins mitigate the processing burden on the database by replacing multiple queries with one join query.

Should I avoid joins?

Joins are slow, avoid them if possible. You cannot avoid joins in all cases, joins are necessary for some tasks. If you want help with some query optimizing, please provide more details. Everything matters: query, data, indexes, plan, etc.

What is a reason to use joins?

Joins are a more static way to combine data. Joins must be defined between physical tables up front, before analysis, and can't be changed without impacting all sheets using that data source. Joined tables are always merged into a single table.

Which joins are more efficient?

TLDR: The most efficient join is also the simplest join, 'Relational Algebra'. If you wish to find out more on all the methods of joins, read further. Relational algebra is the most common way of writing a query and also the most natural way to do so.


3 Answers

Neither of the two solutions proposed are probably optimal, BUT solution 1 is UNPREDICTABLE and thus INHERENTLY FLAWED!

One of the first things you learn when dealing with large databases is that 'the best way' to do a query is often dependent upon factors (referred to as meta-data) within the database:

  • How many rows there are.
  • How many tables you are querying.
  • The size of each row.

Because of this, there's unlikely to be a silver bullet solution for your problem. Your database is not the same as my database, you will need to benchmark different optimizations if you need the best performance available.

You will probably find that applying & building correct indexes (and understanding the native implementation of indexes in MySQL) in your database does a lot more for you.

There are some golden rules with queries which should rarely be broken:

  • Don't do them in loop structures. As tempting as it often is, the overhead on creating a connection, executing a query and getting a response is high.
  • Avoid SELECT * unless needed. Selecting more columns will significantly increase overhead of your SQL operations.
  • Know thy indexes. Use the EXPLAIN feature so that you can see which indexes are being used, optimize your queries to use what's available and create new ones.

Because of this, of the two I'd go for the second query (replacing SELECT * with only the columns you want), but there are probably better ways to structure the query if you have the time to optimize.

However, speed should NOT be your only consideration in this, there is a GREAT reason not to use suggestion one:

PREDICTABILITY: why read-locks are a good thing

One of the other answers suggests that having the table locked for a long period of time is a bad thing, and that therefore the multiple-query solution is good.

I would argue that this couldn't be further from the truth. In fact, I'd argue that in many cases the predictability of running a single locking SELECT query is a greater argument FOR running that query than the optimization & speed benefits.

First of all, when we run a SELECT (read-only) query on a MyISAM or InnoDB database (default systems for MySQL), what happens is that the table is read-locked. This prevents any WRITE operations from happening on the table until the read-lock is surrendered (either our SELECT query completes or fails). Other SELECT queries are not affected, so if you're running a multi-threaded application, they will continue to work.

This delay is a GOOD thing. Why, you may ask? Relational data integrity.

Let's take an example: we're running an operation to get a list of items currently in the inventory of a bunch of users on a game, so we do this join:

SELECT * FROM `users` JOIN `items` ON `users`.`id`=`items`.`inventory_id` WHERE `users`.`logged_in` = 1;

What happens if, during this query operation, a user trades an item to another user? Using this query, we see the game state as it was when we started the query: the item exists once, in the inventory of the user who had it before we ran the query.

But, what happens if we're running it in a loop?

Depending on whether the user traded it before or after we read his details, and in which order we read the inventory of the two players, there are four possibilities:

  1. The item could be shown in the first user's inventory (scan user B -> scan user A -> item traded OR scan user B -> scan user A -> item traded).
  2. The item could be shown in the second user's inventory (item traded -> scan user A -> scan user B OR item traded -> scan user B -> scan user A).
  3. The item could be shown in both inventories (scan user A -> item traded -> scan user B).
  4. The item could be shown in neither of the user's inventories (scan user B -> item traded -> scan user A).

What this means is that we would be unable to predict the results of the query or to ensure relational integrity.

If you're planning to give $5,000 to the guy with item ID 1000000 at midnight on Tuesday, I hope you have $10k on hand. If your program relies on unique items being unique when snapshots are taken, you will possibly raise an exception with this kind of query.

Locking is good because it increases predictability and protects the integrity of results.

Note: You could force a loop to lock with a transaction, but it will still be slower.

Oh, and finally, USE PREPARED STATEMENTS!

You should never have a statement that looks like this:

mysqli_query("SELECT * FROM Table2 WHERE ColumnAId=" . $row['ColumnAId'], $con);

mysqli has support for prepared statements. Read about them and use them, they will help you to avoid something terrible happening to your database.

like image 140
Glitch Desire Avatar answered Oct 04 '22 01:10

Glitch Desire


Definitely second way. Nested query is an ugly thing since you're getting all query overheads (execution, network e t.c.) every time for every nested query, while single JOIN query will be executed once - i.e. all overheads will be done only once.

Simple rule is not to use queries in cycles - in general. There could be exceptions, if one query will be too complex, so due to performance in should be split, but in a certain case that can be shown only by benchmarks and measures.

like image 38
Alma Do Avatar answered Oct 04 '22 02:10

Alma Do


If you want to do algorithmic evaluation of your data in your application code (which I think is the right thing to do), you should not use SQL at all. SQL was made to be a human readable way to ask for computational achieved data from a relational database, which means, if you just use it to store data, and do the computations in your code, you're doing it wrong anyway.

In such a case you should prefer using a different storage/retrieving possibility like a key-value store (there are persistent ones out there, and newer versions of MySQL exposes a key-value interface as well for InnoDB, but it's still using a relational database for key-value storage, aka the wrong tool for the job).

If you STILL want to use your solution:

Benchmark.

I've often found that issuing multiple queries can be faster than a single query, because MySQL has to parse less query, the optimizer has less work to do, and more often than not the MySQL optimzer just fails (that's the reason things like STRAIGHT JOIN and index hints exist). And even if it does not fail, multiple queries might still be faster depending on the underlying storage engine as well as how many threads try to access the data at once (lock granularity - this only applies with mixing in update queries though - neither MyISAM nor InnoDB lock the whole table for SELECT queries by default). Then again, you might even get different results with the two solutions if you don't use transactions, as data might change between queries if you use multiple queries versus a single one.

In a nutshell: There's way more to your question than what you posted/asked for, and what a generic answer can provide.

Regarding your solutions: I'd prefer the first solution if you have an environment where a) data changes are common and/or b) you have many concurrent running threads (requests) accessing and updating your tables (lock granularity is better with split up queries, as is cacheability of the queries) ; if your database is on a different network, e.g. network latency is an issue, you're probably better of with the first solution (but most people I know have MySQL on the same server, using socket connections, and local socket connections normally don't have much latency).

Situation may also change depending on how often the for loop is actually executed.

Again: Benchmark


Another thing to consider is memory efficiency and algorithmic efficiency. Later one is about O(n) in both cases, but depending on the type of data you use to join, it could be worse in any of the two. E.g. if you use strings to join (you really shouldn't, but you don't say), performance in the more php dependent solution also depends on php hash map algorithm (arrays in php are effectively hash maps) and the likelyhood of a collision, while mysql string indexes are normally fixed length, and thus, depending on your data, might not be applicable.

For memory efficiency, the multi query version is certainly better, as you have the php array anyway (which is very inefficient in terms of memory!) in both solutions, but the join might use a temp table depending on several circumstances (normally it shouldn't, but there ARE cases where it does - you can check using EXPLAIN for your queries)

like image 39
griffin Avatar answered Oct 04 '22 02:10

griffin