 

What's the optimal way to store binary flags / boolean values in each database engine?

I've seen some possible approaches (in some database engines some of them are synonyms):

  1. TINYINT(1)
  2. BOOL
  3. BIT(1)
  4. ENUM(0,1)
  5. CHAR(0) NULL

All major database engines supported by PHP should be covered; for reference, it would be even better if other engines were covered too.

I'm asking for the design that is best optimized for reading, e.g. SELECTing with the flag field in the WHERE condition, or GROUP BY the flag. Performance is much more important than storage space (except when the size has an impact on performance).

And some more details:

While creating the table I can't know if it'll be sparse (if most flags are on or off), but I can ALTER the tables later on, so if there is something I can optimize if I know that, it should be noted.

Also, if it makes a difference whether there is only one flag (or a few) per row versus many flags, that should be noted.

BTW, I've read the following somewhere on SO:

Using boolean may do the same thing as using tinyint, however it has the advantage of semantically conveying what your intention is, and that's worth something.

Well, in my case it isn't worth anything, because each table is represented by a class in my application, and everything is explicitly defined in the class and well documented.

xun asked Dec 26 '10 22:12





1 Answer

This answer is for ISO/IEC/ANSI Standard SQL, and includes the better freeware pretend-SQLs.

First problem is you have identified two Categories, not one, so they cannot be reasonably compared.

A. Category One

(1), (4) and (5) contain multiple possible values and are one category. All can be easily and effectively used in the WHERE clause. They have the same storage, so neither storage nor read performance is an issue. Therefore the remaining choice is simply based on the actual Datatype for the purpose of the column.

ENUM is non-standard; the better or standard method is to use a lookup table; then the values are visible in a table, not hidden, and can be enumerated by any report tool. The read performance of ENUM will suffer a small hit due to the internal processing.
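As a rough illustration of the lookup-table approach, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for whichever platform you use, and the tables `order_status` and `customer_order`, along with their columns and values, are hypothetical names invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
    -- Hypothetical lookup table: the permitted values are rows, not part of a type definition
    CREATE TABLE order_status (
        status_code CHAR(1)     NOT NULL PRIMARY KEY,
        name        VARCHAR(20) NOT NULL UNIQUE
    );
    INSERT INTO order_status VALUES ('N', 'New'), ('S', 'Shipped'), ('C', 'Cancelled');

    -- Hypothetical referencing table: the status must exist in the lookup table
    CREATE TABLE customer_order (
        order_id    INTEGER NOT NULL PRIMARY KEY,
        status_code CHAR(1) NOT NULL REFERENCES order_status (status_code)
    );
    INSERT INTO customer_order VALUES (1, 'N'), (2, 'S'), (3, 'S');
""")

# Any report tool can enumerate the permitted values with a plain SELECT.
allowed = [row[0] for row in conn.execute("SELECT name FROM order_status ORDER BY name")]

# A value that is not in the lookup table is rejected by the foreign key.
try:
    conn.execute("INSERT INTO customer_order VALUES (4, 'X')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(allowed, rejected)
```

Adding a new permitted value is then an INSERT into the lookup table rather than an ALTER of the column definition.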

B. Category Two

(2) and (3) are Two-Valued elements: True/False; Male/Female; Dead/Alive. That category is different to Category One. Its treatment both in your data model, and in each platform, is different. BOOLEAN is just a synonym for BIT, they are the same thing. Legally (SQL-wise) they are handled the same by all SQL-compliant platforms, and there is no problem using either in the WHERE clause.

The difference in performance depends on the platform. Sybase and DB2 pack up to 8 BITs into one byte (not that storage matters here), and map the power-of-two on the fly, so performance is really good. Oracle does different things in each version, and I have seen modellers use CHAR(1) instead of BIT, to overcome performance problems. MS was fine up to 2005 but they have broken it with 2008, as in the results are unpredictable; so the short answer may be to implement it as CHAR(1).

Of course, the assumption is that you do not do silly things such as pack 8 separate columns into one TINYINT. Not only is that a serious Normalisation error, it is a nightmare for coders. Keep each column discrete and of the correct Datatype.
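To make the contrast concrete, the sketch below compares a packed flags column with discrete columns, again using Python's sqlite3 module as a stand-in platform. The tables `user_packed` and `user_discrete`, the bit layout, and the index name are all assumptions made up for the illustration. SQLite has no dedicated BIT or BOOLEAN storage type, so the discrete flags are emulated with INTEGER columns and CHECK constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Packed: three hypothetical flags hidden in one integer
    -- (bit 0 = active, bit 1 = admin, bit 2 = banned)
    CREATE TABLE user_packed (
        user_id INTEGER NOT NULL PRIMARY KEY,
        flags   INTEGER NOT NULL
    );
    INSERT INTO user_packed VALUES (1, 1), (2, 3), (3, 5), (4, 2);

    -- Discrete: one column per fact, each limited to 0 or 1
    CREATE TABLE user_discrete (
        user_id   INTEGER NOT NULL PRIMARY KEY,
        is_active INTEGER NOT NULL CHECK (is_active IN (0, 1)),
        is_admin  INTEGER NOT NULL CHECK (is_admin  IN (0, 1)),
        is_banned INTEGER NOT NULL CHECK (is_banned IN (0, 1))
    );
    INSERT INTO user_discrete VALUES (1, 1, 0, 0), (2, 1, 1, 0), (3, 1, 0, 1), (4, 0, 1, 0);
    CREATE INDEX user_discrete_admin ON user_discrete (is_admin);
""")

# Packed: the coder must know the bit layout, and the predicate is an expression on the column.
packed_admins = [r[0] for r in conn.execute(
    "SELECT user_id FROM user_packed WHERE (flags & 2) <> 0 ORDER BY user_id")]

# Discrete: the predicate names the fact directly and compares a plain column.
discrete_admins = [r[0] for r in conn.execute(
    "SELECT user_id FROM user_discrete WHERE is_admin = 1 ORDER BY user_id")]

print(packed_admins, discrete_admins)
```

Both queries return the same users, but only the discrete form is self-describing, and only there can each flag be constrained and indexed as an ordinary column.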

C. Multiple Indicator & Nullable Columns

This has nothing to do with, and is independent of, (A) and (B). What the column's correct Datatype is, is separate to how many you have and whether it is Nullable. Nullable means (usually) the column is optional. Essentially you have not completed the modelling or Normalisation exercise. The Functional Dependencies are ambiguous. If you complete the Normalisation exercise, there will be no Nullable columns, no optional columns; either they clearly exist for a particular relation, or they do not exist. That means using the ordinary Relational structure of Supertype-Subtypes.

Sure, that means more tables, but no Nulls. Enterprise DBMS have no problem with more tables or more joins, that is what they are optimised for. Normalised databases perform much better than unnormalised or denormalised ones, and they can be extended without "re-factoring". You can ease the use by supplying a View for each Subtype.
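A minimal sketch of a Supertype-Subtype pair with a View over the Subtype follows, using Python's sqlite3 module as a stand-in platform. The names `person`, `employee`, `employee_v`, and their columns are hypothetical; the point is only that the optional fact (a salary) lives in its own table instead of a Nullable column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
    -- Supertype: columns that every person has
    CREATE TABLE person (
        person_id INTEGER     NOT NULL PRIMARY KEY,
        name      VARCHAR(40) NOT NULL
    );

    -- Subtype: a row exists only for persons who are employees, so salary is never NULL
    CREATE TABLE employee (
        person_id INTEGER NOT NULL PRIMARY KEY REFERENCES person (person_id),
        salary    INTEGER NOT NULL
    );

    -- One View per Subtype hides the join from application code
    CREATE VIEW employee_v AS
        SELECT p.person_id, p.name, e.salary
        FROM person p
        JOIN employee e ON e.person_id = p.person_id;

    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO employee VALUES (1, 50000);
""")

employees = conn.execute("SELECT person_id, name, salary FROM employee_v").fetchall()
print(employees)
```

Bob is a person without an employee row, so nothing about him has to be stored as Null.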

If you want more information on this subject, look at this question/answer. If you need help with the modelling, please ask a new question. At your level of questioning, I would advise that you stick with 5NF.

D. Performance of Nulls

Separately, if performance is important to you, then exclude Nulls. Each Nullable column is stored as variable length; that requires additional processing for each row/column. The enterprise databases use a "deferred" handling for such rows, to allow the logging, etc. to move through the queues without impeding the fixed rows. In particular, never use variable length columns (that includes Nullable columns) in an Index: that requires unpacking on every access.
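For illustration, a flag column can be declared NOT NULL with a default so that every row carries a definite value. The sketch below uses Python's sqlite3 module as a stand-in platform, with a hypothetical `task` table and `is_done` column; it says nothing about how any particular engine stores the column internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table: the flag is mandatory, defaults to 0, and accepts only 0 or 1
    CREATE TABLE task (
        task_id INTEGER NOT NULL PRIMARY KEY,
        is_done INTEGER NOT NULL DEFAULT 0 CHECK (is_done IN (0, 1))
    );
    INSERT INTO task (task_id) VALUES (1), (4);  -- flag falls back to the default, 0
    INSERT INTO task (task_id, is_done) VALUES (2, 1), (3, 1), (5, 1);
""")

# An explicit NULL is rejected, so no row can carry an "unknown" flag.
try:
    conn.execute("INSERT INTO task (task_id, is_done) VALUES (6, NULL)")
    null_rejected = False
except sqlite3.IntegrityError:
    null_rejected = True

# The flag can be used directly in WHERE and GROUP BY, with no third state to handle.
done_ids = [r[0] for r in conn.execute(
    "SELECT task_id FROM task WHERE is_done = 1 ORDER BY task_id")]
counts = dict(conn.execute("SELECT is_done, COUNT(*) FROM task GROUP BY is_done"))

print(null_rejected, done_ids, counts)
```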

E. Poll

Finally, I do not see the point in this question being a poll. It is fair enough that you will get technical answers, and even opinions, but polls are for popularity contests, and the technical ability of responders at SO covers a very wide range, so the most popular answers and the most technically correct answers are at two different ends of the spectrum.

PerformanceDBA answered Sep 22 '22 14:09
