Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Index on column with 70% of empty values: Use null or empty value?

We are currently optimizing a MySQL table (InnoDB) that will eventially have more than 100 million rows.

In one column, we are storing IP addresses (VARCHAR 45). We need to put an index on this column, as we have to be able to retrieve all rows per specific IP address.

70% of all rows, however, will not store an IP address (empty).

Our question: Shall we store those empty values as NULL and thus ALLOW NULL on this column (will add 1 byte to each row). Or shall we NOT ALLOW NULL and store those empty values as '' (empty string)?

What is best for performance?

We will never have to search rows that are empty (= '') or null (IS NULL), only search for specific IP addresses (= '123.456.789.123').

Update: There are indeed many questions on SO that address similar scenarios. However, some answers seem to be contradictory or say "it depends". We will run some tests and post our findings for our specific scenario here.

like image 740
Lionel Avatar asked Dec 19 '15 13:12

Lionel


People also ask

Does index work with NULL values?

By default, relational databases ignore NULL values (because the relational model says that NULL means "not present"). So, Index does not store NULL value, consequently if you have null condition in SQL statement, related index is ignored (by default).

Why indexes should not be used on columns that contain a high number of NULL values?

In general, indexes on binary columns are not useful. The purpose of indexes is to reduce the number of data pages that need to be read. In general, binary columns are going to have records with both values on any given data page. There are two exceptions, but the second doesn't apply to Postgres.

Can column that contain a high number of NULL values be used for indexing?

Yep, SQL Server stores the nulls in the index. That makes sense, really, because sooner or later, you're going to want to find the rows with nulls, and when you've only got a few nulls in a big table, the index can help a lot there.

Can an index be empty?

The only restriction is that elements in an Index must be nonnegative. The same value can appear multiple times in an Index. Indexes can be empty.


2 Answers

Go with NULL values. InnoDB has no space cost for NULLs, and NULL values are excluded from indexes, so you'll have a faster index lookup for the values which are present.

As far as how you store the IP itself (string verus number), that seems like a far less important point of optimization.

like image 101
manchicken Avatar answered Sep 18 '22 15:09

manchicken


VARCHAR(39) is sufficient for both IPv4 (the old format, for which there are no more values available) and IPv6.

The optimizer may screw up if 70% of the values are the same ('' or NULL). I suggest you have another table with the IP and an ID for JOINing back to your original table. By having no 'empty' IPs in the second table, the optimizer is more likely to "do the right thing".

With that, LEFT JOIN can be used to see if there is an IP.

IPv6 can be stored in BINARY(16) to save space.

like image 45
Rick James Avatar answered Sep 19 '22 15:09

Rick James