Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL find non-null columns

Tags:

sql

I have a table of time-series data of which I need to find all columns that contain at least one non-null value within a given time period. So far I am using the following query:

select max(field1),max(field2),max(field3),... 
   from series where t_stamp between x and y

Afterwards I check each field of the result if it contains a non-null value.

The table has around 70 columns and a time period can contain >100k entries.

I wonder if there if there is a faster way to do this (using only standard sql).

EDIT: Unfortunately, refactoring the table design is not an option for me.

like image 386
Wolf Avatar asked Mar 02 '23 04:03

Wolf


1 Answers

The EXISTS operation may be faster since it can stop searching as soon as it finds any row that matches the criteria (vs. the MAX which you are using). It depends on your data and how smart your SQL server is. If most of your columns have a high rate of non-null data then this method will find rows quickly and it should run quickly. If your columns are mostly NULL values then your method may be faster. I would give them both a shot and see how they are each optimized and how they run. Also keep in mind that performance may change over time if the distribution of your data changes significantly.

Also, I've only tested this on MS SQL Server. I haven't had to code strict ANSI compatible SQL in over a year, so I'm not sure that this is completely generic.

SELECT
     CASE WHEN EXISTS (SELECT * FROM Series WHERE t_stamp BETWEEN @x AND @y AND field1 IS NOT NULL) THEN 1 ELSE 0 END AS field1,
     CASE WHEN EXISTS (SELECT * FROM Series WHERE t_stamp BETWEEN @x AND @y AND field2 IS NOT NULL) THEN 1 ELSE 0 END AS field2,
...

EDIT: Just to clarify, the MAX method might be faster since it could determine those values with a single pass through the data. Theoretically, the method here could as well, and potentially with less than a full pass, but your optimizer may not recognize that all of the subqueries are related, so it might do separate passes for each. That still might be faster, but as I said it depends on your data.

like image 108
Tom H Avatar answered Mar 11 '23 13:03

Tom H