Hive Sql dynamically get null column counts from a table

Question

I am using datastax + spark integration and spark SQL thrift server, which gives me a Hive SQL interface to query the tables in Cassandra.

The tables in my database get dynamically created, what I want to do is get a count of null values in each column for the table based on just the table name.

I can get the column names using describe database.table but in hive SQL, how do I use its output in another select query which counts null for all the columns.

Update 1: Traceback with Dudu's solution

Error running query: TExecuteStatementResp(status=TStatus(errorCode=0, errorMessage="org.apache.spark.sql.AnalysisException: Invalid usage of '*' in explode/json_tuple/UDTF;", sqlState=None, infoMessages=["org.apache.hive.service.cli.HiveSQLException:org.apache.spark.sql.AnalysisException: Invalid usage of '' in explode/json_tuple/UDTF;:16:15", 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:258', 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:runInternal:SparkExecuteStatementOperation.scala:152', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:257', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:388', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:369', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:262', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:437', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1142', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:617', 'java.lang.Thread:run:Thread.java:745'], statusCode=3), operationHandle=None)

David דודו Markovitz · Accepted Answer

In the following solution there is no need to deal with each column separately. The result is a column index and the number of null values in that column.
You can later on join it by the column index to an information retrieved from the metastore.
One limitations is that strings containning the exact text null will be counted as nulls.

Demo

The CTE (mytable as defined by with mytable as) can obviously be replaced by as actual table

with        mytable as 
            (
                select  stack
                        (
                            5
                            
                           ,1   ,1.2     ,date '2017-06-21'     ,null
                           ,2   ,2.3     ,null                  ,null
                           ,3   ,null    ,null                  ,'hello'
                           ,4   ,4.5     ,null                  ,'world'
                           ,5   ,null    ,date '2017-07-22'     ,null
                        ) as (id,amt,dt,txt)
            )

select      pe.pos                                          as col_index
           ,count(case when pe.val='null' then 1 end)       as nulls_count

from        mytable t lateral view posexplode (split(printf(concat('%s',repeat('\u0001%s',field(unhex(1),t.*,unhex(1))-2)),t.*),'\x01')) pe

group by    pe.pos       
;

+-----------+-------------+
| col_index | nulls_count |
+-----------+-------------+
|         0 |           0 |
|         1 |           2 |
|         2 |           3 |
|         3 |           3 |
+-----------+-------------+

Hive Sql dynamically get null column counts from a table

Tags:

apache-spark-sql

hive

hiveql

rajat

Video Answer

1 Answers

Demo

David דודו Markovitz

Recent Activity

Donate For Us

Hive Sql dynamically get null column counts from a table

Tags:

apache-spark-sql

hive

hiveql

rajat

Video Answer

1 Answers

Demo

David דודו Markovitz

Related questions

Recent Activity

Donate For Us