Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive Sql dynamically get null column counts from a table

I am using datastax + spark integration and spark SQL thrift server, which gives me a Hive SQL interface to query the tables in Cassandra.

The tables in my database get dynamically created, what I want to do is get a count of null values in each column for the table based on just the table name.

I can get the column names using describe database.table but in hive SQL, how do I use its output in another select query which counts null for all the columns.

Update 1: Traceback with Dudu's solution

Error running query: TExecuteStatementResp(status=TStatus(errorCode=0, errorMessage="org.apache.spark.sql.AnalysisException: Invalid usage of '*' in explode/json_tuple/UDTF;", sqlState=None, infoMessages=["org.apache.hive.service.cli.HiveSQLException:org.apache.spark.sql.AnalysisException: Invalid usage of '' in explode/json_tuple/UDTF;:16:15", 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:258', 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:runInternal:SparkExecuteStatementOperation.scala:152', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:257', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:388', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:369', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:262', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:437', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1142', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:617', 'java.lang.Thread:run:Thread.java:745'], statusCode=3), operationHandle=None)

like image 754
rajat Avatar asked Jun 16 '17 09:06

rajat


Video Answer


1 Answers

In the following solution there is no need to deal with each column separately. The result is a column index and the number of null values in that column.
You can later on join it by the column index to an information retrieved from the metastore.
One limitations is that strings containning the exact text null will be counted as nulls.

Demo

The CTE (mytable as defined by with mytable as) can obviously be replaced by as actual table

with        mytable as 
            (
                select  stack
                        (
                            5
                            
                           ,1   ,1.2     ,date '2017-06-21'     ,null
                           ,2   ,2.3     ,null                  ,null
                           ,3   ,null    ,null                  ,'hello'
                           ,4   ,4.5     ,null                  ,'world'
                           ,5   ,null    ,date '2017-07-22'     ,null
                        ) as (id,amt,dt,txt)
            )

select      pe.pos                                          as col_index
           ,count(case when pe.val='null' then 1 end)       as nulls_count

from        mytable t lateral view posexplode (split(printf(concat('%s',repeat('\u0001%s',field(unhex(1),t.*,unhex(1))-2)),t.*),'\\x01')) pe

group by    pe.pos       
;

+-----------+-------------+
| col_index | nulls_count |
+-----------+-------------+
|         0 |           0 |
|         1 |           2 |
|         2 |           3 |
|         3 |           3 |
+-----------+-------------+
like image 133
David דודו Markovitz Avatar answered Jan 04 '23 15:01

David דודו Markovitz