I am using datastax + spark integration and spark SQL thrift server, which gives me a Hive SQL interface to query the tables in Cassandra.
The tables in my database get dynamically created, what I want to do is get a count of null values in each column for the table based on just the table name.
I can get the column names using describe database.table
but in hive SQL, how do I use its output in another select query which counts null for all the columns.
Update 1: Traceback with Dudu's solution
Error running query: TExecuteStatementResp(status=TStatus(errorCode=0, errorMessage="org.apache.spark.sql.AnalysisException: Invalid usage of '*' in explode/json_tuple/UDTF;", sqlState=None, infoMessages=["org.apache.hive.service.cli.HiveSQLException:org.apache.spark.sql.AnalysisException: Invalid usage of '' in explode/json_tuple/UDTF;:16:15", 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:258', 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:runInternal:SparkExecuteStatementOperation.scala:152', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:257', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:388', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:369', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:262', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:437', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1142', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:617', 'java.lang.Thread:run:Thread.java:745'], statusCode=3), operationHandle=None)
In the following solution there is no need to deal with each column separately.
The result is a column index and the number of null values in that column.
You can later on join it by the column index to an information retrieved from the metastore.
One limitations is that strings containning the exact text null
will be counted as nulls.
The CTE (mytable
as defined by with mytable as
) can obviously be replaced by as actual table
with mytable as
(
select stack
(
5
,1 ,1.2 ,date '2017-06-21' ,null
,2 ,2.3 ,null ,null
,3 ,null ,null ,'hello'
,4 ,4.5 ,null ,'world'
,5 ,null ,date '2017-07-22' ,null
) as (id,amt,dt,txt)
)
select pe.pos as col_index
,count(case when pe.val='null' then 1 end) as nulls_count
from mytable t lateral view posexplode (split(printf(concat('%s',repeat('\u0001%s',field(unhex(1),t.*,unhex(1))-2)),t.*),'\\x01')) pe
group by pe.pos
;
+-----------+-------------+
| col_index | nulls_count |
+-----------+-------------+
| 0 | 0 |
| 1 | 2 |
| 2 | 3 |
| 3 | 3 |
+-----------+-------------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With