Hive UDF for selecting all except some columns

A common query-building pattern in HiveQL (and SQL in general) is to select either all columns (SELECT *) or an explicitly specified set of columns (SELECT A, B, C). SQL has no built-in mechanism for selecting all but a specified set of columns.

There are various mechanisms for excluding some columns, as outlined in this SO question, but none applies naturally to HiveQL. (For example, the idea to create a temporary table with SELECT * and then ALTER TABLE DROP some of its columns would wreak havoc in a big data environment.)

Ignoring the ideological discussion about whether it is a good idea to select all but some columns, this question is about the possible ways to extend Hive with this capability.

Prior to Hive 0.13.0, SELECT could take regular-expression-based column specifications, e.g., property_.* inside a backtick-quoted string. @invoketheshell's answer below relies on this capability, but it comes at a cost: when it is on, Hive cannot accept column names containing non-standard characters, e.g., $foo or x/y. That's why the Hive developers turned this behavior off by default in 0.13.0. I am looking for a generic solution that works for any column name.

A generic table-generating UDF (UDTF) could certainly do this because it can manipulate the schema. Since we are not going to generate new rows, is there a way to solve this problem using a simple row-based UDF?

This seems like a common problem, with many posts around the Web showing how to solve it for various databases, yet I haven't been able to find a solution for Hive. Is there code somewhere that does this?

asked Jul 28 '15 03:07

Sim

1 Answer

You can select every column except those listed in a regex-based specification, that is, query columns by exclusion. The Hive documentation describes the feature as follows:

A SELECT statement can take regex-based column specification in Hive releases prior to 0.13.0, or in 0.13.0 and later releases if the configuration property hive.support.quoted.identifiers is set to none.

That being said, you can create a new table or view using the following; it returns all columns except the ones specified:

set hive.support.quoted.identifiers=none;

drop table if       exists database.table_name;
create table if not exists database.table_name as
    select `(column_to_remove_1|...|column_to_remove_N)?+.+`
    from database.some_table
    where 
    --...
;

This will create a table that has all the columns from some_table except the columns named column_to_remove_1, ..., column_to_remove_N. You can also choose to create a view instead.
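To see why the exclusion pattern works, here is a minimal Java sketch of the regex semantics. It assumes that Hive tests the backtick-quoted regex against each whole column name using Java regular expressions, case-insensitively; the class name ColumnExclusionDemo, the method selectColumns, and the column names are hypothetical and only serve the illustration. The key is the possessive quantifier ?+: once the optional group has consumed an excluded name, it never gives those characters back, so the trailing .+ has nothing left to match and the column is rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class: not part of Hive, only illustrates the regex behavior.
final class ColumnExclusionDemo {

    // Returns the names that fully match the regex. Full-name, case-insensitive
    // matching is an assumption about how Hive applies a quoted column regex.
    static List<String> selectColumns(String regex, List<String> columns) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<String> kept = new ArrayList<>();
        for (String column : columns) {
            if (pattern.matcher(column).matches()) {
                kept.add(column);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative column names; "ssn" and "salary" are the ones to exclude.
        List<String> columns = List.of("id", "name", "ssn", "salary", "ssn_hash");
        String regex = "(ssn|salary)?+.+";

        // "ssn" and "salary" are consumed entirely by the possessive group, so
        // ".+" fails on them. "ssn_hash" survives because "_hash" is left over.
        System.out.println(selectColumns(regex, columns)); // prints [id, name, ssn_hash]
    }
}
```

Note that only exact names are excluded: a column whose name merely starts with an excluded name, such as ssn_hash here, is still selected.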

answered Oct 12 '22 22:10

invoketheshell

Tags: apache-spark, apache-spark-sql, hive, hiveql, udf
