The common query building pattern in HiveQL (and SQL in general) is to either select all columns (SELECT *) or an explicitly-specified set of columns (SELECT A, B, C). SQL has no built-in mechanism for selecting all but a specified set of columns.
There are various mechanisms for excluding some columns, as outlined in this SO question, but none apply naturally to HiveQL. (For example, the idea to create a temporary table with SELECT * and then ALTER TABLE DROP some of its columns would wreak havoc in a big data environment.)
Ignoring the ideological discussion about whether it is a good idea to select all but some columns, this question is about the possible ways to extend Hive with this capability.
Prior to Hive 0.13.0, SELECT could take regular-expression-based columns, e.g., property_.* inside a backtick-quoted string. @invoketheshell's answer below refers to this capability, but it comes at a cost: when this capability is on, Hive cannot accept columns with non-standard characters in them, e.g., $foo or x/y. That's why the Hive developers turned this behavior off by default in 0.13.0. I am looking for a generic solution that works for any column name.
A generic table-generating UDF (UDTF) could certainly do this because it can manipulate the schema. Since we are not going to generate new rows, is there a way to solve this problem using a simple row-based UDF?
This seems like a common problem with many posts around the Web showing how to solve it for various databases yet I haven't been able to find a solution for Hive. Is there code somewhere that does this?
In this step, we exclude one or more columns from the SELECT statement. To do so, we first have to set the required properties in Hive. Suppose we do not want the values of the sports column; we can then write a query that excludes it. The property set hive.cli.print.header=true is used to show the header of the table in the output.
With the simpler UDF API, building a Hive User Defined Function involves little more than writing a class with one function (evaluate). Because a simple Hive UDF is a single function, it can also be tested with regular testing tools such as JUnit.
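As a rough illustration of the "one class, one evaluate function" shape described above, here is a minimal sketch. The class name TrimToUpper and its behavior are hypothetical and not from the original posts. In a real deployment the class would extend org.apache.hadoop.hive.ql.exec.UDF; that is left out here (an assumption made so the sketch compiles without Hive on the classpath), which also shows why such a function is easy to exercise from an ordinary test.

```java
import java.util.Locale;

// Hypothetical example class. In Hive it would be declared as
// "class TrimToUpper extends org.apache.hadoop.hive.ql.exec.UDF";
// the superclass is omitted so this compiles without Hive jars.
class TrimToUpper {
    // The simple UDF API looks for a method named evaluate.
    // Returning null for null input mirrors how SQL NULLs usually propagate.
    public String evaluate(String input) {
        if (input == null) {
            return null;
        }
        return input.trim().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        TrimToUpper udf = new TrimToUpper();
        System.out.println(udf.evaluate("  hive ")); // prints HIVE
    }
}
```

Since evaluate is a plain method, a JUnit test can call it directly on a new instance without starting Hive.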
The number of columns in a Hive table varies: some tables have many columns and others have only a few. If we want the values of all the columns, there is no challenge, since we can select * from the table.
You can choose every column except those listed in a regex-based specification. This is querying columns by exclusion. See below:
A SELECT statement can take regex-based column specification in Hive releases prior to 0.13.0, or in 0.13.0 and later releases if the configuration property hive.support.quoted.identifiers is set to none.
That being said, you could create a new table or view using the following, and all the columns except the columns specified will be returned:
set hive.support.quoted.identifiers=none;
drop table if exists database.table_name;
create table if not exists database.table_name as
select `(column_to_remove_1|...|column_to_remove_N)?+.+`
from database.some_table
where
--...
;
This will create a table that has all the columns from some_table except the columns named column_to_remove_1, ... , column_to_remove_N. You can also choose to create a view instead.
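To see why a pattern of the form (a|b)?+.+ drops exactly the named columns, it may help to try it outside Hive. The sketch below uses plain java.util.regex and assumes that Hive tests the pattern against each whole column name, case-insensitively; that is an assumption about Hive's behavior, not something stated in this thread. The class name, method name, and column names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Hive.
class ColumnExclusionDemo {
    // Returns the column names that the regex matches in full.
    static List<String> selectedColumns(String regex, List<String> columns) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<String> kept = new ArrayList<>();
        for (String column : columns) {
            if (pattern.matcher(column).matches()) {
                kept.add(column);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> columns =
            Arrays.asList("id", "name", "sports", "sports_team", "city");
        // "?+" is a possessive optional group: once it has consumed
        // "sports" or "city" it will not give the text back, so the
        // trailing ".+" has nothing left to match and the name is rejected.
        System.out.println(selectedColumns("(sports|city)?+.+", columns));
        // prints [id, name, sports_team]
    }
}
```

Note that sports_team is kept: the group consumes the sports prefix, and .+ still matches the remaining _team, so only exact names are excluded.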