
Hive UDF for selecting all except some columns

The common query-building pattern in HiveQL (and SQL in general) is to select either all columns (SELECT *) or an explicitly specified set of columns (SELECT A, B, C). Standard SQL has no built-in mechanism for selecting all but a specified set of columns.

There are various mechanisms for excluding some columns as outlined in this SO question but none apply naturally to HiveQL. (For example, the idea to create a temporary table with SELECT * then ALTER TABLE DROP some of its columns would wreak havoc in a big data environment.)

Ignoring the ideological discussion about whether it is a good idea to select all but some columns, this question is about the possible ways to extend Hive with this capability.

Prior to Hive 0.13.0, SELECT could take regular-expression-based column specifications, e.g., property_.* inside a backtick-quoted string. @invoketheshell's answer below refers to this capability, but it comes at a cost: while it is on, Hive cannot accept columns with non-standard characters in their names, e.g., $foo or x/y. That's why the Hive developers turned this behavior off by default in 0.13.0. I am looking for a generic solution that works for any column name.

A generic table-generating UDF (UDTF) could certainly do this because it can manipulate the schema. Since we are not going to generate new rows, is there a way to solve this problem using a simple row-based UDF?

This seems like a common problem with many posts around the Web showing how to solve it for various databases yet I haven't been able to find a solution for Hive. Is there code somewhere that does this?

Sim asked Jul 28 '15 03:07



1 Answer

You can select every column except those listed in a regex-based column specification, i.e., query columns by exclusion. See below:

A SELECT statement can take regex-based column specification in Hive releases prior to 0.13.0, or in 0.13.0 and later releases if the configuration property hive.support.quoted.identifiers is set to none.

That being said, you could create a new table or view using the following, and all the columns except the ones specified will be returned:

-- Allow regex column specifications (off by default since Hive 0.13.0).
set hive.support.quoted.identifiers=none;

drop table if       exists database.table_name;
create table if not exists database.table_name as
    select `(column_to_remove_1|...|column_to_remove_N)?+.+`
    from database.some_table
    where 
    --...
;

This will create a table that has all the columns from some_table except the columns named column_to_remove_1, ..., column_to_remove_N. You can also choose to create a view instead.
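To see why the (a|b)?+.+ pattern excludes exactly the named columns, here is a small standalone Java sketch. It assumes Hive tests the pattern as a case-insensitive full match against each column name using Java regex semantics; the class name, method names, and sample columns are made up for illustration and are not part of Hive. The possessive ?+ is the key: once the group has consumed an excluded name it will not give it back, so the trailing .+ has nothing left to match and the column is rejected, while a longer name such as ssn_hash still matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; it only mimics the assumed matching rule, it does not call Hive.
class ExcludeColumnsRegexDemo {

    // Builds a pattern of the same shape as in the query: (c1|c2|...)?+.+
    // Assumes the column names contain no regex metacharacters.
    static Pattern exclusionPattern(List<String> excluded) {
        String alternation = String.join("|", excluded);
        return Pattern.compile("(" + alternation + ")?+.+", Pattern.CASE_INSENSITIVE);
    }

    // Returns the columns whose whole name matches the pattern, i.e. the ones that would be selected.
    static List<String> selectedColumns(List<String> all, List<String> excluded) {
        Pattern pattern = exclusionPattern(excluded);
        List<String> kept = new ArrayList<>();
        for (String column : all) {
            // matches() requires the entire name to match, not just a prefix.
            if (pattern.matcher(column).matches()) {
                kept.add(column);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> all = List.of("id", "name", "ssn", "ssn_hash", "salary");
        System.out.println(selectedColumns(all, List.of("ssn", "salary")));
    }
}
```

With the sample columns above this prints [id, name, ssn_hash]: ssn and salary are dropped, but ssn_hash survives because the .+ can still match _hash after the group has consumed ssn.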

invoketheshell answered Oct 12 '22 22:10