Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Assessing from the end of a split array in Hive

Tags:

hadoop

hive

I need to split a tag that looks something like "B1/AHU/_1/RoomTemp", "B1/AHU/_1/109/Temp", so with a variable with a variable number of fields. I am interested in getting the last field, or sometimes the last but one. I was disappointed to find that negative indexes do not count from the right and allow me to select the last element of an array in Hive as they do in Python.

select tag,split(tag,'[/]')[ -1] from sensor

I was more surprised when this did not work either:

select tag,split(tag,'[/]')[ size(split(tag,'[\]'))-1 ] from sensor

Both times giving me an error along the lines of this:

FAILED: SemanticException 1:27 Non-constant expressions for array indexes not supported. 
Error encountered near token '1'

So any ideas? I am kind of new to Hive. Regex's maybe? Or is there some syntactic sugar I am not aware of?

like image 426
Mike Wise Avatar asked Sep 05 '15 16:09

Mike Wise


People also ask

How split function works in Hive?

Split function in Hive Hive providing many string functions to manipulate the strings. Split is one of the string function in Hive that used to split the string depending on the pattern and returns the array of strings.

What is a Lateral view in Hive?

Lateral view explodes the array data into multiple rows. In other words, lateral view expands the array into rows. When you use a lateral view along with the explode function, you will get the result something like below.


4 Answers

This question is getting a lot of views (over a thousand now), so I think it needs a proper answer. In the event I solved it with this:

select tag,reverse(split(reverse(tag),'[/]')[0]) from sensor

which is not actually stated in the other suggested answers - I got the idea from a suggestion in the comments.

This:

  • reverses the string (so "abcd/efgh" is now "hgfe/dcba")
  • splits it on "/" into an array (so we have "hgfe" and "dcba")
  • extracts the first element (which is "hgfe")
  • then finally re-reverses (giving us the desired "efgh")

Also note that the second-to-last element can be retrieved by substituting 1 for the 0, and so on for the others.

like image 144
Mike Wise Avatar answered Oct 14 '22 04:10

Mike Wise


There is a great library of Hive UDFs here. One of them is LastIndexUDF(). It's pretty self-explainatory, it retrieves the last element of an array. There are instructions to build and use the jar on the main page. Hope this helps.

like image 40
o-90 Avatar answered Oct 14 '22 03:10

o-90


This seem to work for me, this returns the last element from the SPLIT array

SELECT SPLIT(INPUT__FILE__NAME,'/')[SIZE(SPLIT(INPUT__FILE__NAME,'/')) -1 ] from test_table limit 10;
like image 35
dedricF Avatar answered Oct 14 '22 04:10

dedricF


After reading the LanguageManual UDF a while, I luckily found the function substring_index exactly meets your requirement, dosen't need any additional calculations at all.

The manual says:

substring_index(string A, string delim, int count) returns the substring from string A before count occurrences of the delimiter delim (as of Hive 1.3.0). If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. Substring_index performs a case-sensitive match when searching for delim. Example: substring_index('www.apache.org', '.', 2) = 'www.apache'.

Use cases:

SELECT SUBSTRING_INDEX('www.mysql.com', '.', 2);
--www.mysql

SELECT SUBSTRING_INDEX('www.mysql.com', '.', -1);
--com

See here for more information.

like image 44
henry zhu Avatar answered Oct 14 '22 04:10

henry zhu