Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Hive regexp_extract UDF

Tags:

hadoop

hive

I am coming across a piece of code in Apache Hive like regexp_extract(input, '[0-9]*', 0), Can someone please explain to me what this code does? Thanks

like image 917
user1653240 Avatar asked Nov 29 '22 16:11

user1653240


1 Answers

From the Hive manual DDL, it returns the string extracted using the pattern. e.g. regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns bar.

The index parameter is the capture group, which is an integer that can take the following values:

  • 0: the entire match, in my example it would be foothebar
  • 1: the first group, in my example it would be the
  • 2: the second group, in my example it would be bar
  • n: the nth group. If n is bigger than the actual number of groups defined in your regexp, your Hive query will fail.

In your example, regexp_extract(input, '[0-9]*', 0), your are looking for the whole match for your column identified by input and starting with a numerical value.

Here are a few examples:

  • regexp_extract('9eleven', '[0-9]*', 0) -> returns 9
  • regexp_extract('9eleven', '[0-9]*', 1) -> query fails
  • regexp_extract('911test', '[0-9]*', 0) -> returns 911
  • regexp_extract('911test', '[0-9]*', 1) -> query fails
  • regexp_extract('eleven', '[0-9]*', 0) -> returns empty string
  • regexp_extract('test911', '[0-9]*', 0) -> returns empty string
like image 175
Charles Menguy Avatar answered Dec 06 '22 17:12

Charles Menguy