I have a hive table with the following schema:
COOKIE | PRODUCT_ID | CAT_ID | QTY 1234123 [1,2,3] [r,t,null] [2,1,null]
How can I normalize the arrays so I get the following result
COOKIE | PRODUCT_ID | CAT_ID | QTY 1234123 [1] [r] [2] 1234123 [2] [t] [1] 1234123 [3] null null
I have tried the following:
select concat_ws('|',visid_high,visid_low) as cookie ,pid ,catid ,qty from table lateral view explode(productid) ptable as pid lateral view explode(catalogId) ptable2 as catid lateral view explode(qty) ptable3 as qty
however the result comes out as a Cartesian product.
Description. The LATERAL VIEW clause is used in conjunction with generator functions such as EXPLODE , which will generate a virtual table containing one or more rows. LATERAL VIEW will apply the rows to each original output row.
Explode() function takes an array as an input and results elements of that array as separate rows. Select explode(column_name) from table_name; In below example, we have column technology as array of string. And if we use explode function on technology column, each value of array is separated into rows.
A lateral view first applies the UDTF to each row of base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias. Version. Prior to Hive 0.6. 0, lateral view did not support the predicate push-down optimization. In Hive 0.5.
I found a very good solution to this problem without using any UDF, posexplode is a very good solution :
SELECT COOKIE , ePRODUCT_ID, eCAT_ID, eQTY FROM TABLE LATERAL VIEW posexplode(PRODUCT_ID) ePRODUCT_IDAS seqp, ePRODUCT_ID LATERAL VIEW posexplode(CAT_ID) eCAT_ID AS seqc, eCAT_ID LATERAL VIEW posexplode(QTY) eQTY AS seqq, eDateReported WHERE seqp = seqc AND seqc = seqq;
You can use the numeric_range
and array_index
UDFs from Brickhouse ( http://github.com/klout/brickhouse ) to solve this problem. There is an informative blog posting describing in detail over at http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/
Using those UDFs, the query would be something like
select cookie, array_index( product_id_arr, n ) as product_id, array_index( catalog_id_arr, n ) as catalog_id, array_index( qty_id_arr, n ) as qty from table lateral view numeric_range( size( product_id_arr )) n1 as n;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With