Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to split pipe delimited column into new columns in hive sql?

Tags:

sql

parsing

I have column where its values delimited by pipe operator in hive table, which is highly unstructured, and I would like to split this column that delimited by pipe and get corresponding columns which is more than many apparently. So, I need to first parse the string start with HP1 then locate string between 30-40th pipe. I tried multiple of doing this, but not able to parse it correctly. Can anyone suggest possible workaround to do this in hive sql?

sample data and my current attempt

Here I pasted top 3 rows of my original data.

SSH|^~\&|EnsembleMK9|ISC|DELTA HIE^1.2.3.411593.135778^DEL|TESTMD|202209190035||ACK^A08|1214451793^137099156|P|2.5.1MSA|AA|1214451793^137099156
SSH|^~\&|DELTA HIE^1.2.3.411593.135778^DEL|PCCMM|LEAG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|HIVEDBAKG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|20220919053530.832||AKG^A08^AKG_A01|C1016708390|P|2.5.1|||NE||||||@SSH.3^NXG^2.16.840.1.113883.3.2966.1000.1005.152.4.264^[email protected]^PCCMM^2.16.840.1.113883.3.2966.100.0.0.255.124^DELEVN||20220919013529PID|1||5624118^^^PCCMM&2.16.840.1.113883.3.8932.101.2&DEL^MR||Anderson^Donald||19440711|M||2106-3^White|30 Pine Woods Road^^Hyde Park^NY^12538^USA||^PRS^PH^^1^845^2292777||^English|M||||||^Non-Hispanic||||||||N|||||||||PD1||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^^&&&&&&&&&PCP||||||||N|20220310ROL||AD|RCP|2022062001^^^^^^^^2.16.840.1.113883.4.6^^^^KOR^^^^^^^^HP|1|O|PKCL||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^||||||||||||73828528~86654484|||||||||||||||||||||||||202209190135
SSH|^~\&|PC^1.2.3.411593.135778^DEL|PC|LEAG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|HIVEDBAKG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|20220919053530.832||AKG^A08^AKG_A01|C1016708390|P|2.5.1|||NE||||||@SSH.3^NXG^2.16.840.1.113883.3.2966.1000.1005.152.4.264^[email protected]^PCCMM^2.16.840.1.113883.3.2966.100.0.0.255.124^DELEVN||20220919013529PID|1||5624118^^^PCCMM&2.16.840.1.113883.3.8932.101.2&DEL^MR||Anderson^Donald||19440711|M||2106-3^White|30 Pine Woods Road^^Hyde Park^NY^12538^USA||^PRS^PH^^1^845^2292777||^English|M||||||^Non-Hispanic||||||||N|||||||||PD1||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^^&&&&&&&&&PCP||||||||N|20220310ROL||AD|RCP|2022062001^^^^^^^^2.16.840.1.113883.4.6^^^^KOR^^^^^^^^HP|1|O|PKCL||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^||||||||||||73828528~86654484||||||||||||||||||||PCCMM|||||202209190135

here is my first attempt:

based on this SO post, I tried following:

WITH x AS (
  SELECT colm
       , CharIndex('|', colm) As first_pipe
  FROM   mytbl
)
, y AS (
  SELECT colm
       , first_pipe
       , CharIndex('|', colm, first_pipe + 1) As second_pipe
       , SubString(colm, 0, first_pipe) As first_element
  FROM   x
)
, z AS (
  SELECT colm
       , first_pipe
       , second_pipe
       , first_element
       , SubString(colm, first_pipe  + 1, second_pipe - first_pipe - 1) As second_element
       , SubString(colm, second_pipe + 1, LENGTH(colm) - second_pipe) As third_element
  FROM   y
)
SELECT *
FROM z
limit 10

but I have error because CharIndex doesn't work for hive sql.

Here is my another attempt:

SELECT
    colm,
    INSTR(colm, 'HP1') AS HP1_position,   -- Find the position of "HP1" in the string
    SUBSTR(colm, INSTR(colm, 'HP1')) AS substring_after_HP1,    -- Extract the substring after "HP1" until the end
    SPLIT(SUBSTR(colm, INSTR(colm, 'HP1')), '\|') AS pipe_split,       -- Split the substring by the pipe "|" delimiter
    pipe_split[39] AS extracted_value       -- Extract the 40th element from the split data
FROM
    mytbl

this query also don't work, because pipe_split[39] give me error.

Is there any workaround to pipe delimited column into new list of columns correctly in hive sql? can anyone suggest possible approach on that? Do I need to write dynamic sql instaed? if so, how? any ideas?

objective

I want to split pipe delimited column into new columns. How can I do this in hive sql?

like image 691
kim Avatar asked Nov 27 '25 13:11

kim


1 Answers

You might want to try FUNCTION, I have created string_split wrapper to split data based on delimiter and return the values in separated column.

CREATE FUNCTION fn_Split50
(   
    @str varchar(max),
    @delim char(1), 
    @columnCnt int = 125
)
RETURNS TABLE 
AS
RETURN 
(
    SELECT * 
    FROM (SELECT 
            nn = (nn - 1) / @columnCnt + 1, 
            nnn = 'value' + cast(((nn - 1) % @columnCnt) + 1 as varchar(10)), 
            value
        FROM (SELECT 
            nn = ROW_NUMBER() over (order by (select null)),
            value
            FROM string_split(@str, @delim) aa
            ) aa
        where nn > 0
    ) bb
    PIVOT  
    (  
    max(value) 
    FOR nnn IN (    
        value1, value2, value3, value4, value5, value6, value7, value8, value9, value10,
        value11, value12, value13, value14, value15, value16, value17, value18, value19, value20,
        value21, value22, value23, value24, value25, value26, value27, value28, value29, value30,
        value31, value32, value33, value34, value35, value36, value37, value38, value39, value40,
        value41, value42, value43, value44, value45, value46, value47, value48, value49, value50,
        value51, value52, value53, value54, value55, value56, value57, value58, value59, value60,
        value61, value62, value63, value64, value65, value66, value67, value68, value69, value70,
        value71, value72, value73, value74, value75, value76, value77, value78, value79, value80,
        value81, value82, value83, value84, value85, value86, value87, value88, value89, value90,
        value91, value92, value93, value94, value95, value96, value97, value98, value99, value100,
        value101, value102, value103, value104, value105, value106, value107, value108, value109, value110,
        value111, value112, value113, value114, value115, value116, value117, value118, value119, value120,
        value121, value122, value123, value124, value125
     )  
    ) AS PivotTable 
);

And you can call it this way:

select * from fn_split50('SSH|^~\&|DELTA HIE^1.2.3.411593.135778^DEL|PCCMM', '|', DEFAULT);

First parameter is the string value Second parameter is the delimiter Third parameter is how many columns you want to be pre-provided, if there are more values then the provided value will wrap it to the next row.

You may test it out at this SQLFiddle link.

like image 196
Raymond Natio Avatar answered Nov 30 '25 04:11

Raymond Natio



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!