I have column where its values delimited by pipe operator in hive table, which is highly unstructured, and I would like to split this column that delimited by pipe and get corresponding columns which is more than many apparently. So, I need to first parse the string start with HP1 then locate string between 30-40th pipe. I tried multiple of doing this, but not able to parse it correctly. Can anyone suggest possible workaround to do this in hive sql?
sample data and my current attempt
Here I pasted top 3 rows of my original data.
SSH|^~\&|EnsembleMK9|ISC|DELTA HIE^1.2.3.411593.135778^DEL|TESTMD|202209190035||ACK^A08|1214451793^137099156|P|2.5.1MSA|AA|1214451793^137099156
SSH|^~\&|DELTA HIE^1.2.3.411593.135778^DEL|PCCMM|LEAG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|HIVEDBAKG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|20220919053530.832||AKG^A08^AKG_A01|C1016708390|P|2.5.1|||NE||||||@SSH.3^NXG^2.16.840.1.113883.3.2966.1000.1005.152.4.264^[email protected]^PCCMM^2.16.840.1.113883.3.2966.100.0.0.255.124^DELEVN||20220919013529PID|1||5624118^^^PCCMM&2.16.840.1.113883.3.8932.101.2&DEL^MR||Anderson^Donald||19440711|M||2106-3^White|30 Pine Woods Road^^Hyde Park^NY^12538^USA||^PRS^PH^^1^845^2292777||^English|M||||||^Non-Hispanic||||||||N|||||||||PD1||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^^&&&&&&&&&PCP||||||||N|20220310ROL||AD|RCP|2022062001^^^^^^^^2.16.840.1.113883.4.6^^^^KOR^^^^^^^^HP|1|O|PKCL||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^||||||||||||73828528~86654484|||||||||||||||||||||||||202209190135
SSH|^~\&|PC^1.2.3.411593.135778^DEL|PC|LEAG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|HIVEDBAKG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|20220919053530.832||AKG^A08^AKG_A01|C1016708390|P|2.5.1|||NE||||||@SSH.3^NXG^2.16.840.1.113883.3.2966.1000.1005.152.4.264^[email protected]^PCCMM^2.16.840.1.113883.3.2966.100.0.0.255.124^DELEVN||20220919013529PID|1||5624118^^^PCCMM&2.16.840.1.113883.3.8932.101.2&DEL^MR||Anderson^Donald||19440711|M||2106-3^White|30 Pine Woods Road^^Hyde Park^NY^12538^USA||^PRS^PH^^1^845^2292777||^English|M||||||^Non-Hispanic||||||||N|||||||||PD1||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^^&&&&&&&&&PCP||||||||N|20220310ROL||AD|RCP|2022062001^^^^^^^^2.16.840.1.113883.4.6^^^^KOR^^^^^^^^HP|1|O|PKCL||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^||||||||||||73828528~86654484||||||||||||||||||||PCCMM|||||202209190135
here is my first attempt:
based on this SO post, I tried following:
WITH x AS (
SELECT colm
, CharIndex('|', colm) As first_pipe
FROM mytbl
)
, y AS (
SELECT colm
, first_pipe
, CharIndex('|', colm, first_pipe + 1) As second_pipe
, SubString(colm, 0, first_pipe) As first_element
FROM x
)
, z AS (
SELECT colm
, first_pipe
, second_pipe
, first_element
, SubString(colm, first_pipe + 1, second_pipe - first_pipe - 1) As second_element
, SubString(colm, second_pipe + 1, LENGTH(colm) - second_pipe) As third_element
FROM y
)
SELECT *
FROM z
limit 10
but I have error because CharIndex doesn't work for hive sql.
Here is my another attempt:
SELECT
colm,
INSTR(colm, 'HP1') AS HP1_position, -- Find the position of "HP1" in the string
SUBSTR(colm, INSTR(colm, 'HP1')) AS substring_after_HP1, -- Extract the substring after "HP1" until the end
SPLIT(SUBSTR(colm, INSTR(colm, 'HP1')), '\|') AS pipe_split, -- Split the substring by the pipe "|" delimiter
pipe_split[39] AS extracted_value -- Extract the 40th element from the split data
FROM
mytbl
this query also don't work, because pipe_split[39] give me error.
Is there any workaround to pipe delimited column into new list of columns correctly in hive sql? can anyone suggest possible approach on that? Do I need to write dynamic sql instaed? if so, how? any ideas?
objective
I want to split pipe delimited column into new columns. How can I do this in hive sql?
You might want to try FUNCTION, I have created string_split wrapper to split data based on delimiter and return the values in separated column.
CREATE FUNCTION fn_Split50
(
@str varchar(max),
@delim char(1),
@columnCnt int = 125
)
RETURNS TABLE
AS
RETURN
(
SELECT *
FROM (SELECT
nn = (nn - 1) / @columnCnt + 1,
nnn = 'value' + cast(((nn - 1) % @columnCnt) + 1 as varchar(10)),
value
FROM (SELECT
nn = ROW_NUMBER() over (order by (select null)),
value
FROM string_split(@str, @delim) aa
) aa
where nn > 0
) bb
PIVOT
(
max(value)
FOR nnn IN (
value1, value2, value3, value4, value5, value6, value7, value8, value9, value10,
value11, value12, value13, value14, value15, value16, value17, value18, value19, value20,
value21, value22, value23, value24, value25, value26, value27, value28, value29, value30,
value31, value32, value33, value34, value35, value36, value37, value38, value39, value40,
value41, value42, value43, value44, value45, value46, value47, value48, value49, value50,
value51, value52, value53, value54, value55, value56, value57, value58, value59, value60,
value61, value62, value63, value64, value65, value66, value67, value68, value69, value70,
value71, value72, value73, value74, value75, value76, value77, value78, value79, value80,
value81, value82, value83, value84, value85, value86, value87, value88, value89, value90,
value91, value92, value93, value94, value95, value96, value97, value98, value99, value100,
value101, value102, value103, value104, value105, value106, value107, value108, value109, value110,
value111, value112, value113, value114, value115, value116, value117, value118, value119, value120,
value121, value122, value123, value124, value125
)
) AS PivotTable
);
And you can call it this way:
select * from fn_split50('SSH|^~\&|DELTA HIE^1.2.3.411593.135778^DEL|PCCMM', '|', DEFAULT);
First parameter is the string value Second parameter is the delimiter Third parameter is how many columns you want to be pre-provided, if there are more values then the provided value will wrap it to the next row.
You may test it out at this SQLFiddle link.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With