Remove Unicode characters while querying in Hive

Question

I want to clean Unicode characters out of the data in a Hive table. The following is the data:

select ('http://10.0.0.1/ï¿½ï¿½ï¿½mï¿½ï¿½vï¿½ï¿½ï¿½ï¿½ï¿½ï¿½)ï¿½aï¿½^ï¿½ï¿½ï¿½ï¿½ï¿½kn:4ï¿½+9xï¿½2cï¿½ï¿½mï¿½{ï¿½ï¿½')

My required output is to detect any Unicode characters in my column and remove them. The output here should be:

http://10.0.0.1/

or completely null; either is fine. If a row contains any Unicode character, it is acceptable to make the whole value null.

The following are my attempts:

 select REGEXP_REPLACE('http://10.0.0.1/ï¿½ï¿½ï¿½mï¿½ï¿½vï¿½ï¿½ï¿½ï¿½ï¿½ï¿½)ï¿½aï¿½^ï¿½ï¿½ï¿½ï¿½ï¿½kn:4ï¿½+9xï¿½2cï¿½ï¿½mï¿½{ï¿½ï¿½', '\[[:xdigit:]]{4}', '')

and

 select REGEXP_REPLACE('http://10.0.0.1/ï¿½ï¿½ï¿½mï¿½ï¿½vï¿½ï¿½ï¿½ï¿½ï¿½ï¿½)ï¿½aï¿½^ï¿½ï¿½ï¿½ï¿½ï¿½kn:4ï¿½+9xï¿½2cï¿½ï¿½mï¿½{ï¿½ï¿½', '[||chr(128)||'-'||chr(255)||]', '')

Executed as Single statement.  Failed [40000 : 42000] Error while compiling statement: FAILED: ParseException line 1:193 mismatched input '<EOF>' expecting ) near ')' in function specification 
Elapsed time = 00:00:00.220 
 
STATEMENT 1: SELECT Statement failed.

Can somebody help me in cleaning these in my table?

Cases where it is working:

select REGEXP_REPLACE('"http://r.rxthdr.com/w?i=sï¿½Fï¿½""ï¿½HY|ï¿½Kï¿½>ï¿½0ï¿½ï¿½ï¿½ï¿½Dï¿½ï¿½ï¿½ï¿½W8ë¤’ï¿½O0ï¿½Qï¿½Dï¿½1ï¿½ï¿½Vc~ï¿½j[Qï¿½ï¿½fï¿½ï¿½{uï¿½Beï¿½S>nï¿½ï¿½ï¿½Òï¿½ï¿½ï¿½&ï¿½ï¿½F9ï¿½ï¿½ï¿½Cï¿½iï¿½ï¿½8:Ú”ï¿½_@ÄªOï¿½ï¿½K?ï¿½Ä’cï¿½6ï¿½ï¿½=ï¿½ï¿½v[ï¿½ï¿½ï¿½ï¿½ï¿½Dï¿½$%ï¿½ï¿½:ï¿½aï¿½40Ý©ï¿½&Oï¿½ï¿½Kï¿½ï¿½""ï¿½0ï¿½a<xï¿½ï¿½TcXï¿½ï¿½ï¿½bï¿½ï¿½TNï¿½}ï¿½xï¿½oï¿½ï¿½UY$Kï¿½Iï¿½Õ•""ï¿½ï¿½(+ï¿½Mï¿½ï¿½ï¿½Eï¿½=Kï¿½Aï¿½Iï¿½Aï¿½ï¿½ï¿½q#lï¿½(ï¿½ytï¿½5ï¿½ï¿½h}ï¿½ï¿½~[ï¿½ï¿½YOAï¿½ï¿½Gï¿½=ïˆï¿½{ï¿½ï¿½ï¿½. ï¿½Qï¿½ï¿½ï¿½Ø;x=ï¿½sï¿½0:ï¿½', '(?s).*\P{ASCII}.*', '')

Cases where it is not working:

 select REGEXP_REPLACE('c4k0j,}W""d+2|4y0hkCkRh+.{pq80{?X8O>b<:ph.3!{T', '(?s).*\P{ASCII}.*', '')

 select REGEXP_REPLACE('z|""},}69]6N2|c_;5.su={IU+|8ubq1<r$!Xxy#?Bhkv20:jXNgRh+5fwj:ndfWBJ}e)>','(?s).*\P{ASCII}.*', '')

The first one in the image has a Unicode character, but when pasted it becomes a dot.


Wiktor Stribiżew · Accepted Answer

You may use

select REGEXP_REPLACE(YOUR_STRING_HERE, '\P{ASCII}.*', '')

It removes everything from the first non-ASCII char to the end of the string.

Hive uses Java regular expressions, which support \p{...} character classes, and \p{ASCII} matches any ASCII char. The opposite class is formed by turning p to upper case, so \P{ASCII} matches any char that is not ASCII. .* matches zero or more chars, as many as possible, as * is a greedy quantifier.
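To illustrate the pattern on more than one value, here is a minimal sketch. The inline rows and the names sample, url and url_clean are made up for illustration and are not from the question. It also assumes the client passes the backslash through to the regex engine unchanged; on some Hive versions or clients the literal may need to be written as '\\P{ASCII}.*' instead.

```sql
-- Hypothetical inline data: one value with non-ASCII characters, one plain ASCII value.
WITH sample AS (
  SELECT 'http://10.0.0.1/é§x' AS url
  UNION ALL
  SELECT 'http://10.0.0.1/index.html' AS url
)
SELECT
  url,
  -- \P{ASCII} finds the first non-ASCII char; the greedy .* consumes the rest of the line.
  -- A value with no non-ASCII char does not match and is returned unchanged.
  regexp_replace(url, '\P{ASCII}.*', '') AS url_clean
FROM sample;
```

A value that is entirely ASCII is left untouched because the pattern never matches, so an unchanged result for such a value is not a failure of the expression.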

Note that . does not match line breaks by default. If the text to remove may contain line breaks, add (?s) at the start of the pattern so that . matches them too:

'(?s)\P{ASCII}.*'
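The question also accepts returning null for any value that contains a non-ASCII character. One possible sketch of that variant uses RLIKE, which is true when any substring matches the pattern. The inline rows and the names sample, url and url_or_null are hypothetical, and the same backslash caveat applies (some setups may need '\\P{ASCII}').

```sql
-- Hypothetical inline data, as a stand-in for the real table.
WITH sample AS (
  SELECT 'http://10.0.0.1/é§x' AS url
  UNION ALL
  SELECT 'http://10.0.0.1/index.html' AS url
)
SELECT
  url,
  -- Null out the whole value if it contains at least one non-ASCII char.
  CASE WHEN url RLIKE '\P{ASCII}' THEN NULL ELSE url END AS url_or_null
FROM sample;
```

Because RLIKE only needs to find one matching character, no .* or (?s) is required in this variant.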
