
Query Chinese characters (UTF-8) in Google BigQuery

I want to select the titles that contain Chinese characters (e.g. 數學) from my Google BigQuery dataset. I have tried several approaches, described below.

BigQuery only has a LENGTH() function; it has no DATALENGTH(), so I can't compare a string's character length against its size in bytes.

Then I tried REGEXP_MATCH() with the pattern '[\u4e00-\u9fa5]' to match Chinese characters, but that doesn't work either.

Is there another way to solve this problem? Please help, thank you.

user3833974 asked Dec 09 '25 07:12



1 Answer

BigQuery's LENGTH function currently has a bug: it returns an incorrect STRING length for characters that fall outside the ASCII range. See https://code.google.com/p/google-bigquery/issues/detail?id=109

Possible workaround: if you just need an accurate LENGTH count, you can use the REGEXP_REPLACE function to replace each character with a single ASCII character (such as '_') and count the result:

SELECT '數學', 
        LENGTH(REGEXP_REPLACE('數學', r'.', '_')) as correct, 
        LENGTH('數學') as incorrect;
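
For the original goal of selecting titles that contain Chinese characters, here is a minimal sketch. It assumes BigQuery standard SQL rather than the legacy SQL functions used above, and the table `mydataset.mytable` and column `title` are hypothetical names. BigQuery regular expressions follow RE2 syntax, which as far as I know does not accept the `\uXXXX` escape; that would explain why '[\u4e00-\u9fa5]' did not match. RE2 does accept the Unicode script class `\p{Han}` and the `\x{...}` hex form.

```sql
-- Sketch only: standard SQL, with hypothetical table and column names.
SELECT
  title,
  CHAR_LENGTH(title) AS char_count,  -- number of characters
  BYTE_LENGTH(title) AS byte_count   -- number of UTF-8 bytes
FROM `mydataset.mytable`
-- Keep rows whose title has at least one Han (Chinese) character.
WHERE REGEXP_CONTAINS(title, r'\p{Han}');
```

If you would rather keep the explicit range from the question, the pattern r'[\x{4e00}-\x{9fa5}]' should be the RE2 equivalent. Comparing char_count with byte_count is also the closest analogue to the LENGTH()/DATALENGTH() comparison mentioned in the question.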
Michael Manoochehri answered Dec 12 '25 07:12



