Query Chinese characters (UTF-8) in Google BigQuery

Question

I want to select the titles that contain Chinese characters (e.g. 數學) from my Google dataset. I have tried several methods, described below.

Google BigQuery only has a LENGTH() function; it has no DATALENGTH() function, so I cannot compare a string's character length with its data size.

Then I tried REGEXP_MATCH() with the pattern '[\u4e00-\u9fa5]' to match Chinese characters, but that doesn't work either.

I can't figure out whether there is another way to solve this problem. Please help, thank you.

Michael Manoochehri · Accepted Answer

BigQuery's LENGTH function currently has a bug: it returns an incorrect STRING length for characters outside the ASCII range. See https://code.google.com/p/google-bigquery/issues/detail?id=109

Possible workaround: if you only need an accurate LENGTH count, you can use the REGEXP_REPLACE function to replace every character with a single ASCII character (such as '_') and take the LENGTH of the result:

SELECT '數學',
       LENGTH(REGEXP_REPLACE('數學', r'.', '_')) AS correct,
       LENGTH('數學') AS incorrect;
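
For comparison, here is a minimal Python sketch of the two checks the question describes, in case filtering the titles after export is an option. It is not BigQuery code: the helper names (contains_chinese, has_multibyte) and the sample titles are made up for illustration, and the character range is the one from the question. One possibly relevant detail, stated as an assumption rather than a verified fix: BigQuery's regular expressions are based on RE2, which as far as I know writes code points as \x{4e00} rather than \u4e00, so the \u escape may be why the original pattern did not match.

```python
import re

# Range copied from the question; it covers the common CJK Unified Ideographs.
CHINESE_RE = re.compile("[\u4e00-\u9fa5]")


def contains_chinese(title: str) -> bool:
    """Return True if the title has at least one character in the range."""
    return CHINESE_RE.search(title) is not None


def has_multibyte(title: str) -> bool:
    """The LENGTH vs DATALENGTH idea: character count differs from UTF-8 byte count."""
    return len(title) != len(title.encode("utf-8"))


# Hypothetical sample data standing in for the title column.
titles = ["數學", "Algebra", "數學 101", "café"]
chinese_titles = [t for t in titles if contains_chinese(t)]
print(chinese_titles)
```

Note that the length comparison is the weaker test: has_multibyte("café") is also true, so it flags any non-ASCII text, whereas the range match only flags titles with characters in the stated range.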

Tags:

sql

google-bigquery

user3833974
