I want to query for titles that contain Chinese characters (e.g. 數學) in my Google BigQuery dataset, and I have tried several methods, described below.
Google BigQuery only has a LENGTH() function; it doesn't have DATALENGTH(), so I can't compare a string's character length against its size in bytes.
I then tried REGEXP_MATCH() with '[\u4e00-\u9fa5]' to match Chinese characters, but that doesn't work either.
I can't figure out whether there are other ways to solve this problem. Please help, thank you.
BigQuery's LENGTH function currently has a bug: it returns an incorrect STRING length for characters that fall outside the ASCII range: https://code.google.com/p/google-bigquery/issues/detail?id=109
Possible workaround: if you just need an accurate LENGTH count, you could use the REGEXP_REPLACE function to replace each character with an arbitrary ASCII character (such as '_') and take the length of the result:
SELECT '數學',
LENGTH(REGEXP_REPLACE('數學', r'.', '_')) as correct,
LENGTH('數學') as incorrect;
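
To make the logic behind the approaches in this thread concrete, here is a minimal sketch in Python rather than BigQuery SQL. It only illustrates the ideas (matching the range from the question, comparing character count with byte count, and counting characters by replacement); the function names are hypothetical and nothing here is a BigQuery API. Note that the byte-count comparison flags any non-ASCII character, not only Chinese ones.

```python
import re

# The range quoted in the question: U+4E00 to U+9FA5 (CJK ideographs).
CJK_PATTERN = re.compile(r"[\u4e00-\u9fa5]")


def contains_chinese(title: str) -> bool:
    """Return True if the title has at least one character in the range above."""
    return CJK_PATTERN.search(title) is not None


def has_multibyte(title: str) -> bool:
    """The LENGTH vs. DATALENGTH idea: character count differs from UTF-8 byte count."""
    return len(title) != len(title.encode("utf-8"))


def char_count_via_replace(title: str) -> int:
    """Mirror the workaround: replace every character with '_' and count the result."""
    return len(re.sub(r".", "_", title, flags=re.DOTALL))


# Hypothetical sample titles, for illustration only.
titles = ["數學", "Math 101", "線性代數 (Linear Algebra)"]
chinese_titles = [t for t in titles if contains_chinese(t)]
```

Filtering with the explicit character range is the more precise of the two tests, since the byte-count comparison would also match titles containing accented Latin letters or other non-ASCII symbols.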