
Filtering out or replacing non-English characters in Google BigQuery

I'm extracting data from a query in Google BigQuery. I connect to the Google API from a Python script, execute the query within the script, and write the results to a CSV file. When I run the script against a sample of the data (100 rows), everything looks good. But when I run it against the entire data set, it fails with:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 38: ordinal not in range(128)

I can see that this is a Python error, but it occurs only when the script processes records that contain non-English characters. I faced the same issue in Hive and worked around it with the RLIKE expression below:

  (CASE WHEN FIELD1 not rlike '[^a-zA-Z()\\|\\d\\s\\(_)\\-\\(/):]' THEN FIELD1 ELSE 'data' END) AS FIELD1

Is there a similar method or function in Google BigQuery to find and replace non-English characters? Or can this be handled within the Python script?

Code snippet:

job_id, _results = MY_CLIENT.query("""select FIELD1, FIELD2, FIELD3, FIELD4 FROM TABLE1""", use_legacy_sql=True)
complete, row_count = MY_CLIENT.check_job(job_id)
results = MY_CLIENT.get_query_rows(job_id)
outfile =  open('C:\\Users\\test.csv', 'w')
for row in results:
    for key in row.keys():
        if key == 'FIELD4':
            outfile.write("%s" %str(row[key]))
        else:
            outfile.write("%s," %str(row[key]))
    outfile.write("\n")
outfile.close()  

Thanks in advance for your help!

Jonathan asked Oct 17 '25 06:10


1 Answer

You can use the expression below to remove non-ASCII characters:

REGEXP_REPLACE(field1, r'([^\p{ASCII}]+)', '')

Below is an example you can experiment with to see how it works:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '12 - Table - Стол - test' AS field1 UNION ALL
  SELECT '23 - Table - الطاولة' UNION ALL
  SELECT '34 - Table - שולחן' 
)
SELECT 
  REGEXP_REPLACE(field1, r'([^\p{ASCII}]+)', '') AS ascii_only,
  field1
FROM `project.dataset.table` 

with the result:

Row  ascii_only           field1
1    12 - Table - - test  12 - Table - Стол - test
2    23 - Table -         23 - Table - الطاولة
3    34 - Table -         34 - Table - שולחן
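
The question also asks whether this can be handled on the Python side. The following is only a sketch, and it assumes Python 3 (where str is Unicode); the traceback in the question suggests Python 2, where calling str() on a unicode value is what raises the error. The helper names strip_non_ascii and write_rows and the sample rows are hypothetical and stand in for the rows returned by the query. It shows two options: keep the characters and let the csv module write them, or strip them with a regex that mirrors the REGEXP_REPLACE above.

```python
import csv
import io
import re

# Runs of characters outside the 7-bit ASCII range (code points above 0x7F).
NON_ASCII = re.compile(r'[^\x00-\x7F]+')


def strip_non_ascii(value):
    """Return the text form of `value` with all non-ASCII characters removed."""
    return NON_ASCII.sub('', str(value))


def write_rows(rows, fields, out, ascii_only=False):
    """Write `rows` (a list of dicts) to the text stream `out` as CSV.

    `fields` fixes the column order, so no special case is needed
    for the last column.
    """
    writer = csv.writer(out)
    for row in rows:
        values = [row[name] for name in fields]
        if ascii_only:
            values = [strip_non_ascii(v) for v in values]
        writer.writerow(values)


# Hypothetical sample rows standing in for the query results.
sample = [
    {'FIELD1': 'fran\xe7ais', 'FIELD2': 1},
    {'FIELD1': '12 - Table - Стол - test', 'FIELD2': 2},
]
fields = ['FIELD1', 'FIELD2']

# Option 1: keep the original characters.
kept = io.StringIO()
write_rows(sample, fields, kept)

# Option 2: drop everything that is not ASCII.
stripped = io.StringIO()
write_rows(sample, fields, stripped, ascii_only=True)
```

To write to disk instead of an in-memory buffer, the stream could be opened with open(path, 'w', encoding='utf-8', newline=''), which lets option 1 store the non-English characters as UTF-8 rather than failing on them.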
Mikhail Berlyant answered Oct 19 '25 21:10


