Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Postgresql constraint to check for non-ascii characters

I have a Postgresql 9.3 database that is encoded 'UTF8'. However, there is a column in database that should never contain anything but ASCII. And if non-ascii gets in there, it causes a problem in another system that I have no control over. Therefore, I want to add a constraint to the column. Note: I already have a BEFORE INSERT trigger - so that might be a good place to do the check.

What's the best way to accomplish this in PostgreSQL?

like image 998
David S Avatar asked Sep 19 '14 01:09

David S


2 Answers

Use a CHECK constraint built around a regular expression.

Assuming that you mean a certain column should never contain anything but the lowercase letters from a to z, the uppercase letters from A to Z, and the numbers 0 through 9, something like this should work.

alter table your_table
add constraint allow_ascii_only
check (your_column ~ '^[a-zA-Z0-9]+$');

This is what people usually mean when they talk about "only ASCII" with respect to database columns, but ASCII also includes glyphs for punctuation, arithmetic operators, etc. Characters you want to allow go between the square brackets.

like image 40
Mike Sherrill 'Cat Recall' Avatar answered Sep 22 '22 23:09

Mike Sherrill 'Cat Recall'


You can define ASCII as ordinal 1 to 127 for this purpose, so the following query will identify a string with "non-ascii" values:

SELECT exists(SELECT 1 from regexp_split_to_table('abcdéfg','') x where ascii(x) not between 1 and 127);

but it's not likely to be super-efficient, and the use of subqueries would force you to do it in a trigger rather than a CHECK constraint.

Instead I'd use a regular expression. If you want all printable characters then you can use a range in a check constraint, like:

CHECK (my_column ~ '^[ -~]*$')

this will match everything from the space to the tilde, which is the printable ASCII range.

If you want all ASCII, printable and nonprintable, you can use byte escapes:

CHECK (my_column ~ '^[\x00-\x7F]*$')

The most strictly correct approach would be to convert_to(my_string, 'ascii') and let an exception be raised if it fails ... but PostgreSQL doesn't offer an ascii (i.e. 7-bit) encoding, so that approach isn't possible.

like image 196
Craig Ringer Avatar answered Sep 21 '22 23:09

Craig Ringer