Regex character count, but some count for three

I'm trying to build a regular expression that limits the input length, where not all characters count equally toward that length. I'll put the rationale at the bottom of the question. As a simple example, let's limit the maximum length to 12 and allow only a and b, where a counts as 1 character and b counts as 3.

Allowed are:

  • aa (anything less than 12 is fine).
  • aaaaaaaaaaaa (exactly 12 is fine).
  • aaabaaab (6 + 2 * 3 = 12, which is fine).
  • abaaaaab (still 6 + 2 * 3 = 12).

Disallowed are:

  • aaaaaaaaaaaaa (13 a's).
  • bbbba (1 + 4 * 3 = 13, which is too much).
  • baaaaaaab (7 + 2 * 3 = 13, which is too much).
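To make the counting rule concrete, here is a minimal Python sketch of the check that the regex is supposed to reproduce. The names WEIGHTS, LIMIT, weighted_length and is_allowed are made up for illustration. The real validator has to be a regex, so this only serves as a reference for the expected results.

```python
# Reference check for the simplified rule: 'a' counts 1, 'b' counts 3, limit 12.
# All names here are hypothetical; this is not the regex itself.
WEIGHTS = {"a": 1, "b": 3}
LIMIT = 12


def weighted_length(text):
    """Sum the per-character weights (raises KeyError for other characters)."""
    return sum(WEIGHTS[ch] for ch in text)


def is_allowed(text):
    """True if text uses only 'a'/'b' and its weighted length is within LIMIT."""
    if any(ch not in WEIGHTS for ch in text):
        return False
    return weighted_length(text) <= LIMIT


allowed = ["aa", "aaaaaaaaaaaa", "aaabaaab", "abaaaaab"]
disallowed = ["aaaaaaaaaaaaa", "bbbba", "baaaaaaab"]
print([is_allowed(s) for s in allowed])     # every entry is True
print([is_allowed(s) for s in disallowed])  # every entry is False
```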

I've made an attempt that gets fairly close:

^(a{0,3}|b){0,4}$

This matches up to 4 clusters, each consisting of either 0-3 a's or a single b.

However, it fails to match my last positive example, abaaaaab. The b in second position forces the first cluster to be the single a at the beginning, and the b itself consumes a second cluster. That leaves only 2 clusters for the rest, aaaaab, which needs at least 3 (for example aaa, aa, b).
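The behaviour described above can be reproduced with a short script. This is only a sketch: it uses Python's re module as a stand-in for the JavaScript-flavoured engine (an assumption, although this particular pattern uses only syntax common to both), and the examples table is simply the lists from this question.

```python
import re

# The attempt from the question; Python's re is a stand-in for the real engine.
ATTEMPT = re.compile(r"^(a{0,3}|b){0,4}$")

# Example strings from the question, mapped to whether they should be accepted.
examples = {
    "aa": True,
    "aaaaaaaaaaaa": True,
    "aaabaaab": True,
    "abaaaaab": True,
    "aaaaaaaaaaaaa": False,
    "bbbba": False,
    "baaaaaaab": False,
}

# Collect every example where the regex disagrees with the intended rule.
mismatches = [
    text
    for text, expected in examples.items()
    if (ATTEMPT.match(text) is not None) != expected
]
print(mismatches)  # ['abaaaaab']
```

Only abaaaaab is misclassified: it needs five clusters (a, b, two clusters for aaaaa, b), but the pattern allows four.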

Constraints

  • Must run in JavaScript. This regex is supplied to Qt, which apparently uses JavaScript's syntax.
  • Doesn't really need to be fast. In the end it'll only be applied to strings of up to 40 characters. I hope it validates within 50ms or so, but slightly slower is acceptable.

Rationale

Why do I need to do this with a regular expression?

It's for a user interface in Qt via PyQt and QML. The user can type a name for a profile in a text field. This profile name is URL-encoded (special characters are replaced by %XX) and then saved on the user's file system. We run into problems when the user types a lot of special characters, such as Chinese characters, which then encode to a very long file name. It turns out that at around 17 characters, this file name becomes too long for some file systems. The URL-encoding encodes as UTF-8, which uses up to 4 bytes per character, and each byte is percent-encoded as 3 characters (%XX), so a single typed character can take up to 12 characters in the file name.
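To illustrate how the encoded length grows, here is a small sketch using urllib.parse.quote from the Python standard library. Whether the application uses this exact function, and the safe="" argument, are assumptions on my part; encoded_length is a hypothetical helper name.

```python
from urllib.parse import quote


def encoded_length(name):
    """Length of name after percent-encoding its UTF-8 bytes."""
    # safe="" is an assumption: only unreserved characters stay unencoded.
    return len(quote(name, safe=""))


samples = [
    "a",           # ASCII letter, left as-is
    "\u00e9",      # e with acute accent, 2 bytes in UTF-8
    "\u4e2d",      # a Chinese character, 3 bytes in UTF-8
    "\U0001F600",  # an emoji, 4 bytes in UTF-8
]
for sample in samples:
    # Print only the ASCII-safe encoded form and its length.
    print(quote(sample, safe=""), encoded_length(sample))
```

With these samples the encoded lengths are 1, 6, 9 and 12 characters respectively, which is why a fixed limit on the number of typed characters does not work well.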

A fixed limit of 16 characters is too short for profile names. Even some of our default names exceed that. We need a variable limit that depends on how many of these special characters are used.

Qt normally allows you to specify a Validator to determine which values are acceptable in a text box. We tried implementing such a validator, but that resulted in a segfault upstream, due to a bug in PyQt. It can't seem to handle custom Validator implementations at the moment. However, PyQt also exposes three built-in validators. Two apply only to numbers. The third is a regex validator that takes a regular expression matching all valid strings. Hence the need for this regular expression.