Regex character count, but some count for three

I'm trying to build a regular expression that places a limit on the input length, but not all characters count equally toward this length. I'll put the rationale at the bottom of the question. As a simple example, let's limit the maximum length to 12 and allow only a and b, where each a counts for 1 character and each b counts for 3.

Allowed are:

  • aa (anything less than 12 is fine).
  • aaaaaaaaaaaa (exactly 12 is fine).
  • aaabaaab (6 + 2 * 3 = 12, which is fine).
  • abaaaaab (still 6 + 2 * 3 = 12).

Disallowed is:

  • aaaaaaaaaaaaa (13 a's).
  • bbbba (1 + 4 * 3 = 13, which is too much).
  • baaaaaaab (7 + 2 * 3 = 13, which is too much).
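To make the counting rule concrete, here is a minimal sketch of it as plain JavaScript rather than as a regular expression. The names `weightedLength`, `isAllowed`, and `LIMIT` are made up for illustration; this only restates the examples above and is not a solution to the regex problem itself.

```javascript
// Hypothetical reference check for the simplified example:
// "a" costs 1, "b" costs 3, and the total may not exceed 12.
const LIMIT = 12;

function weightedLength(s) {
  let total = 0;
  for (const ch of s) {
    total += ch === "b" ? 3 : 1;
  }
  return total;
}

function isAllowed(s) {
  // Only a and b are permitted at all in the simplified example.
  return /^[ab]*$/.test(s) && weightedLength(s) <= LIMIT;
}

const examples = [
  "aa", "aaaaaaaaaaaa", "aaabaaab", "abaaaaab", // allowed
  "aaaaaaaaaaaaa", "bbbba", "baaaaaaab",        // disallowed
];
for (const s of examples) {
  console.log(s, weightedLength(s), isAllowed(s));
}
```

A function like this is handy as a ground truth to compare candidate regular expressions against.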

I've made an attempt that gets fairly close:

^(a{0,3}|b){0,4}$

This matches on up to 4 clusters that may consist of 0-3 a's or one b.

However, it fails to match my last positive example, abaaaaab. The single a at the beginning has to be the first cluster and the b that follows consumes a second one, which leaves only 2 clusters for the rest, aaaaab, and that needs 3.
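The behaviour of the attempt can be reproduced with a short script. The `attempt` and `cases` names are only for illustration. Because every cluster weighs at most 3, four clusters can never exceed 12, so the attempt never accepts an over-long string; it only rejects some valid ones.

```javascript
// The attempted pattern from the question: up to 4 clusters,
// each being 0-3 a's or a single b.
const attempt = /^(a{0,3}|b){0,4}$/;

// Each entry pairs an input with whether it *should* be accepted.
const cases = [
  ["aa", true],
  ["aaaaaaaaaaaa", true],
  ["aaabaaab", true],
  ["abaaaaab", true],       // valid, but needs 5 clusters: a | b | aaa | aa | b
  ["aaaaaaaaaaaaa", false],
  ["bbbba", false],
  ["baaaaaaab", false],
];

for (const [input, expected] of cases) {
  const actual = attempt.test(input);
  const verdict = actual === expected ? "ok" : "WRONG";
  console.log(input, "expected", expected, "got", actual, verdict);
}
```

Only abaaaaab is reported as wrong: the a's are split into runs of 1 and 5, and covering them takes three clusters on top of the two needed for the b's.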

Constraints

  • Must run in JavaScript. This regex is supplied to Qt, which apparently uses JavaScript's syntax.
  • Doesn't really need to be fast. In the end it'll only be applied to strings of up to 40 characters. I hope it validates within 50ms or so, but slightly slower is acceptable.

Rationale

Why do I need to do this with a regular expression?

It's for a user interface in Qt via PyQt and QML. The user can type a profile name in a text field. This profile name is URL-encoded (special characters are replaced by %XX) and then used as a file name on the user's file system. We run into problems when the user types a lot of special characters, such as Chinese characters, which then encode to a very long file name. It turns out that at around 17 such characters, the file name becomes too long for some file systems. The URL-encoding works on the UTF-8 bytes, and UTF-8 uses up to 4 bytes per character, so a single typed character can take up to 12 characters in the file name (each byte is percent-encoded as %XX, which is 3 characters).
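The growth per character can be illustrated with JavaScript's built-in `encodeURIComponent`, used here only as a stand-in: the URL-encoding function the application actually uses may leave a different set of ASCII characters unescaped. The `encodedLength` helper is a made-up name for this sketch.

```javascript
// Hypothetical helper: length of a name after percent-encoding.
// encodeURIComponent encodes each UTF-8 byte of a non-ASCII
// character as %XX, i.e. 3 output characters per byte.
function encodedLength(name) {
  return encodeURIComponent(name).length;
}

const samples = [
  "a",          // 1 byte in UTF-8, left as-is
  "\u00e9",     // é, 2 bytes in UTF-8
  "\u4e2d",     // 中, 3 bytes in UTF-8
  "\u{1F600}",  // 😀, 4 bytes in UTF-8
];

for (const ch of samples) {
  console.log(ch, encodeURIComponent(ch), encodedLength(ch));
}
```

The four samples come out at 1, 6, 9, and 12 characters, which is why a limit has to weigh characters differently instead of counting them.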

A fixed limit of 16 characters is too short for profile names. Even some of our default names exceed that. We need a variable limit that depends on how many of these special characters are used.

Qt normally allows you to specify a Validator to determine which values are acceptable in a text box. We tried implementing such a validator, but that resulted in a segfault upstream, due to a bug in PyQt. It can't seem to handle custom Validator implementations at the moment. However, PyQt also exposes three built-in validators. Two apply only to numbers. The third is a regex validator, which takes a regular expression that must match all valid strings. Hence the need for this regular expression.

