
Can my Regex be improved?

Yes, another Regex question. You're welcome ;-P

This is the first time I've written my own regex for some simple string validation in C#. I think I've got it working but as a learning exercise I was wondering if it could be improved and whether I have made any mistakes.

The strings will all look something like this:

T20160307.0001

Rules:

  • Begin with the letter T.
  • Date in the format YYYYMMDD.
  • A full stop.
  • Last 4 characters are always numeric. There should be exactly 4.

Here is my regex (fiddle):

^(?i)[T]20[0-9]{2}[0-1][0-9][0-3][0-9].\d{4}$

  • ^ Assert the start of the string.
  • (?i)[T] Check that we have a letter T, case insensitive.
  • 20 YYYY begins with 20 (I'll be dead by 2100 so I don't care about anything further :-P)
  • [0-9]{2} Any number between 0 and 99 for second part of YYYY.
  • [0-1][0-9] 0 or 1 for first part of month, 0-9 for second part of month.
  • [0-3][0-9] 0-3 for first part of day, 0-9 for second part of day.
  • . Full stop.
  • \d{4} 4 numerical characters.
  • $ Assert end of string.

One pitfall I can already see is date validation. 20161935 (the 35th day of the 19th month) is considered valid. I've read some other posts about achieving this, which I believe match on number ranges, but I was unable to understand the format.

I would accept an answer that simply solved the date issue if someone would be kind enough to ELI5 how this works, but other improvements would be a welcome bonus.

Edit: To avoid further confusion I should state that I know about DateTime.TryParse etc. As mentioned, I'm using this as an opportunity to learn Regex and felt this was a good starting point. Sorry to anyone whose time I wasted; I should have made this clear in the original post.

Equalsk asked Feb 07 '23 13:02


1 Answer

The things you can do are:

  • avoid the \d character class, which in .NET matches every Unicode decimal digit by default (you only need the ASCII digits, so write [0-9])
  • instead of [0-1] you can write [01]
  • escape the dot so that it matches a literal dot (and not any character)
  • there is no need to put T in a character class if it is the only character
  • optionally, you can remove the inline modifier and use [Tt] in place of T


^(?i)T20[0-9]{2}[01][0-9][0-3][0-9]\.[0-9]{4}$

or

^[Tt]20[0-9]{2}[01][0-9][0-3][0-9]\.[0-9]{4}$
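
To see what these changes buy you, here is a minimal C# sketch that runs the question's pattern and the second revision above against a few inputs. The class name PatternCheck and the sample strings are illustrative assumptions, not part of the original post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCheck   // hypothetical name, for illustration only
{
    // Pattern from the question: unescaped dot, \d for the last four digits.
    public const string Original = @"^(?i)[T]20[0-9]{2}[0-1][0-9][0-3][0-9].\d{4}$";

    // Revised pattern: literal dot, ASCII-only digits, [Tt] instead of (?i).
    public const string Revised = @"^[Tt]20[0-9]{2}[01][0-9][0-3][0-9]\.[0-9]{4}$";

    public static bool MatchesOriginal(string s) => Regex.IsMatch(s, Original);

    public static bool MatchesRevised(string s) => Regex.IsMatch(s, Revised);

    public static void Main()
    {
        string[] samples =
        {
            "T20160307.0001",                     // well-formed
            "t20160307.0001",                     // lowercase t
            "T20160307X0001",                     // X where the full stop should be
            "T20160307.\u0661\u0662\u0663\u0664", // Arabic-Indic digits after the dot
        };

        foreach (string s in samples)
        {
            Console.WriteLine(
                $"{s}: original={MatchesOriginal(s)}, revised={MatchesRevised(s)}");
        }
    }
}
```

The first two samples match both patterns. The last two match only the original: the unescaped dot accepts the X, and \d accepts non-ASCII digits.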

Another point: do you really need extra checks on the date inside the regex, given that a regex can't practically tell you whether the date actually exists? (Think for a minute about leap years.) So why not:

^(?i)T(20[0-9]{6})\.[0-9]{4}$

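and if you want to know whether the date really exists, capture it and test it with the DateTime.TryParse method (or DateTime.TryParseExact with the format yyyyMMdd, since the captured date has no separators).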
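
A rough sketch of that capture-then-validate idea follows. The helper name TicketId.IsValid is hypothetical, and the code assumes the date part is always yyyyMMdd, which is why it uses DateTime.TryParseExact with the invariant culture instead of plain DateTime.TryParse.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class TicketId   // hypothetical name, for illustration only
{
    // Shape check only: T (any case), eight digits starting with 20,
    // a literal dot, then exactly four ASCII digits.
    private static readonly Regex Shape =
        new Regex(@"^(?i)T(20[0-9]{6})\.[0-9]{4}$");

    public static bool IsValid(string input)
    {
        if (input == null) return false;

        Match m = Shape.Match(input);
        if (!m.Success) return false;

        // Group 1 holds the eight date digits. Let the framework decide
        // whether they form a real calendar date (leap years included).
        return DateTime.TryParseExact(
            m.Groups[1].Value,
            "yyyyMMdd",                   // assumed fixed format
            CultureInfo.InvariantCulture,
            DateTimeStyles.None,
            out _);
    }

    public static void Main()
    {
        string[] samples =
        {
            "T20160307.0001", // ordinary date
            "T20160229.0001", // 29 February in a leap year
            "T20170229.0001", // 29 February in a non-leap year
            "T20161935.0001", // 35th day of the 19th month
        };

        foreach (string s in samples)
        {
            Console.WriteLine($"{s}: {IsValid(s)}");
        }
    }
}
```

The regex stays short and only checks the shape. The calendar rules, which are awkward to express as character ranges, are left to the date parser.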

Casimir et Hippolyte answered Feb 12 '23 02:02