
Need help with Regular Expression for nine digit alphanumeric with minimum one space boundary

Tags: c#, regex

I'm trying to match a CUSIP number. I have the following, but it is missing some edge cases.

\s[A-Za-z0-9]{9}\s

I need to exclude strings that contain a space in the middle, and I need to match strings that may be bordered by other text. My strings are generally surrounded by tabs, but as little as a single space char may separate the CUSIP from other text. Thanks in advance, I'm pretty green with regex. P.S. I'm working in .NET.

Example

"[TAB]123456789[TAB]" should be matched (I'm getting this now)

"sometext[TAB]123456789[TAB]sometext" should be matched (this is not currently being returned)

"some text" should not be returned (I am currently getting this kind of match)

asked May 05 '11 17:05 by kirps


2 Answers

According to this page, not just any 9-character alphanumeric string is a valid CUSIP. The first three characters can only be digits, and the ninth is a check digit. So if you want to distinguish CUSIPs from other 9-character strings, I believe this should work better:

\s[0-9]{3}[a-zA-Z0-9]{6}\s

or, if you also want to match strings that are bordered by the beginning or end of input:

(^|\s)[0-9]{3}[a-zA-Z0-9]{6}(\s|$)

or, if you also want to match strings that are bordered by punctuation (such as "(100ABCDEF)"):

(^|[^a-zA-Z0-9])[0-9]{3}[a-zA-Z0-9]{6}([^a-zA-Z0-9]|$)

I believe that should be a 99% solution, but if you want to be really robust, you might also look into using the 9th (check digit) character to verify that the strings are valid.
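For what it's worth, here is a rough C# sketch of how the third pattern might be used to pull candidates out of a longer line. It is not the exact regex above: it swaps the consuming border groups for lookarounds, so the bordering character is checked but not included in the match, and two candidates separated by a single space are both found. The names `CusipFinder` and `FindCandidates` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CusipFinder
{
    // Lookaround variant of the third pattern (an assumption, not the
    // answer's exact regex): borders are tested but never consumed.
    private static readonly Regex Candidate = new Regex(
        @"(?<![a-zA-Z0-9])[0-9]{3}[a-zA-Z0-9]{6}(?![a-zA-Z0-9])");

    // Returns every 9-character candidate in the input, without its borders.
    public static List<string> FindCandidates(string input)
    {
        var results = new List<string>();
        foreach (Match m in Candidate.Matches(input))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

Called on the question's examples, `CusipFinder.FindCandidates("sometext\t123456789\tsometext")` should yield the single candidate `123456789`, while `"some text"` should yield nothing because it contains no run starting with three digits.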

answered Sep 21 '22 23:09 by Sean U


The other answers are wrong: they don't take PPNs into account, and they allow the check digit to be a letter. Here's a better solution.

Based on this document and this document, CUSIPs follow these rules:

  • Length is 9 characters.
  • Characters 1, 2, 3 are digits.
  • Characters 4, 5, 6, 7, 8 are either letters or digits.
  • Characters 6, 7, 8 can also be *, @, #.
  • Character 9 is a check digit.

With this in mind, the following regex should provide a tight match:

^[0-9]{3}[a-zA-Z0-9]{2}[a-zA-Z0-9*@#]{3}[0-9]$

You can play around with it here.
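Because the pattern is anchored with ^ and $, it tests a whole string rather than searching inside one. A minimal C# sketch of one way to apply it to the tab- or space-separated lines from the question is to split on whitespace first and test each token. The names `CusipFormat`, `LooksLikeCusip` and `ShapedTokens` are hypothetical, and the split-then-test approach is my assumption, not something the linked documents prescribe.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CusipFormat
{
    // The anchored pattern from this answer, unchanged.
    private static readonly Regex Shape = new Regex(
        @"^[0-9]{3}[a-zA-Z0-9]{2}[a-zA-Z0-9*@#]{3}[0-9]$");

    // True when the whole token has the CUSIP shape (check digit not verified).
    public static bool LooksLikeCusip(string token)
    {
        return token != null && Shape.IsMatch(token);
    }

    // Splits a line on whitespace and keeps the tokens with the CUSIP shape.
    public static List<string> ShapedTokens(string line)
    {
        var hits = new List<string>();
        // A null separator array makes Split use whitespace as the delimiter.
        string[] tokens = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        foreach (string token in tokens)
        {
            if (LooksLikeCusip(token)) hits.Add(token);
        }
        return hits;
    }
}
```

One .NET detail worth knowing: without `RegexOptions.Multiline`, `$` also matches just before a trailing newline, so swapping it for `\z` would be the stricter choice if inputs may end in `\n`.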

Note that this regex is as tight as possible without diving into too much detail, which would turn the expression into a monster. I suggest you use the check digit algorithm to fully validate the CUSIP, which you can find here.
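As a sketch of what that check might look like in C#, assuming the commonly published modulus-10 scheme (digits count as their value, letters as 10 to 35, `*`, `@`, `#` as 36, 37, 38, every second character is doubled, and the digits of each value are summed). `CusipCheck`, `ValueOf` and `HasValidCheckDigit` are names invented here, so treat this as an illustration to compare against the official description rather than a reference implementation.

```csharp
using System;

public static class CusipCheck
{
    // Maps one CUSIP character to its numeric value, or -1 if it is not allowed.
    private static int ValueOf(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
        if (c >= 'a' && c <= 'z') return c - 'a' + 10;
        if (c == '*') return 36;
        if (c == '@') return 37;
        if (c == '#') return 38;
        return -1;
    }

    // True when the 9th character equals the check digit computed from the first 8.
    public static bool HasValidCheckDigit(string cusip)
    {
        if (cusip == null || cusip.Length != 9) return false;

        int sum = 0;
        for (int i = 0; i < 8; i++)
        {
            int v = ValueOf(cusip[i]);
            if (v < 0) return false;          // character outside the CUSIP alphabet
            if (i % 2 == 1) v *= 2;           // double the 2nd, 4th, 6th and 8th characters
            sum += v / 10 + v % 10;           // add the two decimal digits of v
        }

        int expected = (10 - sum % 10) % 10;
        return cusip[8] - '0' == expected;    // a non-digit 9th character never equals 0..9
    }
}
```

Running the regex first and this check second keeps the expression readable while still rejecting strings that merely have the right shape.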

answered Sep 19 '22 23:09 by Vlad Schnakovszki