
Regex: parsing GitHub usernames (JavaScript)

I'm trying to parse GitHub usernames (that start with @) from a paragraph of text in order to link them to their associated profiles.

The GitHub username constraints are:

  • Alphanumeric with single hyphens (no consecutive hyphens)
  • Cannot begin or end with a hyphen (if it ends with a hyphen, just match everything up until there)
  • Max length of 39 characters.

For example, the following text:

Example @valid hello @valid-username: @another-valid-username, @-invalid @in--valid @ignore-last-dash- [email protected] @another-valid?

The script...

Should match:

  • @valid
  • @valid-username
  • @another-valid-username
  • @in
  • @ignore-last-dash
  • @another-valid

Should ignore:


I'm getting reasonably close with JavaScript by using:

/\B@((?!.*(-){2,}.*)[a-z0-9][a-z0-9-]{0,38}[a-z0-9])/ig

But this isn't matching usernames with a single character (such as @a).
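
The failure is easy to reproduce. In this minimal sketch (the variable names are made up for illustration), the pattern needs one character for its first class and another for its last, so it can never match fewer than two characters after the @:

```javascript
// The pattern from the question, unchanged.
const questionRe = /\B@((?!.*(-){2,}.*)[a-z0-9][a-z0-9-]{0,38}[a-z0-9])/ig;

// With the g flag, match() returns an array of full matches, or null.
const single = "@a".match(questionRe);
const double = "@ab".match(questionRe);

console.log(single); // null
console.log(double); // [ '@ab' ]
```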

Here are my tests so far: https://regex101.com/r/rZ5eW1/2

Is the current regex efficient? And how can I match a single non-hyphen character?

Scott asked May 16 '15 21:05


3 Answers

/\B@([a-z0-9](?:-(?=[a-z0-9])|[a-z0-9]){0,38}(?<=[a-z0-9]))/gi

Note: When this regex runs into a character or sequence that can't be in a username (e.g. . or --), it matches from @ up to that stopping point. The OP says that's fine, so I'm rolling with it. In each of the following, the matched area (NOT the captured area, which excludes the @) is @abc:

@abc.123
@abc--123
@abc-

This works by using lots of nested groups. Regex101 has a fantastic breakdown, but here's mine anyway:

  1. \B: A builtin that means 'not a word boundary'. It seems to do the trick, though it may be problematic if something like [email protected] is a valid email address. At that point, though, it's indistinguishable from the text of someone who doesn't put spaces after punctuation[1] when they start a sentence with an @reference.

Thanks to Honore Doktorr for pointing out that negative lookbehinds don't exist in older JS engines (lookbehind assertions were only added in ES2018).

  2. @: Just the literal @ symbol. One of the few places where a character means what it is.
  3. (...): The capturing group. The way it's placed means that it won't capture the @ symbol, it'll just match it, so it's easier to get the username; there's no need to take a substring.
  4. [a-z0-9]: A character class to match any letter or number. Because of the i flag, this also matches capital letters. Because it's the first character, it must be present.
  5. (?:...): This is a noncapturing group. It wraps a block of regex in a group without capturing it.
  6. ...|...: We have two alternatives, which are...
  7. -(?=[a-z0-9]): A hyphen, followed immediately by a non-hyphen valid character.
  8. [a-z0-9]: A valid non-hyphen character.
  9. {0,38}: Match the noncapturing group between 0 and 38 times, inclusive. Combined with #4, this gives us 39 characters maximum. Anything beyond that will be ignored.
  10. (?<=[a-z0-9]): This is a positive lookbehind, which JS supports as of ES2018. It ensures the last character isn't a hyphen, i.e. that it is a letter or number.
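
To see the pattern in action, here is a rough sketch of extracting and linking mentions. The helper names findUsernames and linkMentions and the sample text are placeholders of my own, not part of the answer, and the lookbehind assumes an ES2018-capable engine:

```javascript
// Pattern from the answer above (needs lookbehind support, i.e. ES2018+).
const USERNAME_RE = /\B@([a-z0-9](?:-(?=[a-z0-9])|[a-z0-9]){0,38}(?<=[a-z0-9]))/gi;

// Hypothetical helper: list the captured usernames (group 1, without the @).
function findUsernames(text) {
  return Array.from(text.matchAll(USERNAME_RE), (m) => m[1]);
}

// Hypothetical helper: wrap each mention in a link to its GitHub profile.
// Assumes `text` has already been HTML-escaped.
function linkMentions(text) {
  return text.replace(
    USERNAME_RE,
    (match, name) => `<a href="https://github.com/${name}">${match}</a>`
  );
}

// Made-up sample text; me@example.com is a placeholder address.
const sample =
  "Example @valid @valid-username: @-invalid @in--valid @ignore-last-dash- me@example.com @a?";

console.log(findUsernames(sample));
// [ 'valid', 'valid-username', 'in', 'ignore-last-dash', 'a' ]
```

Because \B rejects an @ that directly follows a word character, the placeholder email address is skipped, while the single-character @a is picked up.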

This could be 'optimized' a few ways, but honestly, I'd probably use a much simpler regex and do some validation after-the-fact on it, e.g.:

// somehow get the prospective username into `user`
if (user.startsWith('-')) { /* reject */ }
if (user.endsWith('-')) { /* reject */ }
if (user.includes('--')) { /* reject */ }
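
Fleshed out, that "simple regex plus validation" idea might look like the sketch below. The names isValidGitHubUsername, CANDIDATE_RE and extractValidUsernames are hypothetical, and note one difference from the regex above: this version rejects a candidate such as @in--valid or @trailing- outright instead of matching the part before the problem.

```javascript
// Hypothetical validator applying the username constraints from the question.
function isValidGitHubUsername(user) {
  if (user.length < 1 || user.length > 39) return false; // max 39 characters
  if (!/^[a-z0-9-]+$/i.test(user)) return false;         // alphanumerics and hyphens only
  if (user.startsWith("-") || user.endsWith("-")) return false;
  if (user.includes("--")) return false;                 // no consecutive hyphens
  return true;
}

// Deliberately loose pattern: @ followed by any run of letters, digits or hyphens.
const CANDIDATE_RE = /\B@([a-z0-9-]+)/gi;

// Hypothetical helper: collect candidates, then keep only the valid ones.
function extractValidUsernames(text) {
  return Array.from(text.matchAll(CANDIDATE_RE), (m) => m[1])
    .filter(isValidGitHubUsername);
}

console.log(extractValidUsernames("hi @valid @-bad @in--valid @trail- @a"));
// [ 'valid', 'a' ]
```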

At a bare minimum, explain the regex in your code. Feel free to copy-paste mine with credit.

Nic answered Oct 12 '22 23:10


This expression will also match your single-character usernames.

/\B@(?!.*(-){2,}.*)[a-z0-9](?:[a-z0-9-]{0,37}[a-z0-9])?\b/ig

Sample. Explanation:

  1. (?!.*(-){2,}.*): your negative lookahead asserts that the rest of the line after the @ can't contain two or more adjacent dashes. Note that this covers everything up to the next line break, not just the username itself.
  2. [a-z0-9]: there must be one alphanumeric character immediately after @.
  3. (?:[a-z0-9-]{0,37}[a-z0-9])?: there may be anywhere from 0–37 alphanumeric characters or dashes, followed by one alphanumeric character, immediately after #2’s pattern — or there may be none, to cover single-character usernames. (?:…) is for non-capturing grouping.
  4. \b: the whole pattern must end at a word break (which includes -).
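
A few quick checks of this pattern, as a sketch (the variable names are made up). They show the single-character case working, and also how far the negative lookahead reaches: because . stops only at a line break, a double dash anywhere later on the same line suppresses earlier mentions as well.

```javascript
// Pattern from the answer above, unchanged.
const mentionRe = /\B@(?!.*(-){2,}.*)[a-z0-9](?:[a-z0-9-]{0,37}[a-z0-9])?\b/ig;

const oneChar = "@a".match(mentionRe);
const twoNames = "@valid then @in-valid".match(mentionRe);
// A double dash later on the same line blocks every mention before it too.
const sameLine = "@valid then @in--valid".match(mentionRe);
// On a separate line, the lookahead at @valid no longer sees the double dash.
const otherLine = "@valid\n@in--valid".match(mentionRe);

console.log(oneChar);   // [ '@a' ]
console.log(twoNames);  // [ '@valid', '@in-valid' ]
console.log(sameLine);  // null
console.log(otherLine); // [ '@valid' ]
```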

Honore Doktorr answered Oct 13 '22 01:10


I am using this simple RegEx I created to grab GitHub usernames from a Google Form, and it works pretty decently (with one very rare caveat):

^@\w(-\w|\w\w|\w){0,19}$

Where:

  • ^: start of the line
  • @ and -: the literal at sign and dash.
  • \w: [A-Za-z0-9_], i.e. digits, letters (both cases) and the underscore
  • $: end of the line
  • {0,19}: repeat the parenthesized group before it from zero to nineteen times

To summarize:

  • The match must span an entire line (from ^ to $)
  • It starts with an @ followed by a letter (either case), digit or underscore (@A, @1 or @_)
  • It then follows one of the three options in the repetition pattern (...){0,19}:

    • a dash and a \w (1st option)
    • two \w (2nd option)
    • a single \w (3rd option)

This repetition gives the following patterns:

  • Zero times: a single-letter username
  • One time: a two-letter username, a three-letter username, or three characters with a dash in the middle (@w-w)
  • More times: the dash can appear anywhere except at the beginning or the end, and is never doubled
  • 19 times: using only the 1st and 2nd options gives a maximum of 19*2=38 characters, plus the one at the beginning, for 39 characters in total. Each use of the 3rd option makes the total shorter.

Caveat:

  • It does not recognize usernames of the form @ww-w...w (a dash as the third character) that are 39 characters long.
  • It does recognize the form @ww-w...w when the username is shorter than 39 characters.

The problem is that to match ww-w, the pattern is broken down as the first w standing alone, followed by a single w matched by the third option of the repeated group (which leaves only 18 repetitions), followed by -w matched by the first option (leaving only 17). Those 17 repetitions can cover at most 17*2=34 characters, so the maximum is 38 (1+1+2+34) characters, not 39.
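
The length arithmetic and the caveat can be checked directly. This is only a sketch; makeName is a hypothetical helper that pads a prefix with w characters up to the requested length (not counting the @):

```javascript
// Pattern from the answer above, unchanged.
const lineRe = /^@\w(-\w|\w\w|\w){0,19}$/;

// Hypothetical helper: "@" + prefix + enough "w"s to reach `length` characters.
function makeName(prefix, length) {
  return "@" + prefix + "w".repeat(length - prefix.length);
}

console.log(lineRe.test(makeName("w", 39)));   // true: 1 + 19*2 characters
console.log(lineRe.test(makeName("ww-", 38))); // true: 1 + 1 + 2 + 17*2 characters
console.log(lineRe.test(makeName("ww-", 39))); // false: the caveat described above
console.log(lineRe.test(makeName("w", 40)));   // false: longer than 39 characters
```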

But that is really OK for my purposes, so if you need simplicity, here is a RegEx that can give you pretty good answers. I hope this helps you understand it when translating it to JavaScript.

DrBeco answered Oct 12 '22 23:10