I'm trying to parse GitHub usernames (that start with @) from a paragraph of text in order to link them to their associated profiles.
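The GitHub username constraints are:
* Alphanumeric characters with single hyphens only (no consecutive hyphens)
* Cannot begin or end with a hyphen
* Maximum length of 39 characters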
For example, the following text:
Example @valid hello @valid-username: @another-valid-username, @-invalid @in--valid @ignore-last-dash- [email protected] @another-valid?
The script...
Should match:
Should ignore:
I'm getting reasonably close with JavaScript by using:
/\B@((?!.*(-){2,}.*)[a-z0-9][a-z0-9-]{0,38}[a-z0-9])/ig
But this isn't matching usernames with a single character (such as @a).
Here are my tests so far: https://regex101.com/r/rZ5eW1/2
Is the current regex efficient? And how can I match a single non-hyphen character?
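A minimal reproduction of the single-character problem, as a sketch: the regex is copied from the question, while `matchAllMentions` is a hypothetical helper name used only for illustration. Both the first and the last character class in the pattern are mandatory, so the shortest username it can match has two characters.

```javascript
// Regex copied verbatim from the question.
const QUESTION_RE = /\B@((?!.*(-){2,}.*)[a-z0-9][a-z0-9-]{0,38}[a-z0-9])/ig;

// Hypothetical helper: return every full match found in `text` (empty array if none).
function matchAllMentions(text) {
  return text.match(QUESTION_RE) || [];
}

console.log(matchAllMentions('@ab')); // [ '@ab' ]
console.log(matchAllMentions('@a'));  // []
```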
/\B@([a-z0-9](?:-(?=[a-z0-9])|[a-z0-9]){0,38}(?<=[a-z0-9]))/gi
Note: When this regex runs into a character or set of characters that can't be in a username (e.g. . or --), it matches from @ up until that stopping point. OP says that's fine so I'm rolling with it. So, in each of the following, the matched area (NOT the captured area) is @abc:
@abc.123
@abc--123
@abc-
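A minimal sketch of the matched-versus-captured distinction in the note above, assuming a runtime that supports lookbehind assertions. The regex is copied from this answer; `findMentions` and `linkMentions` are hypothetical helper names, and the link markup is only one possible way to point at a profile.

```javascript
// Regex copied from the answer above.
const USERNAME_RE = /\B@([a-z0-9](?:-(?=[a-z0-9])|[a-z0-9]){0,38}(?<=[a-z0-9]))/gi;

// Hypothetical helper: report the full match and the captured username for each hit.
function findMentions(text) {
  return Array.from(text.matchAll(USERNAME_RE), (m) => ({
    matched: m[0],  // includes the leading @
    captured: m[1], // the username only
  }));
}

// Hypothetical helper: wrap each mention in a link to its profile page.
function linkMentions(text) {
  return text.replace(USERNAME_RE, '<a href="https://github.com/$1">@$1</a>');
}

console.log(findMentions('@abc.123 @abc--123 @abc-'));
// three entries, each { matched: '@abc', captured: 'abc' }

console.log(linkMentions('hi @abc-'));
// hi <a href="https://github.com/abc">@abc</a>-
```

Because the capture group excludes the @, the replacement string can reuse `$1` directly for both the URL and the link text.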
This works by using lots of nested groups. Regex101 has a fantastic breakdown, but here's mine anyway:
1. \B: This is a builtin that means 'not a word boundary', which seems to do the trick, though it may be problematic if something like [email protected] is a valid email address. At that point, though, it's indistinguishable from the text of someone who doesn't put spaces after punctuation when they start a sentence with an @reference. Thanks to Honore Doktorr for pointing out that negative lookbehinds don't exist in JS.
2. @: Just the literal @ symbol. One of the few places where a character means what it is.
3. (...): The capturing group. The way it's placed means that it won't capture the @ symbol, it'll just match it, so it's easier to get the username -- no need to get a substring.
4. [a-z0-9]: A character class to match any letter or number. Because of the i flag, this also matches capital letters. Because it's the first letter, it must be present.
5. (?:...): This is a noncapturing group. It wraps a block of regex in a group without capturing it.
6. ...|...: We have two alternatives, which are...
   * -(?=[a-z0-9]): A hyphen, followed immediately by a non-hyphen valid character.
   * [a-z0-9]: A valid non-hyphen character.
7. {0,38}: Match the noncapturing group between 0 and 38 times, inclusive. Combined with #4, this gives us 39 letters maximum. Anything beyond that will be ignored.
8. (?<=[a-z0-9]): This is a positive lookbehind, which JS does support. It ensures the last character isn't a hyphen, or rather, that it is a valid non-hyphen character.

This could be 'optimized' a few ways, but honestly, I'd probably use a much simpler regex and do some validation after-the-fact on it, e.g.:
// somehow get the prospective username into `user`
if (user.startsWith('-')) { /* reject */ }
if (user.endsWith('-')) { /* reject */ }
if (user.includes('--')) { /* reject */ }
At a bare minimum, explain the regex in your code. Feel free to copy-paste mine with credit.
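The "simpler regex plus validation after the fact" idea can be sketched roughly as follows. `CANDIDATE_RE`, `isValidUsername` and `extractUsernames` are hypothetical names, and the 39-character check is an added assumption based on the length limit discussed in this thread. Note that this version rejects a candidate such as @ignore-last-dash- outright instead of trimming the trailing hyphen, which may or may not be what you want.

```javascript
// Loose pattern: grab anything after @ that looks roughly like a username.
const CANDIDATE_RE = /\B@([a-z0-9-]+)/gi;

// Hypothetical validator applying the hyphen checks, plus an assumed length limit.
function isValidUsername(user) {
  if (user.length > 39) return false;      // assumed maximum length
  if (user.startsWith('-')) return false;  // no leading hyphen
  if (user.endsWith('-')) return false;    // no trailing hyphen
  if (user.includes('--')) return false;   // no consecutive hyphens
  return true;
}

// Hypothetical helper: collect candidates, then keep only the valid ones.
function extractUsernames(text) {
  return Array.from(text.matchAll(CANDIDATE_RE), (m) => m[1]).filter(isValidUsername);
}

console.log(extractUsernames('Example @valid hello @valid-username: @-invalid @in--valid @a'));
// [ 'valid', 'valid-username', 'a' ]
```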
This expression will also match your one-word usernames.
/\B@(?!.*(-){2,}.*)[a-z0-9](?:[a-z0-9-]{0,37}[a-z0-9])?\b/ig
Sample. Explanation:
1. (?!.*(-){2,}.*): your negative lookahead asserts that the rest of the line after @ (not just the username) can't contain two or more adjacent dashes.
2. [a-z0-9]: there must be one alphanumeric character immediately after @.
3. (?:[a-z0-9-]{0,37}[a-z0-9])?: there may be anywhere from 0 to 37 alphanumeric characters or dashes, followed by one alphanumeric character, immediately after #2's pattern, or there may be none, to cover single-character usernames. (?:…) is for non-capturing grouping.
4. \b: the whole pattern must end at a word break (which includes -).

I am using this simple RegEx I created to grab GitHub usernames from a Google Forms form, and it works pretty decently (with one very rare caveat):
^@\w(-\w|\w\w|\w){0,19}$
Where:
* ^: start of the line
* @ and -: the at sign and the dash themselves
* \w: [A-Za-z0-9_], i.e. numbers, letters (both cases) and underscores
* $: end of the line
* {0,19}: repeat the parenthesized group before it from zero to nineteen times

To summarize:
* The RegEx must match an entire line (from ^ to $).
* @ must be followed by a letter (either case), number or underscore (@A, @1 or @_).
* Then it will follow one of the three options in the repetition pattern (...){0,19}:
  * -\w (1st opt)
  * \w\w (2nd opt)
  * \w (3rd opt)

This will repeat and give the following patterns:
* Zero times: a single letter username.
* Nineteen times, using only the two-character options (e.g. @w-w...): 19*2=38 characters, plus the one at the beginning, equals 39 characters total. If the third option is used at any point, the total size is smaller.

Caveat:
* It fails to match @ww-w...w (a dash as the third character) when the username is 39 characters long.
* It does match @ww-w...w if the size is less than 39 characters.

The problem is that to achieve ww-w the pattern is broken down as the first w standing alone, followed by a single w as the third option in the repeated expression (which leaves only 18 to go), followed by another repetition as -w (the first option, leaving only 17 to go), and then, with these 17 left, we can only get 17*2=34 characters. That means the maximum would be 38 (34+2+1+1) characters, not 39.
But that is really OK for my purposes, so if you need simplicity, here is a RegEx that can give you pretty good answers. I hope this helps you understand it when translating to JavaScript.
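The length caveat can be checked with a small sketch. The regex is the one from this answer, while `dashThird` is a hypothetical helper that builds a username of the shape ww-w...w with a chosen number of characters after the @.

```javascript
// Regex copied from this answer; it must match a whole line.
const LINE_RE = /^@\w(-\w|\w\w|\w){0,19}$/;

// Hypothetical helper: "@ww-" followed by enough w's to reach `length` characters after the @.
function dashThird(length) {
  return '@ww-' + 'w'.repeat(length - 3);
}

console.log(LINE_RE.test('@' + 'w'.repeat(39))); // true: 1 + 19*2 characters
console.log(LINE_RE.test(dashThird(38)));        // true: 1 + 1 + 2 + 17*2 characters
console.log(LINE_RE.test(dashThird(39)));        // false: would need a 20th repetition
```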