Regex for detecting arbitrary alphanumerics (with possible special characters in between) that are *not* purely alphabetical

Tags: python, regex

Let's say we have a string of free text. Most of it is words, but some entries are numbers, serial numbers, or something similar:

text = """My name is Maximus Awesomeus and my phone number is +13204919920, my sort code is 01-42-42 and my ID is ZUI8012IOI1. Here is a random string that shouldn't be caught: UHAHS-IQOEQI but here is a random string that should be caught IAIUH124242JOOO-1213IH/131IOIHIO"""

In a regex search I would like to ignore all plain words and find anything that could be a number, a serial number, or anything of the sort. In this case that would be:

+13204919920, 01-42-42, ZUI8012IOI1, IAIUH124242JOOO-1213IH/131IOIHIO

I came up with this pattern:

\b(?=.*\d)[A-Za-z0-9._@#/-+]+\b

But the lookahead looks through the entire string, so purely alphabetical words get caught as well if there is even a single digit anywhere in the rest of the string. I'm not sure how to get around that; regex has never been my strong suit.
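
The over-matching can be reproduced with a short Python sketch. Two assumptions: the hyphen is moved to the end of the character class so that the pattern compiles, and the sample sentence is shortened. Otherwise the pattern is the one shown above.

```python
import re

# Shortened sample; the only digits are in the last token.
text = "My name is Maximus and my ID is ZUI8012IOI1"

# The pattern from the question, with the hyphen moved to the end of the
# class (an assumption made here so that the class compiles).
pattern = re.compile(r"\b(?=.*\d)[A-Za-z0-9._@#/+-]+\b")

# (?=.*\d) succeeds at every position that has a digit anywhere later in
# the string, so plain words are returned alongside the serial number.
matches = pattern.findall(text)
print(matches)
```

Because .* is not limited to the current word, the lookahead is satisfied by the digits in ZUI8012IOI1 even when the match starts at Maximus.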

MergeMonster asked Aug 31 '25 03:08


2 Answers

Instead of using a word boundary, you could assert a whitespace boundary to the left.

Then, instead of asserting a digit with a lookahead, you can require one by matching it directly: first match zero or more allowed characters from a class that excludes digits, then match a single digit.

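The hyphen in the character class should be escaped or placed at the beginning or end of the class; in [A-Za-z0-9._@#/-+] the sequence /-+ is read as a character range, and an invalid one, because / sorts after +.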
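
As a small illustration of the hyphen point, the sketch below compiles the character class three ways with Python's re module. The variable names are only for illustration.

```python
import re

# With the hyphen between "/" and "+", the class contains the range "/-+",
# which is reversed ("/" has a higher code point than "+").
try:
    re.compile(r"[A-Za-z0-9._@#/-+]+")
    compiled_unescaped = True
except re.error as exc:
    compiled_unescaped = False
    print("pattern rejected:", exc)

# Hyphen placed last, or escaped: both are treated as a literal "-".
hyphen_last = re.compile(r"[A-Za-z0-9._@#/+-]+")
hyphen_escaped = re.compile(r"[A-Za-z0-9._@#/\-+]+")

print(hyphen_last.fullmatch("01-42-42") is not None)
print(hyphen_escaped.fullmatch("01-42-42") is not None)
```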

(?<!\S)[A-Za-z._@#/+-]*[0-9][A-Za-z0-9._@#/+-]*[A-Za-z0-9_@#/+-]

The pattern matches:

  • (?<!\S) Assert whitespace boundary to the left
  • [A-Za-z._@#/+-]* Match 0+ times any of the allowed characters without a digit
  • [0-9] Match a single digit
  • [A-Za-z0-9._@#/+-]* Match 0+ times any of the allowed characters including a digit
  • [A-Za-z0-9_@#/+-] Match a single char from the character class without a dot, so a match cannot end with a dot (for example the full stop after ZUI8012IOI1)

Or shorter:

(?<!\S)[a-zA-Z_.@#/+-]*\d[\w.@#/+-]*\w
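
To check both patterns against the sample text from the question, a minimal Python sketch could look like this. The variable names are illustrative, and the text is split across string literals only for readability.

```python
import re

text = (
    "My name is Maximus Awesomeus and my phone number is +13204919920, "
    "my sort code is 01-42-42 and my ID is ZUI8012IOI1. Here is a random "
    "string that shouldn't be caught: UHAHS-IQOEQI but here is a random "
    "string that should be caught IAIUH124242JOOO-1213IH/131IOIHIO"
)

# Explicit character classes, as in the longer pattern above.
long_pattern = re.compile(
    r"(?<!\S)[A-Za-z._@#/+-]*[0-9][A-Za-z0-9._@#/+-]*[A-Za-z0-9_@#/+-]"
)
# Shorter variant using \d and \w.
short_pattern = re.compile(r"(?<!\S)[a-zA-Z_.@#/+-]*\d[\w.@#/+-]*\w")

long_matches = long_pattern.findall(text)
short_matches = short_pattern.findall(text)
print(long_matches)
print(short_matches)

# Both patterns need a digit plus at least one more character.
lone_digit = long_pattern.findall("room 5") + short_pattern.findall("room 5")
print(lone_digit)
```

Both patterns return the four expected tokens for this text and skip UHAHS-IQOEQI. Two things to keep in mind: a token consisting of a single digit is not matched, because each pattern requires one more character after the digit, and in Python 3 the \d and \w in the shorter variant also match non-ASCII digits and word characters unless re.ASCII is used.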

The fourth bird answered Sep 02 '25 16:09


(?:^|(?<=\s))(?=\S*\d)[A-Za-z0-9._@#\/+-]+\b

Your regex works fine with a small change. See the demo: https://regex101.com/r/hX4jMa/1

The lookahead .*\d was spanning multiple words, and that was the issue. Using \S*\d keeps it inside the current whitespace-delimited token.
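
A quick way to try this pattern in Python is sketched below, using the sample text from the question. The abc,123 input at the end is a hypothetical edge case added here, not part of the original question.

```python
import re

text = (
    "My name is Maximus Awesomeus and my phone number is +13204919920, "
    "my sort code is 01-42-42 and my ID is ZUI8012IOI1. Here is a random "
    "string that shouldn't be caught: UHAHS-IQOEQI but here is a random "
    "string that should be caught IAIUH124242JOOO-1213IH/131IOIHIO"
)

# The lookahead (?=\S*\d) only scans up to the next whitespace character.
pattern = re.compile(r"(?:^|(?<=\s))(?=\S*\d)[A-Za-z0-9._@#\/+-]+\b")

matches = pattern.findall(text)
print(matches)

# Hypothetical edge case: the lookahead sees the digits after the comma,
# but the comma is not in the character class, so the match stops at it.
edge = pattern.findall("abc,123")
print(edge)
```

For the sample text this returns the four expected tokens. The lookahead inspects the whole whitespace-delimited token, including characters outside the character class, so for an input like abc,123 the match is abc, which is purely alphabetical. Whether that matters depends on the data.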

vks answered Sep 02 '25 17:09