Python regex to match 6-digit numbers of different formats

Tags: python, regex

I need to find and capture every 6-digit number that carries an OMIM: or MIM# prefix, as well as every 6-digit number that is not preceded by a colon. Numbers with more than six digits (such as 3333333 below) should not match.

Expected output

[111111, 222222, 555555, 444444]

What I have tried

import re

sentence = '111111;Dystonia-1,222222,OMIM:555555; 3333333 Dystonic disorder1,MIM#444444'

re1 = r'OMIM:(\d{6})'
re2 = r'MIM#(\d{6})'
re3 = r'[^:](\d{6})'

identifiers = re.compile("(%s|%s|%s)" % (re1, re2, re3)).findall(sentence)

Current output

[
  ( ',222222'     , ''       , ''       , '222222' ),
  ( 'OMIM:555555' , '555555' , ''       , ''       ),
  ( ' 333333'     , ''       , ''       , '333333' ),
  ( 'MIM#444444'  , ''       , '444444' , ''       )
]
vendrediSurMer asked Feb 17 '21 14:02



1 Answer

I think you could try:

\b(?:MIM#|OMIM:|(?<!:))(\d{6})\b

  • \b - Word boundary.
  • (?: - Non-capture group:
    • MIM#|OMIM:|(?<!:) - Match a literal "MIM#" or "OMIM:"; failing both, use a negative lookbehind to assert that the position is not preceded by a colon.
    • ) - Close non-capture group.
  • (\d{6}) - Capture six digits in a capture group.
  • \b - Word boundary.
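
If it helps readability, the same pattern can probably be written with re.VERBOSE so each piece carries its own comment. This is only a sketch: the name OMIM_ID is made up for illustration, and note that "#" has to be escaped in verbose mode because it would otherwise start a comment.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE.
# OMIM_ID is a hypothetical name chosen for this sketch.
# In verbose mode an unescaped '#' starts a comment, hence 'MIM\#'.
OMIM_ID = re.compile(
    r"""
    \b              # word boundary before the prefix or the digits
    (?:
        MIM\#       # literal "MIM#"
      | OMIM:       # literal "OMIM:"
      | (?<!:)      # no prefix: the digits must not follow a colon
    )
    (\d{6})         # capture exactly six digits
    \b              # word boundary, so a seventh digit rejects the match
    """,
    re.VERBOSE,
)

sentence = '111111;Dystonia-1,222222,OMIM:555555; 3333333 Dystonic disorder1,MIM#444444'
ids = OMIM_ID.findall(sentence)
print(ids)
```

Because the pattern has a single capture group, findall returns a flat list of strings rather than tuples.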

import re
sentence = '111111;Dystonia-1,222222,OMIM:555555; 3333333 Dystonic disorder1,MIM#444444'
print(re.findall(r'\b(?:MIM#|OMIM:|(?<!:))(\d{6})\b', sentence))

Prints:

['111111', '222222', '555555', '444444']
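
The expected output in the question lists integers, whereas findall returns strings. Assuming integers are what you want, a small wrapper along these lines might do; extract_ids is a hypothetical helper name, and the 'ORPHA:' input is just an invented example of another colon-terminated prefix.

```python
import re

PATTERN = re.compile(r'\b(?:MIM#|OMIM:|(?<!:))(\d{6})\b')

def extract_ids(text):
    """Return the captured 6-digit identifiers as integers (hypothetical helper)."""
    return [int(digits) for digits in PATTERN.findall(text)]

sentence = '111111;Dystonia-1,222222,OMIM:555555; 3333333 Dystonic disorder1,MIM#444444'
print(extract_ids(sentence))        # [111111, 222222, 555555, 444444]

# Six digits after any other colon-terminated prefix are skipped,
# because the lookbehind rejects a preceding colon.
print(extract_ids('ORPHA:123456'))  # []
```

One caveat: int() drops leading zeros, so keep the strings if an identifier could ever start with 0.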
JvdV answered Oct 12 '22 23:10