Roman numerals are written using the letters M, D, C, L, X, V, and I, representing the values 1000, 500, 100, 50, 10, 5, and 1, respectively. The first regex matches any string composed of these letters, without checking whether the letters appear in the order or quantity necessary to form a proper Roman numeral.
To match a character that has special meaning in regex, you need to prefix it with a backslash ( \ ) to form an escape sequence. E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (backslash).
To match any digit from 0 to 9, use \d in regex. It matches a single digit and is shorthand for [0-9], where [] denotes a character class and 0-9 is a range inside it. This saves writing out [0123456789].
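The escaping and \d rules above can be sketched with Python's re module. This is only an illustration: the sample strings are made up, and the variable names are not from the original text. One caveat that applies to Python 3 specifically: for str patterns, \d also matches non-ASCII Unicode digits unless the re.ASCII flag is given.

```python
import re

# Escaped metacharacters match themselves literally.
literal_hits = [
    bool(re.fullmatch(r"\.", ".")),
    bool(re.fullmatch(r"\+", "+")),
    bool(re.fullmatch(r"\(", "(")),
    bool(re.fullmatch(r"\\", "\\")),  # pattern \\ matches one backslash
]

# Unescaped, "." is a wildcard and matches any single character.
dot_is_wildcard = bool(re.fullmatch(r".", "a"))

# \d matches one digit; re.ASCII restricts it to exactly [0-9].
digits = re.findall(r"\d", "a1b22c", flags=re.ASCII)

print(literal_hits, dot_is_wildcard, digits)
# → [True, True, True, True] True ['1', '2', '2']
```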
You can use the following regex for this:
^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$
Breaking it down, M{0,4} specifies the thousands section and restricts it to between 0 and 4000. It's relatively simple:
0: <empty> matched by M{0}
1000: M matched by M{1}
2000: MM matched by M{2}
3000: MMM matched by M{3}
4000: MMMM matched by M{4}
You could, of course, use something like M* to allow any number (including zero) of thousands, if you want to allow bigger numbers.
Next is (CM|CD|D?C{0,3}), which is slightly more complex. It handles the hundreds section and covers all the possibilities:
0: <empty> matched by D?C{0} (with D not there)
100: C matched by D?C{1} (with D not there)
200: CC matched by D?C{2} (with D not there)
300: CCC matched by D?C{3} (with D not there)
400: CD matched by CD
500: D matched by D?C{0} (with D there)
600: DC matched by D?C{1} (with D there)
700: DCC matched by D?C{2} (with D there)
800: DCCC matched by D?C{3} (with D there)
900: CM matched by CM
Thirdly, (XC|XL|L?X{0,3}) follows the same rules as the previous section, but for the tens place:
0: <empty> matched by L?X{0} (with L not there)
10: X matched by L?X{1} (with L not there)
20: XX matched by L?X{2} (with L not there)
30: XXX matched by L?X{3} (with L not there)
40: XL matched by XL
50: L matched by L?X{0} (with L there)
60: LX matched by L?X{1} (with L there)
70: LXX matched by L?X{2} (with L there)
80: LXXX matched by L?X{3} (with L there)
90: XC matched by XC
And, finally, (IX|IV|V?I{0,3}) is the units section, handling 0 through 9. It is similar to the previous two sections (Roman numerals, despite their seeming weirdness, follow some logical rules once you figure out what they are):
0: <empty> matched by V?I{0} (with V not there)
1: I matched by V?I{1} (with V not there)
2: II matched by V?I{2} (with V not there)
3: III matched by V?I{3} (with V not there)
4: IV matched by IV
5: V matched by V?I{0} (with V there)
6: VI matched by V?I{1} (with V there)
7: VII matched by V?I{2} (with V there)
8: VIII matched by V?I{3} (with V there)
9: IX matched by IX
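As a rough check of the breakdown above, here is a minimal Python sketch that applies the same pattern and prints the hundreds, tens and units groups it captured. The names ROMAN_RE and is_roman are hypothetical, chosen only for this example.

```python
import re

# The pattern from the answer; ROMAN_RE and is_roman are illustrative names.
ROMAN_RE = re.compile(
    r"^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)

def is_roman(s):
    """Return True if s matches the strict Roman numeral pattern."""
    return ROMAN_RE.match(s) is not None

# The three capture groups are the hundreds, tens and units sections.
parts = ROMAN_RE.match("MCMXCIV").groups()
print(parts)  # → ('CM', 'XC', 'IV')

results = {s: is_roman(s) for s in ["MCMXCIV", "MMMM", "IIII", "XM", ""]}
print(results)
# → {'MCMXCIV': True, 'MMMM': True, 'IIII': False, 'XM': False, '': True}
```

Note that the empty string is accepted, because every section of the pattern is allowed to match nothing.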
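Just keep in mind that this regex will also match an empty string. Wrapping the anchors in look-arounds, as in (?<=^)...(?=$), does not help, because those assertions are equivalent to plain ^ and $. If you don't want empty matches (and your regex engine supports look-ahead), you can add a positive look-ahead that requires at least one numeral letter:
^(?=[MDCLXVI])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$
(the other alternative being to just check that the length is not zero beforehand).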
Actually, your premise is flawed. 990 IS "XM", as well as "CMXC".
The Romans were far less concerned about the "rules" than your third grade teacher. As long as it added up, it was OK. Hence "IIII" was just as good as "IV" for 4. And "IIM" was completely cool for 998.
(If you have trouble dealing with that... Remember English spellings were not formalized until the 1700s. Until then, as long as the reader could figure it out, it was good enough).
Just to save it here:
(^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$)
Matches all the Roman numerals. Doesn't care about empty strings (requires at least one Roman numeral letter). Should work in PCRE, Perl, Python and Ruby.
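A quick way to see the effect of the (?=[MDCLXVI]) look-ahead is to run the pattern against a few inputs in Python. This is a sketch only: the name NONEMPTY_ROMAN and the test strings are made up for the example.

```python
import re

# Pattern copied from the answer above; the look-ahead demands one numeral letter.
NONEMPTY_ROMAN = re.compile(
    r"(^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$)"
)

results = {s: bool(NONEMPTY_ROMAN.match(s)) for s in ["", "XIV", "MMMMM", "IL"]}
print(results)
# → {'': False, 'XIV': True, 'MMMMM': True, 'IL': False}
```

The empty string is rejected by the look-ahead, while M* still allows any number of thousands.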
Online Ruby demo: http://rubular.com/r/KLPR1zq3Hj
Online Conversion: http://www.onlineconversion.com/roman_numerals_advanced.htm
To avoid matching the empty string you'll need to repeat the pattern four times and replace each 0 with a 1 in turn, and account for V, L and D:
(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))
In this case (because this pattern uses ^ and $) you would be better off checking for empty lines first and not bothering to match them. If you are using word boundaries then you don't have a problem because there's no such thing as an empty word. (At least regex doesn't define one; don't start philosophising, I'm being pragmatic here!)
In my own particular (real-world) case I needed to match numerals at word endings and I found no other way around it. I needed to scrub the footnote numbers from my plain-text document, where text such as "the Red Sea^cl and the Great Barrier Reef^cli" (with cl and cli as superscript footnote markers) had been converted to the Red Seacl and the Great Barrier Reefcli. But I still had problems with valid words: Tahiti and fantastic were scrubbed into Tahit and fantasti.
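The four-alternative pattern above can be checked exhaustively with a short Python sketch. The to_roman helper is a hypothetical converter written only for this test (it is not part of the original answer), and fullmatch stands in for the ^ and $ anchors.

```python
import re

# The four alternatives from the pattern above, one per line for readability.
alternatives = [
    r"M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})",
    r"M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})",
    r"M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})",
    r"M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3})",
]
NONEMPTY = re.compile("(" + "|".join(alternatives) + ")")

def to_roman(n):
    """Hypothetical helper: convert 1..4999 to a standard Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, letters in pairs:
        count, n = divmod(n, value)
        out.append(letters * count)
    return "".join(out)

# Every numeral from 1 to 4999 should match; the empty string should not.
missed = [n for n in range(1, 5000) if not NONEMPTY.fullmatch(to_roman(n))]
rejects_empty = NONEMPTY.fullmatch("") is None
print(missed, rejects_empty)  # → [] True
```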
Fortunately, the range of numbers is limited to 1..3999 or thereabouts. Therefore, you can build up the regex piecemeal.
<opt-thousands-part><opt-hundreds-part><opt-tens-part><opt-units-part>
Each of those parts will deal with the vagaries of Roman notation. For example, using Perl notation:
<opt-hundreds-part> = m/(CM|DC{0,3}|CD|C{1,3})?/;
Repeat and assemble.
Added: The <opt-hundreds-part> can be compressed further:
<opt-hundreds-part> = m/(C[MD]|D?C{0,3})/;
Since the 'D?C{0,3}' clause can match nothing, there's no need for the question mark. And, most likely, the parentheses should be the non-capturing type - in Perl:
<opt-hundreds-part> = m/(?:C[MD]|D?C{0,3})/;
Of course, it should all be case-insensitive, too.
You can also extend this to deal with the options mentioned by James Curran (to allow XM or IM for 990 or 999, and CCCC for 400, etc).
<opt-hundreds-part> = m/(?:[IXC][MD]|D?C{0,4})/;
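The "repeat and assemble" step might look like the following in Python. This is a sketch under assumptions: only the hundreds part is given in the answer, so the thousands, tens and units parts here are written by analogy, the names are hypothetical, and the strict form of the hundreds part is used rather than the extended one.

```python
import re

# Hypothetical part names mirroring the <opt-...-part> placeholders.
THOUSANDS = r"M{0,3}"                # assumes the 1..3999 range
HUNDREDS = r"(?:C[MD]|D?C{0,3})"     # as given in the answer
TENS = r"(?:X[CL]|L?X{0,3})"         # by analogy (assumption)
UNITS = r"(?:I[XV]|V?I{0,3})"        # by analogy (assumption)

# Case-insensitive, as the answer suggests.
ROMAN = re.compile(THOUSANDS + HUNDREDS + TENS + UNITS, re.IGNORECASE)

def is_roman(s):
    # Every part is optional, so reject the empty string explicitly.
    return bool(s) and ROMAN.fullmatch(s) is not None

checks = [is_roman(s) for s in ["mcmxciv", "MMXXIV", "IC", ""]]
print(checks)  # → [True, True, False, False]
```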
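import re
# Strict pattern: at most three M's, so the range is 0..3999.
pattern = r'^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
# 'XCCMCI' uses only numeral letters, but in an invalid order.
if re.search(pattern, 'XCCMCI'):
    print('Valid Roman')
else:
    print('Not valid Roman')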
For people who really want to understand the logic, please take a look at the step-by-step explanation, spread over three pages, on diveintopython.
The only difference from the original solution (which had M{0,4}) is that I found 'MMMM' is not a valid Roman numeral (the old Romans most probably never thought about such a huge number, and would disagree with me). If you are one of the disagreeing old Romans, please forgive me and use the {0,4} version.
In my case, I was trying to find and replace all occurrences of Roman numerals with one word inside the text, so I couldn't use the start and end of lines. As a result, the @paxdiablo solution found many zero-length matches. I ended up with the following expression:
(?=\b[MCDXLVI]{1,6}\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})
My final Python code was like this:
import re
text = "RULES OF LIFE: I. STAY CURIOUS; II. NEVER STOP LEARNING"
text = re.sub(r'(?=\b[MCDXLVI]{1,6}\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})', 'ROMAN', text)
print(text)
Output:
RULES OF LIFE: ROMAN. STAY CURIOUS; ROMAN. NEVER STOP LEARNING
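To illustrate why the look-ahead guard matters, this small sketch compares the unanchored body of the pattern with and without it. The sample text and the names BODY and GUARD are made up for the example.

```python
import re

BODY = r"M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
GUARD = r"(?=\b[MCDXLVI]{1,6}\b)"  # whole word made of 1-6 numeral letters

text = "CHAPTER IV AND XII"

# Without the guard, the body matches the empty string almost everywhere.
plain = re.findall(BODY, text)
has_empty_matches = "" in plain

# With the guard, only whole words made of numeral letters are considered.
guarded = re.findall(GUARD + BODY, text)

print(has_empty_matches, guarded)  # → True ['IV', 'XII']
```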