
How to catch Roman numerals inside a string?

Tags:

regex

perl

I want to catch Roman numerals inside a string (numbers below 80 are good enough). I found a good basis for this in "How do you match only valid roman numerals with a regular expression?". The problem is that it deals with whole strings. I have not yet found a way to detect Roman numerals inside a string, because nothing in the pattern is mandatory: every group may be optional. So far I have tried something like this:

my $x = ' some text I-LXIII iv more ';

if (  $x =~  s/\b(
                    (
                        (XC|XL|L?X{0,3}) # first group 10-90
                    |
                        (IX|IV|V?I{0,3}) # second group 1-9
                    )+
            )
        \b/>$1</xgi ) { # mark every occurrence
    say $x;
}

__END__
 ><some>< ><text>< ><>I<><-><>LXIII<>< ><>iv<>< ><more>< 
 desired output:
  some text >I<->LXIII< >iv< more 

So this one also matches at bare word boundaries, because all groups are optional. How can I get this done? How can I make one of those two groups mandatory when it is not possible to say which one must be present? Other approaches to catching Roman numerals are welcome too.

w.k asked Oct 18 '12 08:10


2 Answers

You can use the Roman module from CPAN, whose isroman() function does the validation:

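use strict;
use warnings;
use feature 'say';
use Roman;    # CPAN module; exports isroman()

my $x = ' some text I-LXIII VII XCVI IIIXII iv more ';

# Grab every word made only of Roman digits; /e runs the replacement
# as code, so only words that isroman() accepts get marked.
if (  $x =~  
    s/\b
    (
        [IVXLC]+
    )
    \b
    /isroman($1) ? ">$1<" : $1/exgi ) {
    say $x;
}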

output:

some text >I<->LXIII< >VII< >XCVI< IIIXII >iv< more 
Toto answered Nov 01 '22 11:11


This is where Perl lets us down with its missing \< and \> (beginning and end word boundary) constructs that are available elsewhere. A pattern like \b...\b will match even if the ... consumes none of the target string because the second \b will happily match the beginning word boundary a second time.
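To see the problem in isolation, here is a minimal sketch. The sample text and the cut-down pattern are illustrative assumptions, not the asker's exact code. Every part of the pattern is optional, so it can match the empty string wherever a word boundary sits:

```perl
use strict;
use warnings;
use feature 'say';

my $text = 'some text';

# In list context, m//g returns the capture from every match.
my @captures = $text =~ /\b(X{0,3}I{0,3})\b/gi;

# Count the matches that consumed nothing at all.
my $empty = grep { $_ eq '' } @captures;
say scalar(@captures), " matches, $empty of them empty";
```

Neither word contains a Roman numeral, yet the pattern still matches once at each end of each word, which is where the stray >< pairs in the question's output come from.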

However, an end word boundary is just (?<=\w)(?!\w), so we can use that instead.
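As a rough illustration, both one-sided boundaries can be spelled out with look-arounds. The variable names $start_of_word and $end_of_word are made up for this sketch, and the start-of-word form is simply the mirror image of the expression above:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical names; the look-arounds themselves are plain Perl.
my $start_of_word = qr/(?<!\w)(?=\w)/;    # like \< in other tools
my $end_of_word   = qr/(?<=\w)(?!\w)/;    # like \> in other tools

my $text = 'I-LXIII iv';

# Mark each kind of boundary with its own bracket.
( my $marked = $text ) =~ s/$start_of_word/[/g;
$marked =~ s/$end_of_word/]/g;

say $marked;    # prints [I]-[LXIII] [iv]
```

Unlike \b, each of these assertions matches on only one side of a word, so an empty match cannot satisfy both at once.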

This program will do what you want. It does a look-ahead for a string of potential Roman characters enclosed in word boundaries (so we must be at a beginning word boundary) and then checks for a legal Roman number that isn't followed by a word character (so now we're at an end word boundary).

Note that I've reversed your >...< marks as they were confusing me.

use strict;
use warnings;

use feature 'say';

my $x = ' some text I-LXIII iv more ';

if ( $x =~ s{
    (?= \b [CLXVI]+ \b )
    (
      (?:XC|XL|L?X{0,3})?
      (?:IX|IV|V?I{0,3})?
    )
    (?!\w)
    }
    {<$1>}xgi ) {

    say $x;
}

output

some text <I>-<LXIII> <iv> more 
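To try the same pattern on other inputs, it can be wrapped in a small function. This is only a sketch: the name mark_romans and the extra test words are assumptions added for illustration, and the pattern covers 1 to 99 only.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: wrap each valid Roman numeral in <...>.
sub mark_romans {
    my ($text) = @_;
    $text =~ s{
        (?= \b [CLXVI]+ \b )          # a whole word of Roman digits starts here
        ( (?:XC|XL|L?X{0,3})?         # tens
          (?:IX|IV|V?I{0,3})? )       # units
        (?!\w)                        # and the numeral ends the word
    }{<$1>}xgi;
    return $text;
}

say mark_romans('see VII XCVI IIIXII civil iv');
# prints: see <VII> <XCVI> IIIXII civil <iv>
```

Words such as IIIXII and civil consist only of Roman digits but are not valid numerals, so they are left alone. An ordinary English "I" would still be marked, since the pattern cannot tell it from the numeral.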
Borodin answered Nov 01 '22 12:11