
python multiline regex

Tags: python, regex

I'm having trouble writing the correct regular expression for a multiline match. Can someone point out what I'm doing wrong? I'm looping through a basic dhcpd.conf file with hundreds of entries such as:

host node20007
{
    hardware ethernet 00:22:38:8f:1f:43;
    fixed-address node20007.domain.com;
}

I've gotten various regexes to work for the MAC and the fixed-address, but I cannot combine them to match properly.

import re

f = open('/etc/dhcp3/dhcpd.conf', 'r')
re_hostinfo = re.compile(r'(hardware ethernet (.*))\;(?:\n|\r|\r\n?)(.*)', re.MULTILINE)

for host in f:
    match = re_hostinfo.search(host)
    if match:
        print(match.groups())

Currently my match groups look like:
('hardware ethernet 00:22:38:8f:1f:43', '00:22:38:8f:1f:43', '')

But I'm looking for something like:
('hardware ethernet 00:22:38:8f:1f:43', '00:22:38:8f:1f:43', 'node20007.domain.com')

Joshua asked Jan 19 '11 21:01



1 Answer

Update: I've just noticed the real reason you are getting those results. In your code:

for host in f:
    match = re_hostinfo.search(host)
    if match:
        print(match.groups())

Iterating over a file object yields one line at a time, so host refers to a single line, but your pattern needs to work over two lines.

Try this:

data = f.read()
for x in regex.finditer(data):
    process(x.groups())

where regex is a compiled pattern that matches over two lines, and process stands for whatever you want to do with each match's groups.
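As a rough sketch of that read-everything-then-finditer idea (written for Python 3, so print is a function): the string below stands in for the result of f.read(), its second host entry is invented for illustration, and the names sample, two_line_re and pairs are my own. The pattern is a trimmed version of the one discussed later in this answer.

```python
import re

# Stand-in for f.read(); the second entry is made up for illustration.
sample = """\
host node20007
{
    hardware ethernet 00:22:38:8f:1f:43;
    fixed-address node20007.domain.com;
}
host node20008
{
    hardware ethernet 00:22:38:8f:1f:44;
    fixed-address node20008.domain.com;
}
"""

# \s+ matches newlines too, so one pattern can span both lines of an entry.
two_line_re = re.compile(r'hardware ethernet\s+(\S+);\s+\S+\s+(\S+);')

# finditer walks the whole text and yields one match object per entry.
pairs = [m.groups() for m in two_line_re.finditer(sample)]
for mac, address in pairs:
    print(mac, address)
```

With a real file you would replace sample with the text read from the file; the rest should stay the same as long as the entries look like the one in the question.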

If your file is large, and you are sure that the pieces of interest are always spread over exactly two consecutive lines, you could read the file a line at a time: check each line for the first part of the pattern, and set a flag that tells you whether the next line should be checked for the second part. If you are not sure, things get complicated, perhaps complicated enough to start looking at the pyparsing module.
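A minimal sketch of that line-at-a-time approach, assuming the second line always starts with the fixed-address keyword (Python 3). The names mac_re, addr_re and iter_pairs are hypothetical, and the remembered MAC doubles as the "check the next line" flag.

```python
import re

# Hypothetical one-line patterns; addr_re assumes the keyword "fixed-address".
mac_re = re.compile(r'hardware ethernet\s+(\S+);')
addr_re = re.compile(r'fixed-address\s+(\S+);')

def iter_pairs(lines):
    """Yield (mac, address) when a MAC line is directly followed by an address line."""
    pending_mac = None  # not None means: check this line for the second part
    for line in lines:
        if pending_mac is not None:
            m = addr_re.search(line)
            if m:
                yield pending_mac, m.group(1)
            pending_mac = None  # the flag only covers the very next line
        m = mac_re.search(line)
        if m:
            pending_mac = m.group(1)

# Any iterable of lines works, including an open file object.
lines = [
    "host node20007\n",
    "{\n",
    "    hardware ethernet 00:22:38:8f:1f:43;\n",
    "    fixed-address node20007.domain.com;\n",
    "}\n",
]
result = list(iter_pairs(lines))
print(result)
```

Because iter_pairs is a generator, passing it the open file itself keeps only one line in memory at a time, which is the point of this variant.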

Now back to the original answer, discussing the pattern that you should use:

You don't need MULTILINE, which only changes where ^ and $ match; just match whitespace, which includes newlines. Build up your pattern using these building blocks:

(1) fixed text
(2) one or more whitespace characters: \s+
(3) one or more non-whitespace characters: \S+

and then put in parentheses to get your groups.
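To illustrate why MULTILINE is not the relevant flag here, a small sketch (Python 3; the two-line string text and the variable names are my own, built from the sample entry in the question):

```python
import re

text = ("hardware ethernet 00:22:38:8f:1f:43;\n"
        "    fixed-address node20007.domain.com;")

# No flags at all: \s+ consumes the newline and the indentation.
no_flag = re.search(r';\s+fixed-address\s+(\S+);', text)

# '.' stops at a newline by default, and MULTILINE does not change that,
# so this pattern cannot reach the second line.
dot_only = re.search(r'ethernet.*fixed-address', text, re.MULTILINE)

# What MULTILINE does change: ^ may also match right after a newline.
without_flag = re.findall(r'^\s*(\S+)', text)
with_flag = re.findall(r'^\s*(\S+)', text, re.MULTILINE)

print(no_flag.group(1))   # node20007.domain.com
print(dot_only)           # None
print(without_flag)       # ['hardware']
print(with_flag)          # ['hardware', 'fixed-address']
```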

Try this:

>>> m = re.search(r'(hardware ethernet\s+(\S+));\s+\S+\s+(\S+);', data)
>>> print(m.groups())
('hardware ethernet   00:22:38:8f:1f:43', '00:22:38:8f:1f:43', 'node20007.domain.com')
>>>

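Please consider using "verbose mode" (re.VERBOSE). You can use it to document exactly which pieces of the pattern match which pieces of the data, and it often helps in getting the pattern right in the first place. Because unescaped whitespace in the pattern is ignored in this mode, a literal space has to be written as [ ] or escaped. Example: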

>>> regex = re.compile(r"""
... (hardware[ ]ethernet \s+
...     (\S+) # MAC
... ) ;
... \s+ # includes newline
... \S+ # variable(??) text e.g. "fixed-address"
... \s+
... (\S+) # e.g. "node20007.domain.com"
... ;
... """, re.VERBOSE)
>>> print(regex.search(data).groups())
('hardware ethernet   00:22:38:8f:1f:43', '00:22:38:8f:1f:43', 'node20007.domain.com')
>>>
John Machin answered Oct 12 '22 09:10