
python3: extract IP address from compiled pattern

I want to process every line in my log file and extract the IP address if the line matches my pattern. There are several different types of messages; in the example below I am using p1 and p2.

I could read the file line by line and test each line against each pattern. But since there may be many more patterns, I would like to do it as efficiently as possible. I was hoping to compile those patterns into one object and do the match only once per line:

import re
import sys

IP = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

p1 = 'Registration from ' + IP + ' - Wrong password'
p2 = 'Call from ' + IP + ' rejected because extension not found'

# fails here: both alternatives define a group named "ip"
c = re.compile(r'(?:' + p1 + '|' + p2 + ')')

for line in sys.stdin:
    match = c.search(line)
    if match:
        print(match['ip'])

But the above code does not work; it complains that ip is used twice.

What is the most elegant way to achieve my goal?
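A minimal reproduction of that error, using shortened hypothetical patterns rather than the real log messages, might look like this. It only illustrates that the standard re module rejects a pattern in which the same group name appears twice, even when the two groups sit in different alternatives:

```python
import re

# Two alternatives that both define a group named "ip"
# (shortened, hypothetical patterns for illustration only).
duplicated = r'(?:from (?P<ip>\d+)|to (?P<ip>\d+))'

try:
    re.compile(duplicated)
    error_message = None
except re.error as exc:
    # The standard-library re module requires every group name to be unique.
    error_message = str(exc)

print(error_message)
```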

EDIT:

I have modified my code based on answer from @Dev Khadka.

But I am still struggling with how to properly handle the multiple ip matches. The code below prints all IPs that matched p1:

for line in sys.stdin:
    match = c.search(line)
    if match:
        print(match['ip1'])

But some lines don't match p1; they match p2. That is, I get:

1.2.3.4
None
2.3.4.5
...

How do I print the matching IP when I don't know whether it was p1, p2, and so on? All I want is the IP. I don't care which pattern it matched.

Martin Vegter asked Oct 16 '19 15:10


2 Answers

You can consider installing the excellent third-party regex module, which supports many advanced regex features, including branch reset groups, which are designed to solve exactly the problem you outlined in this question. Branch reset groups are denoted by (?|...). Within a branch reset group, capture groups at the same position or with the same name in different alternatives share a single capture group in the output.

Notice that in the example below the matching capture group becomes the named capture group, so that you don't need to iterate over multiple groups searching for a non-empty group:

import sys

import regex

ip_pattern = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
patterns = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found'
]
# (?|...) is a branch reset group: every alternative may reuse the name "ip"
pattern = regex.compile('(?|%s)' % '|'.join(patterns).format(ip=ip_pattern))
for line in sys.stdin:
    match = pattern.search(line)
    if match:
        print(match['ip'])

Demo: https://repl.it/@blhsing/RegularEmbellishedBugs

blhsing answered Oct 12 '22 01:10


Why don't you check which regex matched?

if match['ip1'] is not None:
    print(match['ip1'])
if match['ip2'] is not None:
    print(match['ip2'])

or something like:

names = ['ip1', 'ip2', 'ip3']
groups = match.groupdict()   # maps each group name to its text, or None
for n in names:
    if groups.get(n) is not None:
        print(groups[n])

or even

num = 1000   # can easily handle millions of patterns =)
groups = match.groupdict()
for i in range(1, num + 1):
    name = 'ip%d' % i
    if groups.get(name) is not None:
        print(groups[name])

lenik answered Oct 12 '22 01:10
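The group-checking idea can also be done with the standard library alone, without looping over candidate names. The sketch below is one possible approach, not taken from either answer: it assumes each alternative contains exactly one capture group, generates the unique names ip1, ip2, ... automatically, and reads the matched one through match.lastgroup. The names TEMPLATES, combined and extract_ip are hypothetical, chosen for illustration.

```python
import re

IP = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

# Message templates; {ip} marks where the address appears.
TEMPLATES = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found',
]

# Give each alternative its own group name (ip1, ip2, ...) so that
# the standard re module accepts the combined pattern.
alternatives = [
    template.format(ip='(?P<ip%d>%s)' % (i, IP))
    for i, template in enumerate(TEMPLATES, start=1)
]
combined = re.compile('|'.join(alternatives))


def extract_ip(line):
    """Return the IP address from the first matching template, or None."""
    match = combined.search(line)
    if match is None:
        return None
    # Only one alternative can match, so lastgroup names its single group.
    return match[match.lastgroup]


print(extract_ip('Call from 2.3.4.5 rejected because extension not found'))
```

In the original loop this would be used as `ip = extract_ip(line)` for each line of sys.stdin, printing ip when it is not None. The shortcut relies on the one-group-per-alternative assumption; with nested or additional groups, checking match.groupdict() for the non-None value is the safer choice.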