How do I extract the IP address that occurs 10 times within a one-second time interval?
In the following case:
241.7118.197.10
28.252.8
You could collect the data into a dict where the IP is the key and the value holds the timestamps seen for that IP. Then, every time a timestamp is added, you can check whether that IP has three timestamps within one second:
from datetime import datetime, timedelta
from collections import defaultdict, deque
import re

THRESHOLD = timedelta(seconds=1)
COUNT = 3

res = set()
d = defaultdict(deque)

with open('test.txt') as f:
    for line in f:
        # Capture IP and timestamp
        m = re.match(r'(\S*)[^\[]*\[(\S*)', line)
        if m is None:
            # Skip lines that do not look like a log entry (e.g. blank lines)
            continue
        ip, dt = m.groups()

        # Parse timestamp
        dt = datetime.strptime(dt, '%d/%b/%Y:%H:%M:%S:%f')

        # Remove timestamps from deque if they are older than threshold
        que = d[ip]
        while que and (dt - que[0]) > THRESHOLD:
            que.popleft()

        # Add timestamp, update result if there are COUNT (3) or more items
        que.append(dt)
        if len(que) >= COUNT:
            res.add(ip)

print(res)
Result:
{'28.252.89.140'}
The code above reads the log file line by line. For every line, a regular expression captures the data in two groups: the IP and the timestamp. Then strptime is used to parse the timestamp.
The first group, (\S*), captures everything up to the first whitespace. Then [^\[]* matches everything except [, and \[ matches the opening bracket right before the timestamp. Finally, (\S*) is used again to capture everything up to the next whitespace. See example on regex101.
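To see the two capture groups in isolation, here is a minimal sketch that runs the same pattern and format string on a single line. The sample line is hypothetical (the real log layout is not shown here); it only assumes that the IP comes first and that the bracketed timestamp ends in a fractional-seconds field, as the '%f' in the format string implies.

```python
import re
from datetime import datetime

# Hypothetical log line, made up for illustration; the real file may differ.
line = '28.252.89.140 - - [12/Jan/2018:10:00:00:250000 +0000] "GET / HTTP/1.1" 200'

m = re.match(r'(\S*)[^\[]*\[(\S*)', line)
ip, raw = m.groups()   # group 1: IP, group 2: text after '[' up to the next whitespace

# Same format string as in the answer; %f reads the digits after the last colon
# as microseconds.
ts = datetime.strptime(raw, '%d/%b/%Y:%H:%M:%S:%f')

print(ip)   # 28.252.89.140
print(raw)  # 12/Jan/2018:10:00:00:250000
print(ts)   # 2018-01-12 10:00:00.250000
```

Note that the timezone offset after the space is not captured, because the second group stops at the first whitespace.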
Once we have the IP and the time, they are added to a defaultdict where the IP is the key and the value is a deque of timestamps. Before a new timestamp is added, the old ones are removed if they are more than THRESHOLD older than it. This assumes that the log lines are already sorted by time. After the addition, the length is checked, and if there are COUNT or more items in the queue, the IP is added to the result set.
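The windowing step can also be tried without a log file. The sketch below wraps the same deque logic in a function that takes (ip, timestamp) pairs directly; the name frequent_ips, its parameters, and the sample addresses are made up for illustration and are not part of the original answer.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def frequent_ips(events, count=3, window=timedelta(seconds=1)):
    """Return the IPs that have at least `count` events within `window`.

    `events` is an iterable of (ip, datetime) pairs, assumed sorted by time.
    """
    recent = defaultdict(deque)   # ip -> timestamps still inside the window
    found = set()
    for ip, ts in events:
        que = recent[ip]
        # Drop timestamps that are more than `window` older than the current one
        while que and ts - que[0] > window:
            que.popleft()
        que.append(ts)
        if len(que) >= count:
            found.add(ip)
    return found

# Made-up sample data: 10.0.0.1 has three hits within 0.9 s, 10.0.0.2 is spread out.
base = datetime(2018, 1, 12, 10, 0, 0)
events = [
    ('10.0.0.1', base),
    ('10.0.0.2', base + timedelta(milliseconds=100)),
    ('10.0.0.1', base + timedelta(milliseconds=400)),
    ('10.0.0.1', base + timedelta(milliseconds=900)),
    ('10.0.0.2', base + timedelta(seconds=2)),
    ('10.0.0.2', base + timedelta(seconds=4)),
]

print(frequent_ips(events))  # {'10.0.0.1'}
```

If the input is not guaranteed to be in time order, sorting it first, for example with sorted(events, key=lambda e: e[1]), restores the assumption the deque relies on. Passing count=10 would match the number mentioned in the question.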