Match one of two lookbehinds

Question

I'm trying to populate a column in a pandas DataFrame by extracting the id of a device from a log file. The problem is that the id may be preceded by either of two patterns:

Pattern 1:

(?<=cameraId=\')([a-z0-9-]+)

Pattern 2:

(?<=/live/)([a-z0-9-]+)

Note: a line never contains both patterns.

I use the pandas.Series.str.findall() method, and I want ids matched by either pattern to end up in the same column.

I can successfully achieve the desired outcome as shown in the code below:

import pandas as pd

line_1 = 'INFO:2021-04-19 00:25:10,647:instance_manager.py:MainProcess:1:got event notificationName=\'DETECTION_STARTED\' cameraId=\'ab1c-ab6c-a6f6-a6d6-ab666\' timestamp=\'2021-04-19T00:24:08.192169Z\''

line_2 = 'INFO:2021-04-19 00:25:11,278:instance_manager.py:MainProcess:1:An old record record for the stream rtsp://127.0.1.1:6666/live/a001-a00a-0016-a006-ab606.stream was successfully updated in the DB!'

df = pd.DataFrame(columns=['type', 'ts', 'process', 'subprocess', 'line', 'message'])

line_1_parsed = pd.Series([line_1]).str.extract(r'(?P<type>[^:]+):(?P<ts>.+,\d+):(?P<process>[^:]+):(?P<subprocess>[^:]+):(?P<line>[^:]+):(?P<message>[^$]+)')
line_2_parsed = pd.Series([line_2]).str.extract(r'(?P<type>[^:]+):(?P<ts>.+,\d+):(?P<process>[^:]+):(?P<subprocess>[^:]+):(?P<line>[^:]+):(?P<message>[^$]+)')

# DataFrame.append was removed in pandas 2.0; concatenate the parsed rows instead
df = pd.concat([line_1_parsed, line_2_parsed], ignore_index=True)

df.loc[:, 'cam_id'] = df.loc[:, 'message'].str.findall('(?<=cameraId=\')([a-z0-9-]+)|(?<=/live/)([a-z0-9-]+)')
df

However, the matches are returned as tuples (pattern 1, pattern 2), as shown in the current output:

Current Output:

    type    ts  process     subprocess  line    message     cam_id
0   INFO    2021-04-19 00:25:10,647     instance_manager.py     MainProcess     1   got event notificationName='DETECTION_STARTED'...   [(ab1c-ab6c-a6f6-a6d6-ab666, )]
1   INFO    2021-04-19 00:25:11,278     instance_manager.py     MainProcess     1   An old record record for the stream rtsp://127...   [(, a001-a00a-0016-a006-ab606)]

I understand that this happens because the regex has two capturing groups, so each match reports a value for both groups, but I would like to keep only the group that actually matched.

Sure, I can extract it manually like this:
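The tuple behaviour can be reproduced outside pandas with the standard re module, whose findall follows the same grouping rule. The sketch below is only an illustration: the shortened message string and the variable names are assumptions, not part of the original log.

```python
import re

# Two capturing groups: findall returns one tuple per match, with an
# empty string for the group that did not take part in the match.
two_groups = r"(?<=cameraId=')([a-z0-9-]+)|(?<=/live/)([a-z0-9-]+)"

# No capturing groups: findall returns each whole match as a plain string.
no_groups = r"(?<=cameraId=')[a-z0-9-]+|(?<=/live/)[a-z0-9-]+"

# Shortened, illustrative message (assumption, not a real log line)
message = "got event cameraId='ab1c-ab6c-a6f6-a6d6-ab666' timestamp='...'"

tuples = re.findall(two_groups, message)
strings = re.findall(no_groups, message)

print(tuples)   # list of (group 1, group 2) tuples
print(strings)  # list of plain strings
```

Dropping the capturing parentheses, or reducing the pattern to a single group, is therefore enough to get strings instead of tuples.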

df.loc[:, 'cam_id'] = df.loc[:, 'cam_id'].apply(lambda cam_id_tuple: cam_id_tuple[0][0] if cam_id_tuple[0][0] != '' else cam_id_tuple[0][1])
df

but this is cumbersome and does not extend well if I want to add more patterns.

Desired Output:

    type    ts  process     subprocess  line    message     cam_id
0   INFO    2021-04-19 00:25:10,647     instance_manager.py     MainProcess     1   got event notificationName='DETECTION_STARTED'...   [ab1c-ab6c-a6f6-a6d6-ab666]
1   INFO    2021-04-19 00:25:11,278     instance_manager.py     MainProcess     1   An old record record for the stream rtsp://127...   [a001-a00a-0016-a006-ab606]

Note: the cam_id column contains strings and not tuples.

Thanks in advance.

Shubham Sharma · Accepted Answer

We can use str.extract with a regex pattern that has a single capturing group:

df['message'].str.extract(r'(?:cameraId=\'|/live/)([a-z0-9-]+)', expand=False)

0    ab1c-ab6c-a6f6-a6d6-ab666
1    a001-a00a-0016-a006-ab606
Name: message, dtype: object
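For readers who want to try this without rebuilding the full DataFrame, here is a minimal self-contained sketch. The two messages are shortened versions of the log lines in the question, and the variable names are illustrative assumptions. It also shows str.findall with the same single-group pattern, which yields lists of strings, closer to the bracketed form in the desired output.

```python
import pandas as pd

# Shortened versions of the two log messages from the question
messages = pd.Series([
    "got event notificationName='DETECTION_STARTED' cameraId='ab1c-ab6c-a6f6-a6d6-ab666'",
    "An old record for the stream rtsp://127.0.1.1:6666/live/a001-a00a-0016-a006-ab606.stream was updated",
])

# One non-capturing group for the prefixes, one capturing group for the id
pattern = r"(?:cameraId='|/live/)([a-z0-9-]+)"

# str.extract keeps the first match per row; with a single group and
# expand=False the result is a Series of strings.
first_id = messages.str.extract(pattern, expand=False)

# str.findall keeps every match per row; with a single group each row
# holds a list of strings rather than a list of tuples.
all_ids = messages.str.findall(pattern)

print(first_id)
print(all_ids)
```

Which of the two to use depends on whether a message can contain more than one id; the question states that the two prefixes never appear in the same line, so str.extract is likely sufficient here.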

Regex details:

(?:cameraId=\'|/live/): Non capturing group
- cameraId=\' : First alternative matches the characters cameraId=' literally
- /live/ : Second alternative matches the characters /live/ literally
([a-z0-9-]+) : First capturing group
- [a-z0-9-]+ : Matches any character present in the list [a-z0-9-] one or more times
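Since the question asks for something that extends to more patterns, one possible approach is to build the non-capturing alternation from a list of literal prefixes. This is a sketch under assumptions: build_id_pattern is a hypothetical helper, and the "deviceId=" prefix and the sample strings are made up for illustration.

```python
import re

# Literal prefixes that may precede the id. "deviceId=" is a hypothetical
# third prefix added only to show how the list can grow.
prefixes = ["cameraId='", "/live/", "deviceId="]


def build_id_pattern(prefixes):
    """Hypothetical helper: one non-capturing alternation, one capturing group."""
    # re.escape keeps any regex metacharacters in a prefix literal
    alternation = "|".join(re.escape(prefix) for prefix in prefixes)
    return re.compile(rf"(?:{alternation})([a-z0-9-]+)")


id_pattern = build_id_pattern(prefixes)

# Illustrative inputs (assumptions, not real log lines)
samples = [
    "cameraId='ab1c-ab6c-a6f6-a6d6-ab666' timestamp='...'",
    "rtsp://127.0.1.1:6666/live/a001-a00a-0016-a006-ab606.stream",
    "deviceId=cam-42 connected",
]

ids = [id_pattern.search(sample).group(1) for sample in samples]
print(ids)
```

The resulting pattern string (id_pattern.pattern) still has exactly one capturing group, so it can be passed to df['message'].str.extract(..., expand=False) in the same way as the pattern above.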

Tags: regex, python-3.x, pandas

Michael
