I'm trying to populate a column in a pandas DataFrame by extracting the id of a device from a log file.
The id may be preceded by one of two separate patterns:
Pattern 1:
(?<=cameraId=\')([a-z0-9-]+)
Pattern 2:
(?<=/live/)([a-z0-9-]+)
Note: a single line can never contain both patterns.
I use the pandas Series.str.findall() method, and I want one column to be populated by whichever of the two patterns matches.
I can successfully achieve the desired outcome as shown in the code below:
import pandas as pd
line_1 = 'INFO:2021-04-19 00:25:10,647:instance_manager.py:MainProcess:1:got event notificationName=\'DETECTION_STARTED\' cameraId=\'ab1c-ab6c-a6f6-a6d6-ab666\' timestamp=\'2021-04-19T00:24:08.192169Z\''
line_2 = 'INFO:2021-04-19 00:25:11,278:instance_manager.py:MainProcess:1:An old record record for the stream rtsp://127.0.1.1:6666/live/a001-a00a-0016-a006-ab606.stream was successfully updated in the DB!'
df = pd.DataFrame(columns=['type', 'ts', 'process', 'subprocess', 'line', 'message'])
line_1_parsed = pd.Series([line_1]).str.extract(r'(?P<type>[^:]+):(?P<ts>.+,\d+):(?P<process>[^:]+):(?P<subprocess>[^:]+):(?P<line>[^:]+):(?P<message>[^$]+)')
line_2_parsed = pd.Series([line_2]).str.extract(r'(?P<type>[^:]+):(?P<ts>.+,\d+):(?P<process>[^:]+):(?P<subprocess>[^:]+):(?P<line>[^:]+):(?P<message>[^$]+)')
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, line_1_parsed, line_2_parsed], ignore_index=True)
df.loc[:, 'cam_id'] = df.loc[:, 'message'].str.findall('(?<=cameraId=\')([a-z0-9-]+)|(?<=/live/)([a-z0-9-]+)')
df
However, the ids are returned as tuples of the form (pattern 1, pattern 2), as shown in the current output:
Current Output:
type ts process subprocess line message cam_id
0 INFO 2021-04-19 00:25:10,647 instance_manager.py MainProcess 1 got event notificationName='DETECTION_STARTED'... [(ab1c-ab6c-a6f6-a6d6-ab666, )]
1 INFO 2021-04-19 00:25:11,278 instance_manager.py MainProcess 1 An old record record for the stream rtsp://127... [(, a001-a00a-0016-a006-ab606)]
I understand that this happens because both patterns are tried and a result is returned for each of them, but I would like to keep only the match from the pattern that succeeded.
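The tuple behaviour can be reproduced with the standard re module alone. This is a minimal sketch; msg_1 and msg_2 are shortened stand-ins for the two log messages, not the full lines. With two capturing groups, findall returns one slot per group for every match, and the group that did not take part is an empty string.

```python
import re

# Two alternatives, each with its own capturing group (same pattern as above).
pattern = r"(?<=cameraId=')([a-z0-9-]+)|(?<=/live/)([a-z0-9-]+)"

# Shortened stand-ins for the two log messages (assumed for illustration).
msg_1 = "got event cameraId='ab1c-ab6c-a6f6-a6d6-ab666' timestamp='...'"
msg_2 = "stream rtsp://127.0.1.1:6666/live/a001-a00a-0016-a006-ab606.stream was updated"

# Each match is a tuple with one entry per capturing group.
matches_1 = re.findall(pattern, msg_1)
matches_2 = re.findall(pattern, msg_2)
print(matches_1)  # [('ab1c-ab6c-a6f6-a6d6-ab666', '')]
print(matches_2)  # [('', 'a001-a00a-0016-a006-ab606')]
```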
Sure, I can do it by manually extracting it in the following manner:
df.loc[:, 'cam_id'] = df.loc[:, 'cam_id'].apply(lambda cam_id_tuple: cam_id_tuple[0][0] if cam_id_tuple[0][0] != '' else cam_id_tuple[0][1])
df
but that is a rather cumbersome solution, and it does not extend well if I want to add more patterns.
Desired Output:
type ts process subprocess line message cam_id
0 INFO 2021-04-19 00:25:10,647 instance_manager.py MainProcess 1 got event notificationName='DETECTION_STARTED'... [ab1c-ab6c-a6f6-a6d6-ab666]
1 INFO 2021-04-19 00:25:11,278 instance_manager.py MainProcess 1 An old record record for the stream rtsp://127... [a001-a00a-0016-a006-ab606]
Note: the cam_id column contains strings and not tuples
Thanks in advance.
We can use str.extract with a regex pattern that has a single capturing group:
df['message'].str.extract(r'(?:cameraId=\'|/live/)([a-z0-9-]+)', expand=False)
0 ab1c-ab6c-a6f6-a6d6-ab666
1 a001-a00a-0016-a006-ab606
Name: message, dtype: object
Regex details:
(?:cameraId=\'|/live/) : Non-capturing group
cameraId=\' : First alternative, matches the characters cameraId=' literally
/live/ : Second alternative, matches the characters /live/ literally
([a-z0-9-]+) : First capturing group
[a-z0-9-]+ : Matches any character present in the list [a-z0-9-] one or more times
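If more prefixes need to be supported later, the non-capturing alternation could be built from a list instead of being written by hand. The following is only a sketch under that assumption: the names prefixes, id_pattern and messages are made up for illustration, and the sample messages are shortened versions of the two log lines from the question.

```python
import re
import pandas as pd

# Literal prefixes that may precede the id; extend this list to support more formats.
prefixes = ["cameraId='", "/live/"]

# Escape each prefix and join them into one non-capturing alternation,
# followed by a single capturing group for the id itself.
id_pattern = "(?:" + "|".join(re.escape(p) for p in prefixes) + ")([a-z0-9-]+)"

# Shortened sample messages (assumed for illustration).
messages = pd.Series([
    "got event notificationName='DETECTION_STARTED' cameraId='ab1c-ab6c-a6f6-a6d6-ab666'",
    "An old record for the stream rtsp://127.0.1.1:6666/live/a001-a00a-0016-a006-ab606.stream was updated",
])

# With one capturing group and expand=False, str.extract returns a Series of strings.
cam_ids = messages.str.extract(id_pattern, expand=False)
print(cam_ids.tolist())
```

Because there is only one capturing group, each row holds a plain string (or NaN when no prefix matches), so no post-processing of tuples is needed.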