I have a string like the following:
<118>date=2010-05-09,time=16:41:27,device_id=FE-2KA3F09000049,log_id=0400147717,log_part=00,type=statistics,subtype=n/a,pri=information,session_id=o49CedRc021772,from="[email protected]",mailer="mta",client_name="example.org,[194.177.17.24]",resolved=OK,to="[email protected]",direction="in",message_length=6832079,virus="",disposition="Accept",classifier="Not,Spam",subject="=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?="
I tried using the csv module and it didn't fit, because I haven't found a way to make it ignore the commas inside quoted values. Pyparsing looked like a better answer, but I haven't found a way to declare all the grammar rules.
Currently I am using my old Perl script to parse it, but I want this written in Python. If you need my Perl snippet I will be glad to provide it.
Any help is appreciated.
It might be better to leverage an existing parser than to use ad-hoc regexes. The urllib2 module already ships two helpers that do exactly this:
parse_http_list(s) : Parse lists as described by RFC 2068 Section 2. In particular, parse comma-separated lists where the elements of the list may include quoted-strings. A quoted-string could contain a comma. A non-quoted string could have quotes in the middle. Neither commas nor quotes count if they are escaped. Only double-quotes count, not single-quotes.
parse_keqv_list(l) : Parse a list of key=value strings where keys are not duplicated.
Example:
>>> pprint.pprint(urllib2.parse_keqv_list(urllib2.parse_http_list(s)))
{'<118>date': '2010-05-09',
'classifier': 'Not,Spam',
'client_name': 'example.org,[194.177.17.24]',
'device_id': 'FE-2KA3F09000049',
'direction': 'in',
'disposition': 'Accept',
'from': '[email protected]',
'log_id': '0400147717',
'log_part': '00',
'mailer': 'mta',
'message_length': '6832079',
'pri': 'information',
'resolved': 'OK',
'session_id': 'o49CedRc021772',
'subject': '=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?=',
'subtype': 'n/a',
'time': '16:41:27',
'to': '[email protected]',
'type': 'statistics',
'virus': ''}
I'm not sure what you're really looking for, but this regex-based approach works:
import re
data = "date=2010-05-09,time=16:41:27,device_id=FE-2KA3F09000049,log_id=0400147717,log_part=00,type=statistics,subtype=n/a,pri=information,session_id=o49CedRc021772,from=\"[email protected]\",mailer=\"mta\",client_name=\"example.org,[194.177.17.24]\",resolved=OK,to=\"[email protected]\",direction=\"in\",message_length=6832079,virus=\"\",disposition=\"Accept\",classifier=\"Not,Spam\",subject=\"=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?=\""
pattern = r"""(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)"""
print(re.findall(pattern, data))
gives you
[('date', '2010-05-09'), ('time', '16:41:27'), ('device_id', 'FE-2KA3F09000049'),
('log_id', '0400147717'), ('log_part', '00'), ('type', 'statistics'),
('subtype', 'n/a'), ('pri', 'information'), ('session_id', 'o49CedRc021772'),
('from', '"[email protected]"'), ('mailer', '"mta"'),
('client_name', '"example.org,[194.177.17.24]"'), ('resolved', 'OK'),
('to', '"[email protected]"'), ('direction', '"in"'),
('message_length', '6832079'), ('virus', '""'), ('disposition', '"Accept"'),
('classifier', '"Not,Spam"'),
('subject', '"=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?="')
]
You might want to clean up the quoted strings afterwards (using mystring.strip("'\"")).
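For instance, a minimal follow-up sketch (reusing the pattern and data variables from the snippet above) that builds a dict and strips the surrounding quotes:
# Build a dict from the (key, value) tuples and drop surrounding quotes.
fields = {key: value.strip('"\'') for key, value in re.findall(pattern, data)}

print(fields['client_name'])  # example.org,[194.177.17.24]
print(fields['classifier'])   # Not,Spam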
EDIT: This regex now also correctly handles escaped quotes inside quoted strings (a="She said \"Hi!\"").
Explanation of the regex:
(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)
(\w+) : Match the identifier and capture it into backreference no. 1
= : Match a =
( : Capture the following into backreference no. 2:
(?: : One of the following:
"(?:\\.|[^\\"])*" : A double quote, followed by zero or more of the following: an escaped character or a non-quote/non-backslash character, followed by another double quote
| : or
'(?:\\.|[^\\'])*' : See above, just for single quotes.
| : or
[^\\,"'] : one character that is neither a backslash, a comma, nor a quote.
)+ : repeat at least once, as many times as possible.
) : end of capturing group no. 2.
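As a quick check of the escaped-quote case mentioned in the edit above (the input here is a made-up sample, not the question's log format):
import re

pattern = r"""(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)"""
sample = r'a="She said \"Hi!\"",b=plain'

# The \\. branch lets backslash-escaped quotes pass through without
# terminating the quoted string.
print(re.findall(pattern, sample))
# [('a', '"She said \\"Hi!\\""'), ('b', 'plain')]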