I am trying to match key-value pairs that appear at the end of (long) strings. The strings look like this (I replaced the "\n" with actual line breaks):
my_str = "lots of blah
key1: val1-words
key2: val2-words
key3: val3-words"
so I expect matches "key1: val1-words", "key2: val2-words" and "key3: val3-words".
I was thinking
re.compile('(?:tag1|tag2|tag3):')
plus some look-ahead assertion stuff would be a solution. I can't get it right though. How do I do this?
Thank you.
/David
Real example string:
my_str = u'ucourt métrage pour kino session volume 18\nThème: O sombres héros\nContraintes: sous titrés\nAuthor: nicoalabdou\nTags: wakatanka productions court métrage kino session humour cantat bertrand noir désir sombres héros mer medine marie trintignant femme droit des femmes nicoalabdou pute soumise\nPosted: 06 June 2009\nRating: 1.3\nVotes: 3'
EDIT:
Based on Mikel's solution I am now using the following:
import re

my_tags = [r'\S+'] # gets all tags
my_tags = ['Tags', 'Author', 'Posted'] # selected tags
regex = re.compile(r'''
\n # all key-value pairs are on separate lines
( # start group to return
(?:{0}): # placeholder for tags to detect; r'\S+' == all
\s # the space between ':' and value
.* # the value
) # end group to return
'''.format('|'.join(my_tags)), re.VERBOSE)
regex.sub('', my_str) # return my_str without matching key-value lines
regex.findall(my_str) # return matched key-value lines
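As a self-contained sketch of the approach above: this assumes Python 3 and uses a shortened, hypothetical stand-in for my_str (not the full real string), so that the results are easy to check by eye. The names found and remaining are only illustrative.

```python
import re

# Shortened, hypothetical sample standing in for the real my_str.
my_str = ('court métrage pour kino session volume 18\n'
          'Thème: O sombres héros\n'
          'Author: nicoalabdou\n'
          'Posted: 06 June 2009\n'
          'Rating: 1.3')

my_tags = ['Tags', 'Author', 'Posted']  # selected tags

regex = re.compile(r'''
    \n            # each key-value pair starts on its own line
    (             # group returned by findall
      (?:{0}):    # one of the selected tags, followed by a colon
      \s          # the space between ':' and the value
      .*          # the value, up to the end of the line
    )
'''.format('|'.join(my_tags)), re.VERBOSE)

found = regex.findall(my_str)      # the matched key-value lines
remaining = regex.sub('', my_str)  # my_str with those lines removed

print(found)
print(remaining)
```

Only the Author and Posted lines are picked up here, because Tags does not occur in the shortened sample and the other keys are not in my_tags.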
The correct regex to use is ^\d+$. Because “start of string” must be matched before the match of \d+, and “end of string” must be matched right after it, the entire string must consist of digits for ^\d+$ to be able to match.
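A minimal sketch of that anchored pattern in Python 3; the helper name is_all_digits is hypothetical. One caveat that applies to Python's re module: $ also matches just before a single trailing newline, so re.fullmatch is the stricter check.

```python
import re

digits_only = re.compile(r'^\d+$')

def is_all_digits(text):
    """Return True when the anchored pattern matches the given text."""
    return digits_only.search(text) is not None

print(is_all_digits('12345'))    # True
print(is_all_digits('123abc'))   # False: 'abc' sits between the digits and the end
print(is_all_digits(''))         # False: \d+ needs at least one digit
print(is_all_digits('123\n'))    # True: $ matches before a trailing newline

# re.fullmatch requires the whole string to match, trailing newline included.
print(re.fullmatch(r'\d+', '123\n'))  # None
```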
End of String or Before Ending Newline: \Z. In .NET, the \Z anchor specifies that a match must occur at the end of the input string, or before \n at the end of the input string. It is identical to the $ anchor, except that \Z ignores the RegexOptions.Multiline option. In Python's re module, by contrast, \Z matches only at the very end of the string.
To match a character that has a special meaning in regex, you need to escape it by prefixing it with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (backslash).
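A short Python 3 sketch of those escapes; the sample text is made up for illustration. For escaping a whole literal string at once, the standard library's re.escape can be used instead of adding the backslashes by hand.

```python
import re

# Hypothetical sample text containing several regex metacharacters.
text = r'f(x) = 1.5 + C:\temp'

dots = re.findall(r'\.', text)         # literal "."
pluses = re.findall(r'\+', text)       # literal "+"
parens = re.findall(r'\(', text)       # literal "("
backslashes = re.findall(r'\\', text)  # literal backslash

# re.escape builds a pattern that matches the given string literally.
literal = re.compile(re.escape('1.5 + C'))
hit = literal.search(text)

# Without escaping, "." is a wildcard and matches any character.
wildcard_hit = re.search(r'1.5', '1x5')

print(dots, pluses, parens, backslashes, hit.group())
```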
Take this regular expression: /^[^abc]/ . This will match any single character at the beginning of a string, except a, b, or c. If you add a * after it – /^[^abc]*/ – the regular expression will continue to add each subsequent character to the result, until it meets either an a , or b , or c .
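The same two patterns in Python 3, as a rough sketch (Python has no /.../ regex literals, so the patterns are passed as raw strings; leading_match is a hypothetical helper):

```python
import re

first_char = re.compile(r'^[^abc]')   # one character that is not a, b or c
prefix = re.compile(r'^[^abc]*')      # as many such characters as possible

def leading_match(pattern, text):
    """Return the text matched at the start of the string, or None."""
    m = pattern.match(text)
    return m.group() if m else None

print(leading_match(first_char, 'xyz'))     # 'x'
print(leading_match(first_char, 'apple'))   # None: the string starts with 'a'
print(leading_match(prefix, 'hello bob'))   # 'hello ' (stops at the first 'b')
print(leading_match(prefix, 'apple'))       # '' (the * allows an empty match)
```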
The negative zero-width lookahead is (?!pattern). It's mentioned part-way down the re module documentation page.
(?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.
So you could use it to match any number of words after a key, but not the next key, using something like (?!\S+:)\S+.
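To see the lookahead on its own, here is a small Python 3 sketch. It first reproduces the example from the docs and then applies (?!\S+:)\S+ to single words; the variable names are just for illustration.

```python
import re

# The example from the re documentation.
isaac = re.compile(r'Isaac (?!Asimov)')
newton = isaac.search('Isaac Newton')   # matches 'Isaac '
asimov = isaac.search('Isaac Asimov')   # no match: 'Asimov' follows

# A word is accepted only if it does not look like a key (word + colon).
word_not_key = re.compile(r'(?!\S+:)\S+')
value_word = word_not_key.match('val1-words')  # accepted as a value word
key_word = word_not_key.match('key2:')         # rejected: it is a key

print(newton.group(), asimov, value_word.group(), key_word)
```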
And the complete code would look like this:
regex = re.compile(r'''
\S+: # a key (any word followed by a colon)
(?:
\s # then a space in between
(?!\S+:)\S+ # then a value word (any word that is not itself a key)
)+ # match multiple values if present
''', re.VERBOSE)
matches = regex.findall(my_str)
Which gives
['key1: val1-words', 'key2: val2-words', 'key3: val3-words']
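A self-contained way to try this, assuming Python 3 and the sample string from the question with its newlines restored. The second string, multi, is a made-up input that shows values made of several words.

```python
import re

my_str = 'lots of blah\nkey1: val1-words\nkey2: val2-words\nkey3: val3-words'

regex = re.compile(r'''
    \S+:              # a key (any word followed by a colon)
    (?:
        \s            # whitespace before each value word
        (?!\S+:)\S+   # a value word, as long as it is not the next key
    )+                # one or more value words
''', re.VERBOSE)

matches = regex.findall(my_str)
print(matches)

# Hypothetical input with multi-word values.
multi = regex.findall('Tags: a b c\nPosted: 06 June 2009')
print(multi)
```

One limitation of this sketch: a value word that itself contains a colon after its first character (for example 10:30) is treated like a key, so such values are not captured.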
If you print the key/values using:
for match in matches:
    print(match)
It will print:
key1: val1-words
key2: val2-words
key3: val3-words
Or using your updated example, it would print:
Thème: O sombres héros
Contraintes: sous titrés
Author: nicoalabdou
Tags: wakatanka productions court métrage kino session humour cantat bertrand noir désir sombres héros mer medine marie trintignant femme droit des femmes nicoalabdou pute soumise
Posted: 06 June 2009
Rating: 1.3
Votes: 3
You could turn each key/value pair into a dictionary using something like this:
pairs = dict([match.split(':', 1) for match in matches])
which would make it easier to look up only the keys (and values) you want.
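Note that split(':', 1) leaves the space after the colon at the start of each value. A small Python 3 sketch that strips it, using a hypothetical list of matches taken from the example output:

```python
# Hypothetical matches, as the regex would return them.
matches = ['Author: nicoalabdou', 'Posted: 06 June 2009', 'Rating: 1.3']

pairs = {}
for match in matches:
    # Split on the first colon only, so colons inside a value are kept.
    key, value = match.split(':', 1)
    pairs[key.strip()] = value.strip()

print(pairs['Posted'])  # 06 June 2009
```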