Regex: How to match sequence of key-value pairs at end of string

Tags: python, regex, key-value

I am trying to match key-value pairs that appear at the end of (long) strings. The strings look like (I replaced the "\n")

my_str = "lots of blah
          key1: val1-words
          key2: val2-words
          key3: val3-words"

so I expect matches "key1: val1-words", "key2: val2-words" and "key3: val3-words".

The set of possible key names is known.
Not all possible keys appear in every string.
At least two keys appear in every string (if that makes it easier to match).
val-words can be several words.
key-value pairs should only be matched at the end of string.
I am using Python re module.

I was thinking

re.compile('(?:tag1|tag2|tag3):')

plus some look-ahead assertion stuff would be a solution. I can't get it right though. How do I do this?

Thank you.

/David

Real example string:

my_str = u'ucourt métrage pour kino session volume 18\nThème: O sombres héros\nContraintes: sous titrés\nAuthor: nicoalabdou\nTags: wakatanka productions court métrage kino session humour cantat bertrand noir désir sombres héros mer medine marie trintignant femme droit des femmes nicoalabdou pute soumise\nPosted: 06 June 2009\nRating: 1.3\nVotes: 3'

EDIT:

Based on Mikel's solution I am now using the following:


import re

my_tags = [r'\S+'] # gets all tags
my_tags = ['Tags','Author','Posted'] # selected tags
regex = re.compile(r'''
    \n                     # all key-value pairs are on separate lines
    (                      # start group to return
       (?:{0}):            # placeholder for tags to detect; r'\S+' == all
        \s                 # the space between ':' and value
       .*                  # the value
    )                      # end group to return
    '''.format('|'.join(my_tags)), re.VERBOSE)

regex.sub('',my_str) # return my_str without matching key-value lines
regex.findall(my_str) # return matched key-value lines

asked Mar 16 '11 10:03

OG Dude

1 Answer

The negative zero-width lookahead is (?!pattern).

It's mentioned part-way down the re module documentation page.

(?!...)

Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.

So you could use it to match any number of words after a key, but not a key using something like (?!\S+:)\S+.
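To see what that lookahead accepts and rejects, here is a small sketch that tests it against individual words. The names VALUE_WORD and is_value_word and the sample string are made up for illustration; they are not part of the question.

```python
import re

# A word counts as a value word when no colon can be reached by
# consuming non-whitespace characters from its start.
VALUE_WORD = re.compile(r'(?!\S+:)\S+')

def is_value_word(word):
    # re.match anchors at the start of the word, so the lookahead
    # is evaluated exactly once, at position 0.
    return VALUE_WORD.match(word) is not None

words = "blah key1: val1-words key2: 12:30".split()
value_words = [w for w in words if is_value_word(w)]
print(value_words)  # ['blah', 'val1-words']
```

Note that "12:30" is rejected as well, because the lookahead cannot tell a key from any other word with a colon after its first character. Values such as times or URLs would need a stricter key pattern, for example an alternation of the known key names.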

And the complete code would look like this:

import re

regex = re.compile(r'''
    \S+:                  # a key (any word followed by a colon)
    (?:
        \s                # then a single whitespace character
        (?!\S+:)\S+       # then a value word (any word that is not itself a key)
    )+                    # match multiple value words if present
    ''', re.VERBOSE)

matches = regex.findall(my_str)

Which gives

['key1: val1-words', 'key2: val2-words', 'key3: val3-words']

If you print the key/values using:

for match in matches:
    print(match)

It will print:

key1: val1-words
key2: val2-words
key3: val3-words

Or using your updated example, it would print:

Thème: O sombres héros
Contraintes: sous titrés
Author: nicoalabdou
Tags: wakatanka productions court métrage kino session humour cantat bertrand noir désir sombres héros mer medine marie trintignant femme droit des femmes nicoalabdou pute soumise
Posted: 06 June 2009
Rating: 1.3
Votes: 3
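The pattern above matches key/value pairs wherever they occur, whereas the question asks for pairs only at the end of the string, using a known set of keys. One possible way to enforce that is sketched below. It assumes each pair sits on its own line, as in the real example; the function name trailing_pairs and the sample text are hypothetical.

```python
import re

def trailing_pairs(text, keys):
    """Return the run of 'key: value' lines that ends the string."""
    # Escape the key names so they are matched literally.
    key_alt = '|'.join(re.escape(key) for key in keys)
    # One pair: a known key, a colon, then the rest of the line.
    line = r'(?:{0}):[^\n]*'.format(key_alt)
    # One or more such lines, each preceded by a newline, reaching
    # the very end of the string (\Z).
    block = re.search(r'(?:\n{0})+\Z'.format(line), text)
    if block is None:
        return []
    # The block starts with '\n', so the first split item is empty.
    return block.group(0).split('\n')[1:]

sample = ('intro text\nTags: mentioned early\nmore text\n'
          'Author: nicoalabdou\nTags: kino session\nPosted: 06 June 2009')
print(trailing_pairs(sample, ['Author', 'Tags', 'Posted']))
# ['Author: nicoalabdou', 'Tags: kino session', 'Posted: 06 June 2009']
```

The early "Tags: mentioned early" line is skipped because the line after it is not a pair, so that run never reaches the end of the string. The key list has to include every key that can appear in the trailing block, otherwise nothing is returned.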

You could turn each key/value pair into a dictionary using something like this:

pairs = dict([match.split(':', 1) for match in matches])

which would make it easier to look up only the keys (and values) you want.
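Splitting on ':' alone leaves the space after the colon at the start of each value. A slightly longer sketch that strips it is shown here. It uses a compact form of the same pattern; PAIR_RE and pairs_to_dict are names made up for this example.

```python
import re

# Compact form of the key/value pattern: a key, then one or more
# value words that are not themselves keys.
PAIR_RE = re.compile(r'\S+:(?:\s(?!\S+:)\S+)+')

def pairs_to_dict(text):
    result = {}
    for match in PAIR_RE.findall(text):
        # Split on the first colon only, so the value stays in one piece.
        key, value = match.split(':', 1)
        result[key] = value.strip()
    return result

sample = "lots of blah key1: val1-words key2: val2-words key3: val3-words"
pairs = pairs_to_dict(sample)
print(pairs.get('key2'))  # val2-words
```

If a key appears more than once, later values overwrite earlier ones in this sketch.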

More info:

Python re module documentation
Python Regular Expression HOWTO
Perl Regular Expression Reference "perlreref"

answered Oct 14 '22 08:10

Mikel
