
Re-order copyright with regex

Tags: python, regex

I need to position the year of copyright at the beginning of a string. Here are possible inputs I would have:

(c) 2012 10 DC Comics
2012 DC Comics
10 DC Comics. 2012
10 DC Comics , (c) 2012.
10 DC Comics, Copyright 2012
Warner Bros, 2011
Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.
...etc...

From these inputs, I need to always have the output in the same format -

2012. 10 DC Comics.
2011. Warner Bros.
2011. Stanford and Sons, Ltd. Inc. All Rights Reserved
etc...

How would I do this with a combination of string formatting and regex?

This needs to be cleaned up, but this is what I am currently doing:

### copyright
copyright = value_from_key(sd_wb, 'COPYRIGHT', n).strip()
m = re.search('[0-2][0-9][0-9][0-9]', copyright)
try:
    year = m.group(0)
except AttributeError:
    copyright=''
else:
    copyright = year + ". " + copyright.replace(year,'')
    copyright = copyright.rstrip('.').strip() + '.'

if copyright:
    copyright=copyright.replace('\xc2\xa9 ','').replace('&amp;', '&').replace('(c)','').replace('(C)','').replace('Copyright', '')
    if not copyright.endswith('.'):
        copyright = copyright + '.'
    copyright = copyright.replace('  ', ' ')
David542 asked Mar 12 '12 22:03



2 Answers

This program:

from __future__ import print_function
import re

tests = (
    '(c) 2012 DC Comics',
    'DC Comics. 2012',
    'DC Comics, (c) 2012.',
    'DC Comics, Copyright 2012',
    '(c) 2012 10 DC Comics',
    '10 DC Comics. 2012',
    '10 DC Comics , (c) 2012.',
    '10 DC Comics, Copyright 2012',
    'Warner Bros, 2011',
    'Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.',
)

for line in tests:
    print("<", line)
    output = re.sub(r'''
            (?P<lead> (?: \S .*? \S )?? )
            [\s.,]*
            (?: (?: \( c \) | copyright ) \s+ )?
            (?P<year> (?:19|20)\d\d )
            [\s.,]?
        ''', r"\g<year>. \g<lead>", line, count=1, flags=re.I | re.X)
    print(">", output, "\n")

when run under Python 2.7 or 3.2, produces this output:

< (c) 2012 DC Comics
> 2012. DC Comics 

< DC Comics. 2012
> 2012. DC Comics 

< DC Comics, (c) 2012.
> 2012. DC Comics 

< DC Comics, Copyright 2012
> 2012. DC Comics 

< (c) 2012 10 DC Comics
> 2012. 10 DC Comics 

< 10 DC Comics. 2012
> 2012. 10 DC Comics 

< 10 DC Comics , (c) 2012.
> 2012. 10 DC Comics 

< 10 DC Comics, Copyright 2012
> 2012. 10 DC Comics 

< Warner Bros, 2011
> 2011. Warner Bros 

< Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.
> 2011. Stanford and Sons, Ltd. Inc All Rights Reserved. 

Which appears to be what you were looking for.
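
The output above has no final period, while the question's target format mostly does. As a rough sketch only, the same pattern could be compiled once and wrapped in a helper that also normalizes the ending. The name `move_year_first` and the "exactly one trailing period" rule are assumptions for illustration, not part of the original answer.

```python
import re

# Same pattern as in the program above, compiled once for reuse.
COPYRIGHT_RE = re.compile(r'''
    (?P<lead> (?: \S .*? \S )?? )          # optional text before the year (lazy)
    [\s.,]*                                # separators between that text and the notice
    (?: (?: \( c \) | copyright ) \s+ )?   # optional (c) or Copyright marker
    (?P<year> (?:19|20)\d\d )              # four-digit year, 1900 to 2099
    [\s.,]?                                # at most one trailing separator
''', re.I | re.X)

def move_year_first(text):
    """Hypothetical helper: move the year to the front, end with one period."""
    moved = COPYRIGHT_RE.sub(r'\g<year>. \g<lead>', text, count=1)
    # Assumption: the desired result always ends in exactly one period.
    return moved.rstrip(' .,') + '.'

if __name__ == '__main__':
    print(move_year_first('10 DC Comics , (c) 2012.'))
    print(move_year_first('Warner Bros, 2011'))
```

A string with no 19xx or 20xx year passes through the substitution unchanged and only gains the final period. The period after "Inc" in the last test case is still consumed as a separator, as in the output above.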

tchrist answered Oct 12 '22 23:10



How about an answer that doesn't use regex?

tests = (
    '(c) 2012 DC Comics',
    'DC Comics. 2012',
    'DC Comics, (c) 2012.',
    'DC Comics, Copyright 2012',
    '(c) 2012 10 DC Comics',
    '10 DC Comics. 2012',
    '10 DC Comics , (c) 2012.',
    '10 DC Comics, Copyright 2012',
    'Warner Bros, 2011',
    'Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.',
    )

def reorder_copyright(text):
    words = text.split()
    # Look for a "(c)" or "Copyright" marker; the word after it is taken as the year.
    for i, word in enumerate(words):
        if word.lower() in ('(c)','copyright'):
            year = words[i+1]
            company = ' '.join(words[:i] + words[i+2:])
            break
    else:
        # No marker found: assume the year is the last word.
        year = words[-1]
        company = ' '.join(words[:-1])
    # Trim stray separators from both parts.
    year = year.strip(' ,.')
    company = company.strip(' ,.')
    return "%s. %s." % (year, company)

if __name__ == '__main__':
    for line in tests:
        print(reorder_copyright(line))
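
One caveat: when no marker is present, the function above takes the last word as the year, so an input such as '2012 DC Comics' from the question would come out wrong. A possible variant, still without regex, looks for the year token itself. This is only a sketch: the name `reorder_copyright_by_year` is made up here, and it assumes the first bare four-digit word is the year, which would misfire on a company name that contains one.

```python
MARKERS = ('(c)', 'copyright')

def reorder_copyright_by_year(text):
    """Hypothetical variant: find the year token instead of relying on position."""
    year = None
    rest = []
    for word in text.split():
        token = word.strip(' ,.')
        if token.lower() in MARKERS:
            continue                  # drop "(c)" / "Copyright" markers
        if year is None and len(token) == 4 and token.isdigit():
            year = token              # assumption: first four-digit token is the year
            continue
        rest.append(word)
    if year is None:
        return text                   # no year found: leave the input untouched
    company = ' '.join(rest).strip(' ,.')
    return "%s. %s." % (year, company)

if __name__ == '__main__':
    print(reorder_copyright_by_year('2012 DC Comics'))
    print(reorder_copyright_by_year('10 DC Comics , (c) 2012.'))
```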
Ethan Furman answered Oct 13 '22 00:10
