I need to position the year of copyright at the beginning of a string. Here are possible inputs I would have:
(c) 2012 10 DC Comics
2012 DC Comics
10 DC Comics. 2012
10 DC Comics , (c) 2012.
10 DC Comics, Copyright 2012
Warner Bros, 2011
Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.
...etc...
From these inputs, I always need the output in the same format:
2012. 10 DC Comics.
2011. Warner Bros.
2011. Stanford and Sons, Ltd. Inc. All Rights Reserved
etc...
How would I do this with a combination of string formatting and regex?
This needs to be cleaned up, but this is what I am currently doing:
### copyright
copyright = value_from_key(sd_wb, 'COPYRIGHT', n).strip()
m = re.search('[0-2][0-9][0-9][0-9]', copyright)
try:
    year = m.group(0)
except AttributeError:
    copyright=''
else:
    copyright = year + ". " + copyright.replace(year,'')
    copyright = copyright.rstrip('.').strip() + '.'
if copyright:
    copyright=copyright.replace('\xc2\xa9 ','').replace('&amp;', '&').replace('(c)','').replace('(C)','').replace('Copyright', '')
    if not copyright.endswith('.'):
        copyright = copyright + '.'
    copyright = copyright.replace('  ', ' ')
This program:
from __future__ import print_function
import re

tests = (
    '(c) 2012 DC Comics',
    'DC Comics. 2012',
    'DC Comics, (c) 2012.',
    'DC Comics, Copyright 2012',
    '(c) 2012 10 DC Comics',
    '10 DC Comics. 2012',
    '10 DC Comics , (c) 2012.',
    '10 DC Comics, Copyright 2012',
    'Warner Bros, 2011',
    'Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.',
)

for text in tests:
    print("<", text)
    # Move the first year to the front; only the first match is replaced.
    output = re.sub(r'''
            (?P<lead> (?: \S .*? \S )?? )          # text before the year, if any
            [\s.,]*                                # separators before the marker
            (?: (?: \( c \) | copyright ) \s+ )?   # optional (c) or Copyright
            (?P<year> (?:19|20)\d\d )              # four-digit year, 1900-2099
            [\s.,]?                                # one trailing separator
        ''', r"\g<year>. \g<lead>", text, count=1, flags=re.I | re.X)
    print(">", output, "\n")
when run under Python 2.7 or 3.2, produces this output:
< (c) 2012 DC Comics
> 2012. DC Comics

< DC Comics. 2012
> 2012. DC Comics

< DC Comics, (c) 2012.
> 2012. DC Comics

< DC Comics, Copyright 2012
> 2012. DC Comics

< (c) 2012 10 DC Comics
> 2012. 10 DC Comics

< 10 DC Comics. 2012
> 2012. 10 DC Comics

< 10 DC Comics , (c) 2012.
> 2012. 10 DC Comics

< 10 DC Comics, Copyright 2012
> 2012. 10 DC Comics

< Warner Bros, 2011
> 2011. Warner Bros

< Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.
> 2011. Stanford and Sons, Ltd. Inc All Rights Reserved.
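This is close to what you were looking for: the year moves to the front in every case, but a trailing period is present only when the text after the year already ended with one, so append it yourself if you need it.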
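If the final period from the question's desired format matters, the substitution could be wrapped in a small helper. This is only a sketch: `normalize_copyright` and `COPYRIGHT_RE` are names made up here, the pattern is the one from the program above with comments added, and returning an empty string when no year is found is an assumption borrowed from the question's own code.

```python
import re

# Same pattern as in the program above, compiled once (verbose mode allows comments).
COPYRIGHT_RE = re.compile(r'''
    (?P<lead> (?: \S .*? \S )?? )          # text before the year, if any
    [\s.,]*                                # separators before the marker
    (?: (?: \( c \) | copyright ) \s+ )?   # optional (c) or Copyright
    (?P<year> (?:19|20)\d\d )              # four-digit year, 1900-2099
    [\s.,]?                                # one trailing separator
''', re.I | re.X)


def normalize_copyright(text):
    """Hypothetical helper: move the year to the front and end with one period."""
    moved, count = COPYRIGHT_RE.subn(r"\g<year>. \g<lead>", text.strip(), count=1)
    if count == 0:
        return ''  # assumption: no year found means no usable copyright line
    # Drop whatever punctuation is left at the end, then add exactly one period.
    return moved.rstrip(' .,') + '.'


if __name__ == '__main__':
    print(normalize_copyright('10 DC Comics , (c) 2012.'))
    print(normalize_copyright('Warner Bros, 2011'))
```

Note that the period after `Inc` in the Stanford example is still consumed as a separator, exactly as in the output above; keeping it would need a change to the pattern itself.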
How about an answer that doesn't use regex?
tests = (
    '(c) 2012 DC Comics',
    'DC Comics. 2012',
    'DC Comics, (c) 2012.',
    'DC Comics, Copyright 2012',
    '(c) 2012 10 DC Comics',
    '10 DC Comics. 2012',
    '10 DC Comics , (c) 2012.',
    '10 DC Comics, Copyright 2012',
    'Warner Bros, 2011',
    'Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.',
)

def reorder_copyright(text):
    words = text.split()
    for i, word in enumerate(words):
        # A copyright marker is assumed to be followed directly by the year.
        if word.lower() in ('(c)', 'copyright'):
            year = words[i+1]
            company = ' '.join(words[:i] + words[i+2:])
            break
    else:
        # No marker found: assume the year is the last word. A bare leading
        # year such as '2012 DC Comics' is therefore not handled.
        year = words[-1]
        company = ' '.join(words[:-1])
    year = year.strip(' ,.')
    company = company.strip(' ,.')
    return "%s. %s." % (year, company)

if __name__ == '__main__':
    for line in tests:
        print(reorder_copyright(line))
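The function above relies on the year sitting right after a marker or at the very end, so an input like `2012 DC Comics` from the question slips through. One possible way around that, still without regex, is to recognise the year by its shape rather than its position. The sketch below is an assumption-laden variant, not part of the original answer: `reorder_by_year_token` and `MARKERS` are hypothetical names, and it treats the first four-digit token as the year, which would misfire on a company name that itself contains a four-digit number.

```python
MARKERS = ('(c)', 'copyright')


def reorder_by_year_token(text):
    """Hypothetical variant: find the year by its shape, not its position."""
    year = None
    rest = []
    for word in text.split():
        token = word.strip(' ,.')
        if year is None and len(token) == 4 and token.isdigit():
            year = token          # first four-digit token is taken as the year
        elif token and token.lower() not in MARKERS:
            rest.append(word)     # keep other words with their punctuation
        # markers and punctuation-only words (such as a lone ',') are dropped
    if year is None:
        return ''                 # assumption: no year means nothing to output
    company = ' '.join(rest).strip(' ,.')
    return "%s. %s." % (year, company)


if __name__ == '__main__':
    print(reorder_by_year_token('2012 DC Comics'))
    print(reorder_by_year_token('10 DC Comics , (c) 2012.'))
```

Because words are kept with their own punctuation, the period after `Inc.` in the Stanford example survives here.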