Funny story, I was cleaning the text before searching in it and part of my cleaning process was removing all of those symbols. Thus the regex I was using was working just fine :) Thank you all for your help anyway!
I'm trying to detect the copyright symbol © in a UTF-8 text document using Python and regex. The problem I'm having is that I have no idea how to represent the copyright symbol in a regex.
I'll also be looking for TM and (R) so those would be most appreciated also, though I suspect once I have (C) it'll be simple to determine the others.
I've been valiantly searching for an answer on Google but have yet to find anything that works.
If some bright spark can tell me, it would be most appreciated!
Thanks!
The easiest way to type the symbol itself is with an Alt code. On Windows, if your keyboard has a numeric keypad, hold down the ALT key and type 0169 on the keypad: ALT+0169 produces ©.
Python's re.match() checks for a match only at the beginning of the string. If the pattern matches there, it returns a match object; otherwise it returns None, even if the pattern occurs later in the text. To find the first occurrence anywhere in the string, use re.search() instead.
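To make that difference concrete, here is a small sketch (Python 3; the sample string is made up for illustration) where the © is not at the start of the text:

```python
import re

# Hypothetical sample text; the symbol is not at the start of the string.
sample = "Example Corp \u00a9 2013"

# re.match() only tries to match at position 0, so it finds nothing here.
at_start = re.match("\u00a9", sample)

# re.search() scans the whole string and returns the first occurrence.
anywhere = re.search("\u00a9", sample)

print(at_start)           # None
print(anywhere.start())   # index of the copyright sign in the string
```

So for "is this symbol anywhere in my document?", re.search() (or re.finditer()) is probably what you want.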
You could convert your text to unicode and use a unicode regex pattern.
These are the three symbols you mentioned:
In [213]: print(u'\N{COPYRIGHT SIGN} \N{TRADE MARK SIGN} \N{REGISTERED SIGN}')
© ™ ®
Here is a UTF-8 encoded string:
In [223]: content = u'\N{TRADE MARK SIGN}'.encode('utf-8')
Here we convert it to unicode:
In [224]: text = content.decode('utf-8')
This is a regex search for any of the three symbols:
In [225]: re.search(u'(\N{COPYRIGHT SIGN}|\N{TRADE MARK SIGN}|\N{REGISTERED SIGN})', text)
Out[225]: <_sre.SRE_Match at 0x9a1ebe0>
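For what it's worth, the same idea on Python 3 might look like the sketch below. There, str is already Unicode, so only the bytes read from the file need decoding. The sample bytes and the name SYMBOLS are made up for illustration; the code points U+00A9, U+00AE and U+2122 are the three symbols above.

```python
import re

# Code points: U+00A9 (copyright), U+00AE (registered), U+2122 (trade mark).
SYMBOLS = re.compile("[\u00a9\u00ae\u2122]")

# Stand-in for bytes read from a UTF-8 file (hypothetical content).
raw = "Widget\u2122 is \u00a9 2013 Acme\u00ae".encode("utf-8")

# Decode first so the regex works on characters rather than on bytes.
text = raw.decode("utf-8")

found = SYMBOLS.findall(text)
print(found)
```

A character class is enough here because each symbol is a single code point; the alternation with \N{...} names used above works just as well and is easier to read.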
There are webpages which catalog every Unicode character, but there are over a hundred thousand assigned Unicode code points, so it is not feasible to find characters by browsing manually.
So I wrote the program below to search for Unicode characters by name.
# Written for Python 2. On Python 3, at minimum replace unichr() with chr()
# and drop the .encode('utf-8') in the final print.
import sys
import unicodedata as ud
import re
import argparse
import functools
__usage__ = '''\
unicode_lookup.py -u '\d' # Shows all unicode symbols that regex match '\d'
unicode_lookup.py number # Shows all unicode symbols whose name regex matches 'number'
'''


def lookup(name_pat=None, from_num=0, to_num=0x10ffff, unicode_pattern=None,
           category_pattern=None, ignore_unnamed=True,
           combining=False):
    fmt = u"{symbol} {num} {cat} {bi} {w} {comb} {mir} '{name}'"
    # Header row
    print(fmt.format(
        symbol='Symbol', num='Num', name='NAME',
        cat='Category', bi='Bidirectional', w='Width',
        comb='Combining', mir='Mirrored'))
    for num in range(from_num, to_num + 1):
        s = unichr(num)
        # Filter on the character itself, then on its general category
        if unicode_pattern and not unicode_pattern.match(s):
            continue
        category = ud.category(s)
        if category_pattern and not category_pattern.match(category):
            continue
        try:
            name = ud.name(s)
            if name_pat and not name_pat.search(name):
                continue
        except ValueError:
            # ud.name() raises ValueError for code points without a name
            if ignore_unnamed:
                continue
            else:
                name = '?'
        bidirectional = ud.bidirectional(s)
        combining_class = ud.combining(s)
        if combining and not combining_class:
            continue
        mirrored = ud.mirrored(s)
        width = ud.east_asian_width(s)
        data = dict(num=num, symbol=s, name=name,
                    cat=category, bi=bidirectional, w=width,
                    comb=combining_class, mir=mirrored)
        print(fmt.format(**data).encode('utf-8'))


def parse_options():
    parser = argparse.ArgumentParser(
        epilog=__usage__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    # nargs='?' makes the name pattern optional, so the "-u" usage shown
    # in __usage__ works without a name argument
    parser.add_argument('name_pat', nargs='?',
                        type=functools.partial(re.compile, flags=re.IGNORECASE))
    parser.add_argument('-f', '--from_num', default=0, type=int)
    parser.add_argument('-t', '--to_num', default=0x10ffff, type=int)
    parser.add_argument('-u', '--unicode_pattern',
                        type=functools.partial(re.compile, flags=re.UNICODE))
    parser.add_argument('--category_pattern', type=re.compile)
    parser.add_argument('--show_unnamed', action='store_true')
    parser.add_argument('--combining', action='store_true')
    return parser.parse_args()


if __name__ == '__main__':
    opt = parse_options()
    lookup(name_pat=opt.name_pat, from_num=opt.from_num, to_num=opt.to_num,
           unicode_pattern=opt.unicode_pattern,
           category_pattern=opt.category_pattern,
           ignore_unnamed=not opt.show_unnamed,
           combining=opt.combining)
Running
% unicode_lookup.py "copyright|trade|registered"
yields
Symbol Num Category Bidirectional Width Combining Mirrored 'NAME'
© 169 So ON N 0 0 'COPYRIGHT SIGN'
® 174 So ON A 0 0 'REGISTERED SIGN'
℗ 8471 So ON N 0 0 'SOUND RECORDING COPYRIGHT'
™ 8482 So ON A 0 0 'TRADE MARK SIGN'
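Once you know the official names, you can also go the other way and turn a name into the character with unicodedata.lookup(). A minimal Python 3 sketch, using the names from the output above (the variable names are just for illustration):

```python
import unicodedata as ud

# Names taken from the lookup output above.
names = ["COPYRIGHT SIGN", "REGISTERED SIGN", "TRADE MARK SIGN"]

# ud.lookup() maps an official character name to the character itself.
chars = {name: ud.lookup(name) for name in names}

for name, ch in chars.items():
    # Print the character, its decimal code point and its U+XXXX form.
    print(ch, ord(ch), "U+%04X" % ord(ch), name)
```

The decimal values printed here should agree with the Num column in the output above.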
Are you looking for something like this?
import re
text = 'StackOverflow® © 2013 - A great place to code™'
for m in re.finditer(r"[©®™]", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
Output of this is:
13-14: ®
15-16: ©
45-46: ™
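If you also want to know which symbol each match is, one option is to label the matches with unicodedata.name(). This is only a sketch on the same sample text, assuming Python 3 and a UTF-8 source file; the name labelled is made up for illustration:

```python
import re
import unicodedata as ud

text = 'StackOverflow® © 2013 - A great place to code™'

# Pair each match's start index with the character's official Unicode name.
labelled = [(m.start(), ud.name(m.group(0)))
            for m in re.finditer(r"[©®™]", text)]

for pos, name in labelled:
    print(pos, name)
```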