Python Regex and the Copyright Symbol

edit

Funny story: I was cleaning the text before searching it, and part of my cleaning process removed all of those symbols. So the regex I was using had been working just fine all along :) Thank you all for your help anyway!

original

I'm trying to detect the copyright symbol © in a UTF-8 text document using Python and regex. The problem is that I have no idea how to represent the copyright symbol in a regex.

I'll also be looking for TM and (R) so those would be most appreciated also, though I suspect once I have (C) it'll be simple to determine the others.

I've been valiantly searching for an answer on Google but have yet to find anything that works.

If some bright spark can tell me, it would be most appreciated!

Thanks!

danspants asked Feb 28 '13 00:02


2 Answers

You could decode your text to Unicode and use a Unicode regex pattern.

These are the three symbols you mentioned:

In [213]: print(u'\N{COPYRIGHT SIGN} \N{TRADE MARK SIGN} \N{REGISTERED SIGN}')
© ™ ®

Here is a UTF-8 encoded string:

In [223]: content = u'\N{TRADE MARK SIGN}'.encode('utf-8')

Here we decode it to Unicode:

In [224]: text = content.decode('utf-8')

This is a regex search for any of the three symbols:

In [225]: re.search(u'(\N{COPYRIGHT SIGN}|\N{TRADE MARK SIGN}|\N{REGISTERED SIGN})', text)
Out[225]: <_sre.SRE_Match at 0x9a1ebe0>
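
The same search can also be written with code-point escapes instead of character names. The following is only a minimal sketch, assuming Python 3; `find_symbols` and `SYMBOL_RE` are hypothetical names chosen for illustration, not part of the session above:

```python
import re

# \u00a9 = COPYRIGHT SIGN, \u00ae = REGISTERED SIGN, \u2122 = TRADE MARK SIGN
SYMBOL_RE = re.compile('[\u00a9\u00ae\u2122]')


def find_symbols(data):
    """Decode UTF-8 bytes and return (offset, symbol) pairs for each match."""
    text = data.decode('utf-8')
    return [(m.start(), m.group(0)) for m in SYMBOL_RE.finditer(text)]


# Sample input: the UTF-8 bytes of 'Acme™ © 2013'
sample = 'Acme\u2122 \u00a9 2013'.encode('utf-8')
print(find_symbols(sample))  # [(4, '™'), (6, '©')]
```

The offsets are character positions in the decoded text, not byte positions in the original UTF-8 data.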

There are web pages that catalog every Unicode character. But with more than a hundred thousand assigned Unicode code points, browsing them manually to find a character is not feasible.

So, I wrote the program below to search for Unicode characters by name.

import sys
import unicodedata as ud
import re
import argparse
import functools

__usage__ = '''\
unicode_lookup.py -u '\\d'    # Shows all Unicode characters that match the regex '\\d'
unicode_lookup.py number     # Shows all Unicode characters whose name matches the regex 'number'
'''


def lookup(name_pat=None, from_num=0, to_num=0x10ffff, unicode_pattern=None,
           category_pattern=None, ignore_unnamed=True,
           combining=False):
    fmt = u"{symbol} {num} {cat} {bi} {w} {comb} {mir} '{name}'"
    print(fmt.format(
        symbol='Symbol', num='Num', name='NAME',
        cat='Category', bi='Bidirectional', w='Width',
        comb='Combining', mir='Mirrored'))
    for num in range(from_num, to_num + 1):
        s = chr(num)  # Python 3; on Python 2 this was unichr(num)
        if unicode_pattern and not unicode_pattern.match(s):
            continue
        category = ud.category(s)
        if category == 'Cs':
            # Lone surrogates cannot be encoded for printing in Python 3.
            continue
        if category_pattern and not category_pattern.match(category):
            continue
        try:
            name = ud.name(s)
            if name_pat and not name_pat.search(name):
                continue
        except ValueError:
            # ud.name raises ValueError for characters without a name.
            if ignore_unnamed:
                continue
            else:
                name = '?'
        bidirectional = ud.bidirectional(s)
        combining_class = ud.combining(s)
        if combining and not combining_class:
            continue
        mirrored = ud.mirrored(s)
        width = ud.east_asian_width(s)
        data = dict(num=num, symbol=s, name=name,
                    cat=category, bi=bidirectional, w=width,
                    comb=combining_class, mir=mirrored)
        # Python 3 strings are already Unicode, so print them directly.
        print(fmt.format(**data))


def parse_options():
    parser = argparse.ArgumentParser(
        epilog=__usage__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    # nargs='?' makes the name pattern optional, so "-u PATTERN" works alone.
    parser.add_argument('name_pat', nargs='?',
                        type=functools.partial(re.compile, flags=re.IGNORECASE))
    parser.add_argument('-f', '--from_num', default=0, type=int)
    parser.add_argument('-t', '--to_num', default=0x10ffff, type=int)
    parser.add_argument('-u', '--unicode_pattern',
                        type=functools.partial(re.compile, flags=re.UNICODE))
    parser.add_argument('--category_pattern', type=re.compile)
    parser.add_argument('--show_unnamed', action='store_true')
    parser.add_argument('--combining', action='store_true')
    return parser.parse_args()

if __name__ == '__main__':
    opt = parse_options()
    lookup(name_pat=opt.name_pat, from_num=opt.from_num, to_num=opt.to_num,
           unicode_pattern=opt.unicode_pattern,
           category_pattern=opt.category_pattern,
           ignore_unnamed=not opt.show_unnamed,
           combining=opt.combining)

Running

% unicode_lookup.py "copyright|trade|registered"

yields

Symbol Num Category Bidirectional Width Combining Mirrored 'NAME'
© 169 So ON N 0 0 'COPYRIGHT SIGN'
® 174 So ON A 0 0 'REGISTERED SIGN'
℗ 8471 So ON N 0 0 'SOUND RECORDING COPYRIGHT'
™ 8482 So ON A 0 0 'TRADE MARK SIGN'
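
When the exact character name is already known, the standard library can map it straight to the character without scanning the whole range. This is a minimal sketch, assuming Python 3; `describe` is a hypothetical helper name, while `unicodedata.lookup` and `unicodedata.name` are the real stdlib functions:

```python
import unicodedata as ud


def describe(name):
    """Look up a character by its official Unicode name and format a summary."""
    char = ud.lookup(name)  # raises KeyError if no character has this name
    return 'U+{:04X} {} {}'.format(ord(char), char, ud.name(char))


for name in ('COPYRIGHT SIGN', 'REGISTERED SIGN', 'TRADE MARK SIGN'):
    print(describe(name))
```

The code points it reports (U+00A9, U+00AE, U+2122) are the hexadecimal forms of 169, 174 and 8482 from the table above.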
unutbu answered Sep 30 '22 19:09


Are you looking for something like this?

import re

text = 'StackOverflow® © 2013 - A great place to code™'
for m in re.finditer(r"[©®™]", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

Output of this is:

13-14: ®
15-16: ©
45-46: ™
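
If only a tally is needed rather than positions, the same character class can feed a counter. This is a rough sketch, assuming Python 3; `count_symbols` and `SYMBOLS` are hypothetical names used for illustration:

```python
import re
from collections import Counter

SYMBOLS = re.compile(r"[©®™]")


def count_symbols(text):
    """Return a Counter mapping each matched symbol to its number of occurrences."""
    return Counter(SYMBOLS.findall(text))


text = 'StackOverflow® © 2013 - A great place to code™'
print(count_symbols(text))
```

A symbol that never appears is simply absent from the counter, so `count_symbols(text)['℗']` evaluates to 0 rather than raising an error.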
Jason Sperske answered Sep 30 '22 17:09