Python Regex and the Copyright Symbol

edit

Funny story: I was cleaning the text before searching it, and part of my cleaning process removed all of those symbols. So the regex I was using was working just fine all along :) Thank you all for your help anyway!

original

I'm trying to detect the copyright symbol © in a UTF-8 text document using Python and regex. The problem I'm having is that I have no idea how to represent the copyright symbol in a regex.

I'll also be looking for TM and (R) so those would be most appreciated also, though I suspect once I have (C) it'll be simple to determine the others.

I've been valiantly searching for an answer on Google but have yet to find anything that works.

If some bright spark can tell me, it would be most appreciated!

Thanks!

asked Feb 28 '13 00:02

danspants

2 Answers

You could decode your text to Unicode and use a Unicode regex pattern.

These are the three symbols you mentioned:

In [213]: print(u'\N{COPYRIGHT SIGN} \N{TRADE MARK SIGN} \N{REGISTERED SIGN}')
© ™ ®
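As a side note, the named escapes above are only one way to spell these characters. The sketch below assumes Python 3, where every `str` is already Unicode, and checks that the name, the `\u` escape, and the numeric code point all refer to the same character. The helper `describe` is a hypothetical name introduced here for illustration.

```python
import unicodedata

# In Python 3 source, these spellings all produce the same one-character str.
assert '\N{COPYRIGHT SIGN}' == '\u00a9' == chr(0xA9)
assert '\N{REGISTERED SIGN}' == '\u00ae' == chr(0xAE)
assert '\N{TRADE MARK SIGN}' == '\u2122' == chr(0x2122)


def describe(ch):
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return 'U+{:04X} {}'.format(ord(ch), unicodedata.name(ch))


for ch in '\u00a9\u00ae\u2122':
    print(describe(ch))
```

Any of these spellings can be used inside a regex pattern string, since the string literal is resolved before `re` ever sees it.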

Here is a UTF-8 encoded byte string:

In [223]: content = u'\N{TRADE MARK SIGN}'.encode('utf-8')

Here we decode it to Unicode:

In [224]: text = content.decode('utf-8')

This is a regex search for any of the three symbols:

In [225]: re.search(u'(\N{COPYRIGHT SIGN}|\N{TRADE MARK SIGN}|\N{REGISTERED SIGN})', text)
Out[225]: <_sre.SRE_Match at 0x9a1ebe0>
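The decode-then-search steps above can be put together roughly as follows. This is a minimal sketch assuming Python 3 and UTF-8 input; `SYMBOLS` and `find_symbols` are names made up for this example, and a character class is used in place of the alternation, which matches the same three characters.

```python
import re

# One character class covering the three symbols.
SYMBOLS = re.compile(
    '[\N{COPYRIGHT SIGN}\N{TRADE MARK SIGN}\N{REGISTERED SIGN}]')


def find_symbols(raw_bytes, encoding='utf-8'):
    """Decode raw bytes, then return (character offset, symbol) pairs."""
    text = raw_bytes.decode(encoding)
    return [(m.start(), m.group(0)) for m in SYMBOLS.finditer(text)]


# Sample input built in memory; in practice the bytes would come from a file.
raw = 'Acme\N{TRADE MARK SIGN} widgets \N{COPYRIGHT SIGN} 2013'.encode('utf-8')
print(find_symbols(raw))
```

The offsets returned are positions in the decoded text, not byte positions in the original file.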

There are webpages that catalog every Unicode character. But there are well over a hundred thousand assigned Unicode code points, so it is not feasible to find characters by browsing manually.

So I wrote the program below to search for Unicode characters by name.

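import sys
import unicodedata as ud
import re
import argparse
import functools

try:
    unichr          # Python 2 built-in
except NameError:
    unichr = chr    # Python 3: chr() already covers all code points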

__usage__ = '''\
unicode_lookup.py -u '\d'    # Shows all unicode symbols that regex match '\d'
unicode_lookup.py number     # Shows all unicode symbols whose name regex matches 'number'
'''


def lookup(name_pat=None, from_num=0, to_num=0x10ffff, unicode_pattern=None,
           category_pattern=None, ignore_unnamed=True,
           combining=False):
    fmt = u"{symbol} {num} {cat} {bi} {w} {comb} {mir} '{name}'"
    print(fmt.format(
        symbol='Symbol', num='Num', name='NAME',
        cat='Category', bi='Bidirectional', w='Width',
        comb='Combining', mir='Mirrored'))
    for num in range(from_num, to_num + 1):
        s = unichr(num)
        if unicode_pattern and not unicode_pattern.match(s):
            continue
        category = ud.category(s)
        if category_pattern and not category_pattern.match(category):
            continue
        try:
            name = ud.name(s)
            if name_pat and not name_pat.search(name):
                continue
        except ValueError:
            if ignore_unnamed:
                continue
            else:
                name = '?'
        bidirectional = ud.bidirectional(s)
        combining_class = ud.combining(s)
        if combining and not combining_class:
            continue
        mirrored = ud.mirrored(s)
        width = ud.east_asian_width(s)
        data = dict(num=num, symbol=s, name=name,
                    cat=category, bi=bidirectional, w=width,
                    comb=combining_class, mir=mirrored)
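        line = fmt.format(**data)
        # Python 2 needs explicit bytes for non-ASCII output; Python 3 prints str directly.
        print(line.encode('utf-8') if sys.version_info[0] < 3 else line)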


def parse_options():
    parser = argparse.ArgumentParser(
        epilog=__usage__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
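    # nargs='?' makes the name pattern optional, so that the
    # "-u '\d'" invocation shown in the usage text works on its own.
    parser.add_argument('name_pat', nargs='?',
                        type=functools.partial(re.compile, flags=re.IGNORECASE))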
    parser.add_argument('-f', '--from_num', default=0, type=int)
    parser.add_argument('-t', '--to_num', default=0x10ffff, type=int)
    parser.add_argument('-u', '--unicode_pattern',
                        type=functools.partial(re.compile, flags=re.UNICODE))
    parser.add_argument('--category_pattern', type=re.compile)
    parser.add_argument('--show_unnamed', action='store_true')
    parser.add_argument('--combining', action='store_true')
    return parser.parse_args()

if __name__ == '__main__':
    opt = parse_options()
    lookup(name_pat=opt.name_pat, from_num=opt.from_num, to_num=opt.to_num,
           unicode_pattern=opt.unicode_pattern,
           category_pattern=opt.category_pattern,
           ignore_unnamed=not opt.show_unnamed,
           combining=opt.combining)

Running

% unicode_lookup.py "copyright|trade|registered"

yields

Symbol Num Category Bidirectional Width Combining Mirrored 'NAME'
© 169 So ON N 0 0 'COPYRIGHT SIGN'
® 174 So ON A 0 0 'REGISTERED SIGN'
℗ 8471 So ON N 0 0 'SOUND RECORDING COPYRIGHT'
™ 8482 So ON A 0 0 'TRADE MARK SIGN'
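The core of the script is a scan over code points that filters on `unicodedata.name`. A stripped-down sketch of just that step, assuming Python 3, might look like the following. `search_names` is a hypothetical function written for this example, not part of the script above, and the exact set of hits depends on the Unicode version bundled with the interpreter.

```python
import re
import sys
import unicodedata


def search_names(pattern, start=0, stop=sys.maxunicode):
    """Return (code point, character, name) for names matching pattern."""
    name_re = re.compile(pattern, re.IGNORECASE)
    hits = []
    for num in range(start, stop + 1):
        # The second argument is returned for code points without a name.
        name = unicodedata.name(chr(num), '')
        if name and name_re.search(name):
            hits.append((num, chr(num), name))
    return hits


# Restrict the scan to the Basic Multilingual Plane to keep it quick.
for num, ch, name in search_names('copyright|trade|registered', stop=0xFFFF):
    print(num, ch, name)
```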

answered Sep 30 '22 19:09

unutbu

Are you looking for something like this?

import re

text = 'StackOverflow® © 2013 - A great place to code™'
for m in re.finditer(r"[©®™]", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

Output of this is:

13-14: ®
15-16: ©
45-46: ™
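The offsets shown are character offsets, which is what you get when the text is a Unicode string. As a rough illustration of why decoding first matters, the sketch below (assuming Python 3; the variable names are invented for this example) runs the same character class against the raw UTF-8 bytes. There the class lists individual bytes, so each multi-byte symbol is reported as several one-byte matches.

```python
import re

text = 'StackOverflow\u00ae \u00a9 2013 - A great place to code\u2122'

# Matching on str: each symbol is one character, so offsets count characters.
char_hits = [(m.start(), m.group(0))
             for m in re.finditer('[\u00a9\u00ae\u2122]', text)]

# Matching on UTF-8 bytes: the class now contains the individual bytes of
# the encoded symbols, so matches are single bytes, not whole symbols.
raw = text.encode('utf-8')
byte_class = re.compile(b'[' + '\u00a9\u00ae\u2122'.encode('utf-8') + b']')
byte_hits = [(m.start(), m.group(0)) for m in byte_class.finditer(raw)]

print(char_hits)
print(len(char_hits), 'symbol matches vs', len(byte_hits), 'byte matches')
```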

answered Sep 30 '22 17:09

Jason Sperske

Tags: python, regex, text, utf-8