Regex and unicode

Tags:

python

regex

unicode

character-properties

I have a script that parses the filenames of TV episodes (show.name.s01e02.avi, for example), grabs the episode name (from the www.thetvdb.com API), and automatically renames the files into something nicer (Show Name - [01x02].avi).

The script works fine until you try to use it on files that have Unicode show names (something I never really thought about, since all the files I have are English, so nearly all of them fall within [a-zA-Z0-9'\-]).

How can I allow the regular expressions to match accented characters and the like? Currently the regex section of the config looks like this:

config['valid_filename_chars'] = """0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!@£$%^&*()_+=-[]{}"'.,<>`~? """
config['valid_filename_chars_regex'] = re.escape(config['valid_filename_chars'])

config['name_parse'] = [
    # foo_[s01]_[e01]
    re.compile('''^([%s]+?)[ \._\-]\[[Ss]([0-9]+?)\]_\[[Ee]([0-9]+?)\]?[^\\/]*$''' % (config['valid_filename_chars_regex'])),
    # foo.1x09*
    re.compile('''^([%s]+?)[ \._\-]\[?([0-9]+)x([0-9]+)[^\\/]*$''' % (config['valid_filename_chars_regex'])),
    # foo.s01.e01, foo.s01_e01
    re.compile('''^([%s]+?)[ \._\-][Ss]([0-9]+)[\.\- ]?[Ee]([0-9]+)[^\\/]*$''' % (config['valid_filename_chars_regex'])),
    # foo.103*
    re.compile('''^([%s]+)[ \._\-]([0-9]{1})([0-9]{2})[\._ -][^\\/]*$''' % (config['valid_filename_chars_regex'])),
    # foo.0103*
    re.compile('''^([%s]+)[ \._\-]([0-9]{2})([0-9]{2,3})[\._ -][^\\/]*$''' % (config['valid_filename_chars_regex'])),
]

asked Aug 18 '08 09:08

dbr

1 Answer

Use a subrange of [\u0000-\uFFFF] that covers the characters you want to match.
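As a rough sketch of the subrange idea, the snippet below adds one explicit Unicode range to the character class used for the show name. It assumes Python 3.3 or later, where the `re` module understands `\uXXXX` escapes inside a pattern. The range U+00C0 to U+024F (accented Latin letters, plus a couple of symbols such as × and ÷) is only an assumption about which characters are wanted, and the names `NAME_CHARS`, `S_E_PATTERN` and `parse_name` are made up for illustration. Only the `foo.s01e01` style from the question is covered.

```python
import re

# Assumption: accented Latin letters are enough. U+00C0-U+024F spans the
# Latin-1 Supplement letters through Latin Extended-B.
NAME_CHARS = r"0-9a-zA-Z\u00C0-\u024F'\- ._"

# Same shape as the question's "foo.s01.e01, foo.s01_e01" pattern, with the
# explicit range substituted for the hand-written character list.
S_E_PATTERN = re.compile(
    r"^([" + NAME_CHARS + r"]+?)[ ._\-][Ss]([0-9]+)[.\- ]?[Ee]([0-9]+)"
)


def parse_name(filename):
    """Return (show, season, episode) as strings, or None if nothing matches."""
    match = S_E_PATTERN.match(filename)
    return match.groups() if match else None
```

With this pattern, `parse_name("amélie.s01e02.avi")` yields the show name `amélie`, while a name written in a script outside the chosen range (CJK, for instance) is still rejected, so the range has to be widened for each script you care about.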

You can also use the re.UNICODE compile flag. The docs say that if UNICODE is set, \w will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
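A minimal sketch of the `\w` approach follows, assuming Python 3. There, `str` patterns are already Unicode-aware, so passing `re.UNICODE` is redundant but harmless; in Python 2 both the pattern and the filename would need to be `unicode` objects for the flag to help. The names `SHOW_EPISODE` and `parse_episode` are hypothetical, and only the `foo.s01e01` style from the question is shown.

```python
import re

# \w inside the character class stands in for the hand-written list of
# letters and digits; apostrophe, dot, space and hyphen are added explicitly.
SHOW_EPISODE = re.compile(
    r"^([\w'. \-]+?)[ ._\-][Ss]([0-9]+)[.\- ]?[Ee]([0-9]+)",
    re.UNICODE,
)


def parse_episode(filename):
    """Return (show, season, episode) with numeric season/episode, or None."""
    match = SHOW_EPISODE.match(filename)
    if match is None:
        return None
    name, season, episode = match.groups()
    return name, int(season), int(episode)
```

Because `\w` follows the Unicode character database, the same pattern accepts names such as `Amélie` or `東京.物語` without listing any ranges. The trade-off is less control: `\w` also matches digits and the underscore, and it will accept letters from every script.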

See also http://coding.derkeiler.com/Archive/Python/comp.lang.python/2004-05/2560.html.

answered Oct 06 '22 15:10

Mark Cidade
