
How can I specify Cyrillic character ranges in a Python 3.2 regex?

Once upon a time, I found this question interesting.

Today I decided to play around with the text of that book.

I want to use the regular expression in this script. When I use the script on Cyrillic text, it wipes out all of the Cyrillic characters, leaving only punctuation and whitespace.

#!/usr/bin/env python3.2
# coding=UTF-8

import sys, re

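# Strip everything except whitespace, word characters and basic punctuation.
regexnl = re.compile(r'[^\s\w.,?!:;-]')

for file in sys.argv[1:]:
    # Assumes the input files are UTF-8 encoded.
    with open(file, encoding='utf-8') as f:
        fs = f.read()
    # Substitute on the text that was read, not on the file object.
    rstuff = regexnl.sub('', fs)
    print(rstuff)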

Something very similar has already been done in this answer.

Basically, I just want to be able to specify the set of characters that are not letters, digits, punctuation, or whitespace.

magnetar asked Jun 11 '12 13:06



1 Answer

This doesn't exactly answer your question, but the third-party regex module has much better Unicode support than the built-in re module. For example, regex supports the \p{Cyrillic} property and its negation \P{Cyrillic} (as well as a large number of other Unicode properties). It also handles Unicode case-insensitive matching correctly.
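If installing regex is not an option, the built-in re module can approximate \p{Cyrillic} with an explicit code point range. The following is only a minimal sketch: it assumes "Cyrillic" means the basic Cyrillic block (U+0400 to U+04FF) plus the Cyrillic Supplement (U+0500 to U+052F), and the names CYRILLIC, NOT_ALLOWED and keep_cyrillic are made up for illustration.

```python
import re

# Assumption: "Cyrillic" is limited to U+0400-U+052F (Cyrillic plus
# Cyrillic Supplement); other Cyrillic blocks are not covered.
# A non-raw string is used so Python itself expands the \u escapes.
CYRILLIC = '\u0400-\u052F'

# Match any character that is NOT Cyrillic, whitespace, or one of the
# punctuation marks from the question's script.
NOT_ALLOWED = re.compile('[^' + CYRILLIC + r'\s.,?!:;-]')


def keep_cyrillic(text):
    """Return text with every disallowed character removed."""
    return NOT_ALLOWED.sub('', text)


print(keep_cyrillic('Привет, world! Как дела?'))
# prints: Привет, ! Как дела?
```

With the regex module, the same character class could presumably be written with \p{Cyrillic} in place of the explicit range, which would also cover the Cyrillic blocks this sketch leaves out.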

huon answered Oct 19 '22 09:10
