How can I specify Cyrillic character ranges in a Python 3.2 regex?

Question

Once upon a time, I found this question interesting.

Today I decided to play around with the text of that book.

I want to use the regular expression in this script. When I use the script on Cyrillic text, it wipes out all of the Cyrillic characters, leaving only punctuation and whitespace.

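#!/usr/bin/env python3.2
# coding=UTF-8

import sys, re

# Keep whitespace, word characters, and basic punctuation; strip everything else.
# A raw string keeps the backslashes in \s and \w intact.
regexnl = re.compile(r'[^\s\w.,?!:;-]')

for file in sys.argv[1:]:
    f = open(file)
    fs = f.read()
    f.close()
    # Substitute on the text that was read (fs), not on the file object (f).
    rstuff = regexnl.sub('', fs)
    print(rstuff)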

Something very similar has already been done in this answer.

Basically, I just want to be able to specify the set of characters that are not letters, digits, punctuation, or whitespace.
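For illustration, here is a minimal sketch of such a negated set written with the built-in re module and an explicit Cyrillic range. It assumes that "Cyrillic" means the basic Unicode block U+0400 to U+04FF; the names KEEP_ONLY and strip_unwanted are hypothetical and not part of the original script.

```python
import re

# Assumption: "Cyrillic" means the basic Unicode block U+0400..U+04FF.
# The pattern is a normal (non-raw) string, so Python itself expands the
# \u escapes; the backslash in \s is doubled so it reaches the regex engine.
KEEP_ONLY = re.compile('[^\\s0-9A-Za-z\u0400-\u04FF.,?!:;-]')

def strip_unwanted(text):
    """Remove every character that is not whitespace, an ASCII letter or
    digit, a character in the Cyrillic block, or basic punctuation."""
    return KEEP_ONLY.sub('', text)

sample = 'Привет, мир! «Hello» #42'
cleaned = strip_unwanted(sample)
print(cleaned)  # Привет, мир! Hello 42
```

The guillemets and the # are dropped because they fall outside every range in the set, while the Cyrillic and Latin letters survive.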

huon · Accepted Answer

This doesn't exactly answer your question, but the third-party regex module has much better Unicode support than the built-in re module. For example, regex supports the \p{Cyrillic} property and its negation \P{Cyrillic} (as well as a large number of other Unicode properties). It also handles Unicode case-insensitive matching correctly.
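As a rough sketch of how that could look, the snippet below builds a negated set around \p{Cyrillic} when the regex module is installed. The fallback to re with an explicit U+0400 to U+04FF range is an assumption added here so the example runs either way; it is only an approximation of the Cyrillic script property. The name not_cyrillic is hypothetical.

```python
import re

try:
    import regex  # third-party module, typically installed with: pip install regex
except ImportError:
    regex = None

if regex is not None:
    # Everything that is not Cyrillic, whitespace, or basic punctuation.
    not_cyrillic = regex.compile(r'[^\p{Cyrillic}\s.,?!:;-]')
else:
    # Assumed fallback: approximate Cyrillic with the block U+0400..U+04FF.
    not_cyrillic = re.compile('[^\u0400-\u04FF\\s.,?!:;-]')

text = 'Толстой wrote: Война и мир!'
only_cyrillic = not_cyrillic.sub('', text)
print(only_cyrillic)
```

Here the Latin word is removed while the Cyrillic words, the spaces, and the punctuation are kept.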

Tags: regex, python-3.x, unicode

magnetar
