I want to iterate through a folder and check whether any file names contain Cyrillic characters; if they do, I want to rename those files to something else.
How could I do this?
Python 3
This function checks whether each character of the passed string is in the Cyrillic block and returns True if the string contains at least one Cyrillic character.
Strings in Python 3 are Unicode by default. The function encodes each character to UTF-8 and checks whether this yields two bytes that fall in the range covering the Cyrillic block (U+0400-U+04FF).
    def isCyrillic(filename):
        for char in filename:
            char_utf8 = char.encode('utf-8')            # encode to utf-8
            # Parentheses replace backslash continuations, which cannot be
            # followed by a comment on the same line.
            if (len(char_utf8) == 2                     # check if we have 2 bytes and if the
                    and 0xd0 <= char_utf8[0] <= 0xd3    # first and second byte point to
                    and 0x80 <= char_utf8[1] <= 0xbf):  # Cyrillic block (unicode U+0400-U+04FF)
                return True
        return False
The same function using ord(), as suggested in a comment:
    def isCyrillicOrd(filename):
        for char in filename:
            if 0x0400 <= ord(char) <= 0x04FF:  # directly checking unicode code point
                return True
        return False
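As a quick sanity check that the two approaches agree, the sketch below compares a per-character version of each test over every encodable code point. The helper names `is_cyrillic_bytes` and `is_cyrillic_ord` are made up for this illustration and are not part of the answer above.

```python
def is_cyrillic_bytes(char):
    """Byte-level test: two UTF-8 bytes in the range D0 80 .. D3 BF."""
    encoded = char.encode('utf-8')
    return (len(encoded) == 2
            and 0xd0 <= encoded[0] <= 0xd3
            and 0x80 <= encoded[1] <= 0xbf)


def is_cyrillic_ord(char):
    """Code-point test: U+0400 .. U+04FF."""
    return 0x0400 <= ord(char) <= 0x04FF


# UTF-8 encodings of the first and last code points of the block.
first = '\u0400'.encode('utf-8')
last = '\u04ff'.encode('utf-8')

# Collect every code point where the two tests disagree. Surrogates
# (U+D800-U+DFFF) are skipped because they cannot be encoded to UTF-8.
disagreements = [
    cp for cp in range(0x110000)
    if not 0xD800 <= cp <= 0xDFFF
    and is_cyrillic_bytes(chr(cp)) != is_cyrillic_ord(chr(cp))
]
```

Here `first` is `b'\xd0\x80'`, `last` is `b'\xd3\xbf'`, and `disagreements` is empty, so the ord() version is a drop-in replacement. Both versions cover only the basic Cyrillic block; Unicode defines further Cyrillic blocks (for example Cyrillic Supplement, U+0500-U+052F) that neither function detects.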
Test Directory
cycont
|---- asciifile.txt
|---- кириллфайл.txt
|---- украї́нська.txt
|---- संस्कृत.txt
Test
    import os
    for (dirpath, dirnames, filenames) in os.walk('G:/cycont'):
        for filename in filenames:
            print(filename, isCyrillic(filename), isCyrillicOrd(filename))
Output
asciifile.txt False False
кириллфайл.txt True True
украї́нська.txt True True
संस्कृत.txt False False
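The question also asks about renaming, which the test above does not cover. A minimal sketch of that step might look like the following. The names `has_cyrillic` and `rename_cyrillic_files`, and the `renamed_<n>` naming scheme, are assumptions made for this example; substitute whatever target names suit you.

```python
import os


def has_cyrillic(name):
    """Return True if any character lies in the Cyrillic block U+0400-U+04FF."""
    return any(0x0400 <= ord(char) <= 0x04FF for char in name)


def rename_cyrillic_files(root, prefix="renamed_"):
    """Rename every file under root whose name contains Cyrillic characters.

    New names are prefix + counter + original extension (a hypothetical
    scheme). Returns a list of (old_path, new_path) pairs.
    """
    renamed = []
    counter = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not has_cyrillic(filename):
                continue
            extension = os.path.splitext(filename)[1]
            if has_cyrillic(extension):
                # Drop the extension if it is itself Cyrillic.
                extension = ""
            # Advance the counter until the target name is unused,
            # so no existing file is overwritten.
            while True:
                counter += 1
                new_path = os.path.join(dirpath, f"{prefix}{counter}{extension}")
                if not os.path.exists(new_path):
                    break
            old_path = os.path.join(dirpath, filename)
            os.rename(old_path, new_path)
            renamed.append((old_path, new_path))
    return renamed
```

Calling `rename_cyrillic_files('G:/cycont')` on the test directory would leave asciifile.txt and संस्कृत.txt untouched and rename the other two files. Which of them receives which counter value depends on the order in which os.walk lists them.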
Python 2:
    # -*- coding: utf-8 -*-
    def check_value(value):
        try:
            value.decode('ascii')
        except UnicodeDecodeError:
            return False
        else:
            return True
Python 3:
A Python 3 'str' object doesn't have a 'decode' attribute, so use 'encode' instead, as follows.
    # -*- coding: utf-8 -*-
    def check_value(value):
        try:
            value.encode('ascii')
        except UnicodeEncodeError:
            return False
        else:
            return True
Then you can gather your file names and pass them through the check_value function. Note that it returns False for any name containing non-ASCII characters, not only Cyrillic ones.
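One possible way to do the gathering step, assuming the files sit directly in a single directory, is sketched below. The helper `non_ascii_names` is a hypothetical name introduced for this example; `check_value` is repeated so the snippet runs on its own.

```python
import os


def check_value(value):
    """Return True if value contains only ASCII characters (Python 3)."""
    try:
        value.encode('ascii')
    except UnicodeEncodeError:
        return False
    else:
        return True


def non_ascii_names(directory):
    """Return the entries of directory whose names check_value rejects."""
    return [name for name in os.listdir(directory) if not check_value(name)]
```

os.listdir does not descend into subfolders; use os.walk instead if nested directories need checking as well.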