Handle Turkish uppercase and lowercase correctly, need to modify/override built-in functions?

I am working with multilingual text data, including Russian (written in the Cyrillic alphabet) and Turkish. I have to compare the words in two files, my_file and check_file: every word of my_file that also occurs in check_file is written to an output file, together with the meta-information about that word from both input files.

Some words are lowercased while others are capitalised, so I have to lowercase all of them before comparing. I use Python 3.6.5, and since Python 3 strings are Unicode by default, lowercasing and later re-capitalising work correctly for Cyrillic. For Turkish, however, some letters are not handled correctly. Uppercase 'İ' should correspond to lowercase 'i', uppercase 'I' to lowercase 'ı', and lowercase 'i' to uppercase 'İ', which is not what I get when I type the following in the console:

>>> print('İ'.lower())
i̇  # somewhat not rendered correctly, corresponds to unicode 'i\u0307'
>>> print('I'.lower())
i
>>> print('i'.upper())
I
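
The console output above is easier to read when the code points are spelled out. A minimal sketch, assuming CPython 3's default, locale-independent case mappings (the helper name describe is made up for illustration):

```python
import unicodedata


def describe(text):
    """Return the Unicode name of every character in text (hypothetical helper)."""
    return [unicodedata.name(ch) for ch in text]


# 'İ' (U+0130) lowercases to two code points: 'i' plus a combining dot above.
lowered = '\u0130'.lower()
print(len(lowered), describe(lowered))

# Plain 'I' lowercases to dotted 'i', and 'i' uppercases to dotless 'I',
# which is the Turkish-unaware default behaviour.
print(describe('I'.lower()))
print(describe('i'.upper()))
```

The two-character result for 'İ' is why the lowercase form looks odd in some terminals.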

This is what I am doing (simplified sample code):

# usage: python <script>.py my_file check_file language

import sys

language = sys.argv[3]

# code to get the files as lists

my_file_list = [['ıspanak', 'N'], ['ısır', 'N'], ['acık', 'V']]
check_file_list = [['109', 'Ispanak', 'food_drink'], ['470', 'Isır', 'action_words'], ['409', 'Acık', 'action_words']]

# get the lists as dict
my_dict = {}
check_dict = {}

for l in my_file_list:
    word = l[0].lower()
    pos = l[1]
    my_dict[word] = pos

for l in check_file_list:
    word_id = l[0]
    word = l[1].lower()
    word_cat = l[2]
    check_dict[word] = [word_id, word_cat]

# compare the two dicts
for word, pos in my_dict.items():
    if word in check_dict:
        word_id = check_dict[word][0]
        word_cat = check_dict[word][1]
        print(word, pos, word_id, word_cat)

This prints only one result, although all three words should match:

acık V 409 action_words

What I've done so far based on this question:

Read the accepted answer, which proposes using PyICU, but I want my code to be usable without people having to install extra packages, so I didn't implement it.
Tried import locale and locale.setlocale(locale.LC_ALL, 'tr_TR.UTF-8') as mentioned in the question, but it didn't change anything.

Implemented two functions, turkish_lower(self) and turkish_upper(self), for the three problematic letters as described in the second answer, which seems to be the only solution:

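import re

def turkish_lower(self):
    # replace the two Turkish-specific capitals before the generic lower()
    self = re.sub(r'İ', 'i', self)
    self = re.sub(r'I', 'ı', self)
    self = self.lower()
    return self

def turkish_upper(self):
    # replace dotted 'i' before the generic upper(); 'ı' already maps to 'I'
    self = re.sub(r'i', 'İ', self)
    self = self.upper()
    return self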
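
Since each substitution swaps a single character, the same idea can be written without regular expressions. The following is only a sketch of a variant, not the code from the linked answer; the table names _TR_LOWER and _TR_UPPER are made up, and the tables cover nothing beyond the dotted and dotless i pairs:

```python
# Translation tables for the Turkish-specific pairs only (hypothetical names).
_TR_LOWER = str.maketrans({'\u0130': 'i', 'I': '\u0131'})  # İ -> i, I -> ı
_TR_UPPER = str.maketrans({'i': '\u0130', '\u0131': 'I'})  # i -> İ, ı -> I


def turkish_lower(text):
    """Lowercase text using Turkish rules for the letters I and İ."""
    return text.translate(_TR_LOWER).lower()


def turkish_upper(text):
    """Uppercase text using Turkish rules for the letters i and ı."""
    return text.translate(_TR_UPPER).upper()


print(turkish_lower('Ispanak'))
print(turkish_upper('isır'))
```

Each table is built once at import time, so every call makes one pass for the special letters and then applies the built-in mapping to the rest.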

But how can I use these two functions without having to check if language == 'Turkish' every time? Should I override the built-in functions lower() and upper()? If so, what is the Pythonic way of doing it? Should I implement classes for the various languages I'm working with and override the built-in functions inside the class for Turkish?


asked May 02 '18 12:05

Fable

1 Answer

You can create a simple "language aware" string that subclasses str and will do the proper uppercasing and lowercasing, for example:

class LanguageAwareStr(str):
    lang = None


class RussianStr(LanguageAwareStr):
    lang = 'ru'


class TurkishStr(LanguageAwareStr):
    lang = 'tr'

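    # maps the Turkish-specific uppercase letters to their lowercase forms
    _case_lookup_upper = {'İ': 'i', 'I': 'ı'}
    # inverse mapping: lowercase letters to their Turkish uppercase forms
    _case_lookup_lower = {v: k for (k, v) in _case_lookup_upper.items()}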

    # here we override the lower() and upper() methods
    def lower(self):
        chars = [self._case_lookup_upper.get(c, c) for c in self]
        result = ''.join(chars).lower()
        cls = type(self)  # so we return a TurkishStr result
        return cls(result)

    def upper(self):
        chars = [self._case_lookup_lower.get(c, c) for c in self]
        result = ''.join(chars).upper()
        cls = type(self)  # so we return a TurkishStr result
        return cls(result)
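
One thing to keep in mind with this approach: only lower() and upper() are overridden, so every other str method still uses the built-in behaviour and returns a plain str. The sketch below uses a condensed copy of the class (with differently named lookup tables) so that it runs on its own:

```python
class TurkishStr(str):
    """Condensed stand-in for the class above, repeated so the snippet runs alone."""
    _to_lower = {'\u0130': 'i', 'I': '\u0131'}  # İ -> i, I -> ı
    _to_upper = {'i': '\u0130', '\u0131': 'I'}  # i -> İ, ı -> I

    def lower(self):
        return type(self)(''.join(self._to_lower.get(c, c) for c in self).lower())

    def upper(self):
        return type(self)(''.join(self._to_upper.get(c, c) for c in self).upper())


word = TurkishStr('ISIR')
lowered = word.lower()

# The overridden method keeps the subclass and applies the Turkish mapping.
print(type(lowered).__name__, lowered)

# Slicing and inherited methods fall back to plain str and the default rules.
print(type(lowered[:2]).__name__)
print(word.capitalize())
```

If methods such as capitalize() or title() are needed, they would have to be overridden in the same way.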

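Then, when you read a string and know what language it is in, you wrap it in the proper LanguageAwareStr subclass and use it like a regular string:

# assumes the classes above are saved in a module named langaware.py
from langaware import RussianStr, TurkishStr

LangStr = str  # fallback so that LangStr is always defined
if language == 'turkish':
    LangStr = TurkishStr  # can also create a dict to look up the correct class
elif language == 'russian':
    LangStr = RussianStr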
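
The dictionary lookup mentioned in the comment could look roughly like this. It is a sketch under assumptions: the registry LANG_CLASSES, the helper str_class_for and the language keys are made-up names, and the classes are reduced to stand-ins so that the snippet runs on its own:

```python
class LanguageAwareStr(str):
    lang = None


class RussianStr(LanguageAwareStr):
    lang = 'ru'


class TurkishStr(LanguageAwareStr):
    lang = 'tr'  # lower()/upper() overrides omitted in this stand-in


# Hypothetical registry: language name as given on the command line -> class.
LANG_CLASSES = {
    'russian': RussianStr,
    'turkish': TurkishStr,
}


def str_class_for(language):
    """Return the string class for language, falling back to plain str."""
    return LANG_CLASSES.get(language.strip().lower(), str)


LangStr = str_class_for('Turkish')
print(LangStr.__name__)
```

Falling back to str means that languages without special casing rules need no class of their own.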

Each string you read is then wrapped in a call to LangStr():

for l in my_file_list:
    word = LangStr(l[0]).lower()
    pos = l[1]
    my_dict[word] = pos

for l in check_file_list:
    word_id = l[0]
    word = LangStr(l[1]).lower()
    word_cat = l[2]
    check_dict[word] = [word_id, word_cat]
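
Put together with the sample data from the question, the comparison then finds all three words. This is a self-contained sketch: the class is a condensed copy of the one above with differently named lookup tables, and the dict comprehensions are just a shorter way of writing the two loops:

```python
class TurkishStr(str):
    """Condensed stand-in for the answer's TurkishStr."""
    _to_lower = {'\u0130': 'i', 'I': '\u0131'}  # İ -> i, I -> ı
    _to_upper = {'i': '\u0130', '\u0131': 'I'}  # i -> İ, ı -> I

    def lower(self):
        return type(self)(''.join(self._to_lower.get(c, c) for c in self).lower())

    def upper(self):
        return type(self)(''.join(self._to_upper.get(c, c) for c in self).upper())


my_file_list = [['ıspanak', 'N'], ['ısır', 'N'], ['acık', 'V']]
check_file_list = [['109', 'Ispanak', 'food_drink'],
                   ['470', 'Isır', 'action_words'],
                   ['409', 'Acık', 'action_words']]

LangStr = TurkishStr  # chosen from the language argument in the real script

# Keys are lowercased with Turkish rules; str subclasses hash like plain str.
my_dict = {LangStr(word).lower(): pos for word, pos in my_file_list}
check_dict = {LangStr(word).lower(): (word_id, cat)
              for word_id, word, cat in check_file_list}

matches = [(word, pos) + check_dict[word]
           for word, pos in my_dict.items() if word in check_dict]
for match in matches:
    print(*match)
```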

answered Sep 22 '22 11:09

sagittarian

Tags: python, python-3.x, built-in, turkish, cyrillic
