How to ensure all string literals are unicode in python

I have a fairly large Python code base to go through. Some of its string literals are byte strings (str) and others are unicode, and mixing the two causes bugs. I am trying to convert everything to unicode, and I was wondering whether there is a tool that can convert all string literals to unicode literals. That is, if it found something like this:

print "result code %d" % result['code']

to:

print u"result code %d" % result[u'code']

If it helps, I use PyCharm (in case there is an extension that does this), but I would be happy to use a command-line tool as well. Hopefully such a tool exists.

mmopy asked Jan 14 '23 22:01


1 Answer

You can use tokenize.generate_tokens to break the source text of Python code into tokens. tokenize also classifies each token by type, so you can pick out the string literals and ignore everything else, including quoted text inside comments.
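To make the classification step concrete, here is a minimal sketch that only lists the string literals in one line of source together with their (row, column) start positions. It assumes it is run under Python 3, which is why it feeds tokenize from io.StringIO; the variable names source and string_literals are illustrative only.

```python
import io
import tokenize

# One line of Python 2 style source; tokenize only lexes, it does not parse,
# so the print statement is not a problem here.
source = 'print "result code %d" % result[\'code\']  # not a \'string\'\n'

# generate_tokens wants a readline callable that returns text lines.
tokens = tokenize.generate_tokens(io.StringIO(source).readline)

# Keep only the tokens that tokenize classified as STRING.
string_literals = [
    (tok.string, tok.start)
    for tok in tokens
    if tok.type == tokenize.STRING
]

for literal, (row, col) in string_literals:
    print(row, col, literal)
```

The quoted word inside the comment is part of a COMMENT token, so it never shows up in the list. The start positions are what make an in-place edit possible.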

It is then not hard to manipulate the tokens, adding 'u' where desired:


import tokenize
import token
import io
import collections

class Token(collections.namedtuple('Token', 'num val start end line')):
    @property
    def name(self):
        return token.tok_name[self.num]

def change_str_to_unicode(text):    
    result = text.splitlines()
    # Insert a dummy line into result so indexing result
    # matches tokenize's 1-based indexing
    result.insert(0, '')
    changes = []
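    # Python 2: text is a byte str, so io.BytesIO works here.
    # (On Python 3, generate_tokens needs text lines: use io.StringIO.)
    for tok in tokenize.generate_tokens(io.BytesIO(text).readline):
        tok = Token(*tok)
        # Skip literals that already start with u/U, and bytes literals
        # (b/B), because "ub" is not a valid string prefix.
        if tok.name == 'STRING' and tok.val[0] not in 'uUbB':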
            changes.append(tok.start)

    for linenum, s in reversed(changes):
        line = result[linenum]
        result[linenum] = line[:s] + 'u' + line[s:]
    return '\n'.join(result[1:])

text = '''print "result code %d" % result['code']
# doesn't touch 'strings' in comments
'handles multilines' + \
'okay'
u'Unicode is not touched'
'''

print(change_str_to_unicode(text))

yields

print u"result code %d" % result[u'code']
# doesn't touch 'strings' in comments
u'handles multilines' + u'okay'
u'Unicode is not touched'
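The code above is written for Python 2. If the rewriting tool itself has to run under Python 3 while the code being rewritten is still Python 2, the same idea might look like the sketch below. This is an assumption-laden variant, not the answer's code: the names needs_unicode_prefix and add_unicode_prefixes are hypothetical, and Python 2-only syntax that the Python 3 tokenizer rejects is not handled. Python 3 does not know the ur prefix, so it reports ur'...' as a NAME followed by a STRING; the sketch special-cases that.

```python
import io
import tokenize


def needs_unicode_prefix(literal):
    """True if the literal has neither a u/U nor a b/B prefix."""
    prefix = literal[:len(literal) - len(literal.lstrip('bBrRuU'))].lower()
    return 'u' not in prefix and 'b' not in prefix


def add_unicode_prefixes(text):
    lines = text.splitlines(True)  # keep line endings intact
    positions = []
    prev = None
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        if tok.type == tokenize.STRING and needs_unicode_prefix(tok.string):
            # Python 3 splits ur'...' into NAME 'ur' + STRING; leave it alone.
            glued_ur = (
                prev is not None
                and prev.type == tokenize.NAME
                and prev.string.lower() == 'ur'
                and prev.end == tok.start
            )
            if not glued_ur:
                positions.append(tok.start)
        prev = tok
    # Work right to left so earlier columns on the same line stay valid.
    for row, col in reversed(positions):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + 'u' + line[col:]
    return ''.join(lines)


print(add_unicode_prefixes('print "result code %d" % result[\'code\']\n'))
```

Splitting with splitlines(True) and joining with the empty string keeps the original line endings, including a trailing newline at the end of the file.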
unutbu answered Jan 17 '23 18:01