How to ensure all string literals are unicode in python

Question

I have a fairly large Python code base to go through. Some of its string literals are byte strings (str) and others are unicode, and the mix causes bugs. I am trying to convert everything to unicode, and I was wondering whether there is a tool that can convert all literals to unicode, i.e. one that would turn something like this:

print "result code %d" % result['code']

to:

print u"result code %d" % result[u'code']

If it helps, I use PyCharm (in case there is an extension that does this), but I would be happy to use a command-line tool as well. Hopefully such a tool exists.

unutbu · Accepted Answer

You can use tokenize.generate_tokens to break the string representation of Python code into tokens. tokenize also classifies the tokens for you, so you can pick out the string literals in the code.
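To make the classification concrete, here is a minimal sketch that pairs each token's type name with its text. It assumes Python 3, where generate_tokens expects a readline that returns text (hence io.StringIO); classify_tokens is a hypothetical helper name used only for illustration, not part of tokenize.

```python
import io
import tokenize

def classify_tokens(source):
    """Return (token type name, token text) pairs for a source string."""
    readline = io.StringIO(source).readline
    return [(tokenize.tok_name[tok[0]], tok[1])
            for tok in tokenize.generate_tokens(readline)]

# The quoted word inside the comment is part of a COMMENT token,
# so only the real literal is reported as a STRING.
pairs = classify_tokens("x = 'abc'  # a 'quoted' comment\n")
string_literals = [text for name, text in pairs if name == 'STRING']
print(string_literals)  # ["'abc'"]
```

Because the tokenizer already separates comments from literals, no regular expression over the raw text is needed to tell them apart.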

It is then not hard to manipulate the tokens, adding 'u' where desired:

import tokenize
import token
import io
import collections

class Token(collections.namedtuple('Token', 'num val start end line')):
    @property
    def name(self):
        return token.tok_name[self.num]

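def change_str_to_unicode(text):
    """Return text with 'u' added before each non-unicode string literal.

    Written for Python 2, where text is a byte string.
    """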
    result = text.splitlines()
    # Insert a dummy line into result so indexing result
    # matches tokenize's 1-based indexing
    result.insert(0, '')
    changes = []
    for tok in tokenize.generate_tokens(io.BytesIO(text).readline):
        tok = Token(*tok)
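        # Skip literals that already have a u/U prefix, and bytes literals
        # (b'...'), where adding 'u' would produce invalid syntax.
        if tok.name == 'STRING' and tok.val[0] not in 'uUbB':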
            changes.append(tok.start)

    for linenum, s in reversed(changes):
        line = result[linenum]
        result[linenum] = line[:s] + 'u' + line[s:]
    return '\n'.join(result[1:])

text = '''print "result code %d" % result['code']
# doesn't touch 'strings' in comments
'handles multilines' + \
'okay'
u'Unicode is not touched'
'''

print(change_str_to_unicode(text))

yields

print u"result code %d" % result[u'code']
# doesn't touch 'strings' in comments
u'handles multilines' + u'okay'
u'Unicode is not touched'
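Since the question asks for something that can be run over a code base, here is a hedged sketch of how the same idea might be applied to files on disk. It is not the answer's code: it assumes the rewriting script itself runs under Python 3 (so it uses io.StringIO and the named fields of the token tuples), and add_u_prefix, rewrite_file and main are hypothetical names. Unlike the function above, it only touches literals that start directly with a quote, so literals with any existing prefix (u, r, b, ...) are left alone. Tokenizing Python 2 source under Python 3 may fail on some Python 2-only constructs; that case is not handled here.

```python
import io
import tokenize

def add_u_prefix(source):
    """Return source with 'u' inserted before each unprefixed string literal."""
    # Split on '\n' only, the same way the tokenizer's readline does,
    # so row numbers from the tokenizer index this list correctly.
    lines = io.StringIO(source).readlines()
    changes = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # A literal with no prefix starts with its opening quote.
        if tok.type == tokenize.STRING and tok.string[0] in '"\'':
            changes.append(tok.start)  # (row, col), rows are 1-based
    # Apply edits from the end backwards so that inserting a character
    # never shifts the column of an edit that is still pending.
    for row, col in reversed(changes):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + 'u' + line[col:]
    return ''.join(lines)

def rewrite_file(path):
    """Rewrite one file in place; return True if its contents changed."""
    with open(path, encoding='utf-8', newline='') as f:
        original = f.read()
    updated = add_u_prefix(original)
    if updated != original:
        with open(path, 'w', encoding='utf-8', newline='') as f:
            f.write(updated)
    return updated != original

def main(paths):
    """Rewrite every file in paths and report what happened."""
    for path in paths:
        print(path, 'changed' if rewrite_file(path) else 'unchanged')
```

Calling main(sys.argv[1:]) from a small script would turn this into a command-line tool. Running it on a copy of the code base, or with version control in place, is advisable because the files are rewritten in place.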

Tags: python, unicode, unicode-literals

mmopy
