Can someone explain why, when I create equal unicode strings in Python 2.7, they do not point to the same location in memory, as "normal" strings do?
>>> a1 = 'a'
>>> a2 = 'a'
>>> a1 is a2
True
OK, that was what I expected, but:
>>> ua1 = u'a'
>>> ua2 = u'a'
>>> ua1 is ua2
False
why? how?
A string is stored as a sequence of characters in a contiguous block of memory. It can be indexed from either end: non-negative indices count forward from the start, and negative indices count backward from the end. Strings are immutable in Python, which means that once a string is created, it cannot be changed in place.
Python 2 uses the str type to store bytes and the unicode type to store Unicode code points. String literals are str (that is, bytes) by default, and the default encoding is ASCII. So if an incoming file contains Cyrillic characters, Python 2 may fail when it implicitly decodes that data, because ASCII cannot represent those characters.
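To illustrate the ASCII limitation described above, here is a minimal sketch. The byte string is an example I made up (the UTF-8 encoding of the Cyrillic word "privet"), written with escapes so the snippet does not depend on the source file's encoding. It should behave the same on Python 2.7 and Python 3.

```python
# UTF-8 bytes for a six-letter Cyrillic word (example data, not from the question).
raw = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

# Decoding as ASCII fails, because every byte here is above 0x7F.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the encoding the data was actually written in works.
text = raw.decode('utf-8')

print(ascii_ok)   # False
print(len(raw))   # 12 bytes
print(len(text))  # 6 code points
```

Naming the encoding explicitly, instead of relying on the ASCII default, is what avoids the failure.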
How much memory does a string take in Python? In CPython, each additional ASCII character adds only one byte to the size of a string, but every string object also carries a fixed overhead of roughly 40 bytes. The exact figure depends on the Python version and build.
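You can check the figures above on your own interpreter with sys.getsizeof. This is only a rough sketch: the sample strings are arbitrary, and the numbers it prints vary between Python versions, builds, and implementations, so no expected output is given.

```python
import sys

# A few ASCII-only strings of increasing length (arbitrary sample data).
samples = ['', 'a', 'ab', 'abc']
sizes = [sys.getsizeof(s) for s in samples]

# Size of the empty string object, i.e. the fixed per-string overhead.
overhead = sizes[0]

# Growth in bytes for each character added.
steps = [b - a for a, b in zip(sizes, sizes[1:])]

print(overhead)
print(steps)
```

On CPython the steps are typically one byte each for ASCII text, while the overhead is what differs between versions.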
Unicode is a standard that assigns a unique integer, called a code point, to characters from almost all writing systems. Code points range from 0 to 0x10FFFF. A Unicode string is a sequence of zero or more code points.
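As a small illustration of code points, the sketch below lists them for a three-character unicode string. The sample string is my own choice (a Latin letter, a Cyrillic letter, and the euro sign), and the u'' prefix keeps it valid on both Python 2.7 and Python 3.

```python
# Sample string built from escapes: 'a', Cyrillic 'ya', euro sign.
text = u'a\u044f\u20ac'

# ord() returns the integer code point of a single character.
code_points = [ord(ch) for ch in text]

print([hex(cp) for cp in code_points])  # ['0x61', '0x44f', '0x20ac']

# Every code point lies inside the Unicode range.
in_range = all(0 <= cp <= 0x10FFFF for cp in code_points)
print(in_range)  # True
```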
I think regular strings are interned but unicode strings are not. This simple test seems to support my theory (Python 2.6.6):
>>> intern("string")
'string'
>>> intern(u"unicode string")
Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    intern(u"unicode string")
TypeError: intern() argument 1 must be string, not unicode
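In practice this means you should compare unicode strings with == rather than is. If you really need equal unicode strings to be the same object, one option is to keep your own pool. The sketch below is just that, a sketch: intern_unicode and _unicode_pool are hypothetical names of mine, not anything built into Python.

```python
# Hypothetical manual intern table for unicode strings (not a built-in).
_unicode_pool = {}

def intern_unicode(s):
    """Return one canonical object for every string equal to s."""
    # setdefault stores s on first sight and returns the stored object afterwards.
    return _unicode_pool.setdefault(s, s)

# Build the strings at run time so the compiler cannot merge equal constants.
ua1 = intern_unicode(u''.join([u'uni', u'code']))
ua2 = intern_unicode(u''.join([u'uni', u'code']))

print(ua1 == ua2)  # True: equality compares contents
print(ua1 is ua2)  # True: both names refer to the pooled object
```

Note that the pool keeps every string alive for as long as the dictionary exists, so this only makes sense for a bounded set of values.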