This is more of an 'interesting' phenomenon I encountered in a Python module that I'm trying to understand than a request for help (though a solution would also be useful).
>>> import fuzzy
>>> s = fuzzy.Soundex(4)
>>> a = "apple"
>>> b = a
>>> sdx_a = s(a)
>>> sdx_a
'A140'
>>> a
'APPLE'
>>> b
'APPLE'
Yeah, so the fuzzy module totally violates the immutability of strings in Python. Is it able to do this because it is a C-extension? And does this constitute an error in CPython as well as the module, or even a security risk?
Also, can anyone think of a way to get around this behaviour? I would like to be able to keep the original capitalisation of the string.
Cheers,
Alex
It violates the rules of how ID values and += are supposed to work - the ID values produced with the optimization in place would be not only impossible, but prohibited, with the unoptimized semantics - but the developers care more about people who would see bad concatenation performance and assume Python sucks.
In Python, strings are immutable, which means a string value cannot be updated in place. We can verify this by trying to assign to part of a string, which raises an error. We can further verify it by checking the identity of the string (its memory address in CPython, as reported by id()) before and after an apparent update.
In Python, a string is immutable. You cannot overwrite the values of immutable objects. However, you can assign the variable again. It's not modifying the string object; it's creating a new string object.
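As a small illustration of the difference between mutating and rebinding, the sketch below tries an in-place assignment and then rebinds the name. The variable names are made up for the example.

```python
s = "apple"
t = s  # second reference to the same string object

# Item assignment is rejected because str is immutable.
try:
    s[0] = "A"
    mutated = True
except TypeError:
    mutated = False

# Rebinding s points the name at a new object; t still refers to the old one.
s = s.upper()

print(mutated)   # False
print(s, t)      # APPLE apple
print(s is t)    # False
```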
The strings in Python are immutable and support the buffer interface. It could be efficient to return not new strings, but buffers pointing to parts of the old string, when using slices or the .split() method. However, a new string object is constructed each time.
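To make the copy-versus-view distinction concrete, here is a rough sketch. It assumes Python 3, where the buffer interface is exposed by bytes rather than str, so the zero-copy part uses a bytes object and memoryview.

```python
text = "hello world"
part = text[:5]          # slicing a str builds a new str object
print(part is text)      # False

data = b"hello world"
view = memoryview(data)  # buffer view over the bytes object, no copy
head = view[:5]          # slicing the view still refers to the same buffer
print(bytes(head))       # b'hello'
print(head.obj is data)  # True
```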
This bug was resolved back in February; update your version.
To answer your question, yes, there are several ways to modify immutable types at the C level. The security implications are unknown, and possibly even unknowable, at this point.
I don't have the fuzzy module available to test right now, but the following creates a string with a new identity:
>>> a = "hello"
>>> b = ''.join(a)
>>> b
'hello'
>>> id(a), id(b)
(182894286096, 182894559280)
I don't know much about CPython, but it looks like in fuzzy.c it declares char *cs = s, where s is the input to __call__. It then mutates cs[i], which will obviously mutate s[i] and therefore the original string. This is definitely a bug with Fuzzy and you should file it on the Bitbucket tracker. As Greg's answer said, using ''.join(a) will create a new copy.
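One way to apply that workaround without sprinkling ''.join() everywhere is a small wrapper. This is only a sketch: call_with_copy is a hypothetical helper, not part of fuzzy, and it assumes the callable accepts a plain string as in the question.

```python
def call_with_copy(func, text):
    """Call func on a fresh copy of text so the caller's string is left alone.

    Hypothetical helper for illustration; not part of the fuzzy module.
    """
    # In CPython, ''.join(text) builds a new str object for strings of
    # two or more characters, so func never sees the caller's object.
    copy = "".join(text)
    return func(copy)
```

With the session from the question, sdx_a = call_with_copy(s, a) should then leave a and b in their original capitalisation, since only the throwaway copy is handed to the extension.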