Why does a space affect the identity comparison of equal strings? [duplicate]

Tags: python

I've noticed that two identical strings containing a space compare as different objects using is, while the versions without a space compare as the same object.

a = 'abc'
b = 'abc'
a is b
#outputs: True

a = 'abc abc'
b = 'abc abc'
a is b
#outputs: False

I have read this question about comparing strings with == and is. I think this is a different question because the space character is changing the behavior, not the length of the string. See:

a = 'abc'
b = 'abc'
a is b # True

a = 'gfhfghssrtjyhgjdagtaerjkdhhgffdhfdah'
b = 'gfhfghssrtjyhgjdagtaerjkdhhgffdhfdah'
a is b # True

Why does adding a space to the string change the result of this comparison?

Asked by midkin, Feb 04 '15 19:02

1 Answer

The CPython interpreter caches (interns) some strings based on certain criteria. The first string, abc, meets them, so a single object is reused for both names; the second, abc abc, does not. Small ints from -5 to 256 are cached in a similar way.

Because the string is interned, assigning "abc" to both a and b makes them point to the same object in memory, so is, which checks whether two names refer to the very same object, returns True.

The second string, abc abc, is not cached, so a and b point to two entirely different objects in memory and our identity check using is returns False. This time a is not b.

In [43]: a = "abc" # python caches abc
In [44]: b = "abc" # it reuses the object when assigning to b
In [45]: id(a)
Out[45]: 139806825858808    # same id's, same object in memory
In [46]: id(b)
Out[46]: 139806825858808    
In [47]: a = 'abc abc'   # not cached  
In [48]: id(a)
Out[48]: 139806688800984    
In [49]: b = 'abc abc'    
In [50]: id(b)         # different id's different objects
Out[50]: 139806688801208
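
To make the distinction between equality and identity concrete, here is a minimal sketch, assuming Python 3 on CPython (where sys.intern replaces the Python 2 builtin intern). The helper build is a hypothetical name introduced for illustration; it constructs the strings at runtime so the result does not depend on how literals are compiled.

```python
import sys

def build(parts):
    # Joining at runtime produces a fresh string object on each call,
    # so nothing here relies on how the compiler treats literals.
    return " ".join(parts)

a = build(["abc", "abc"])
b = build(["abc", "abc"])

same_value = (a == b)     # compares the contents of the strings
same_object = (a is b)    # compares identity: are they one object?

# sys.intern returns one canonical object per distinct string value,
# so interning both results yields the same object twice.
ia = sys.intern(a)
ib = sys.intern(b)
interned_same_object = (ia is ib)

print(same_value, same_object, interned_same_object)
```

In practice, compare string contents with == and keep is for identity checks such as x is None.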

The criterion for caching a string is that it contains only letters, digits and underscores, so in your case the space means the string does not qualify.

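In the interactive interpreter there is one case where both names can end up pointing to the same object even when the string does not meet the above criterion: assigning both in a single statement. The two equal literals are then compiled together, and the compiler stores the constant only once.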

In [51]: a,b  = 'abc abc','abc abc'

In [52]: id(a)
Out[52]: 139806688801768

In [53]: id(b)
Out[53]: 139806688801768

In [54]: a is b
Out[54]: True
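
One way to see why this happens is to compile the assignments yourself. The sketch below, assuming Python 3 on CPython, runs two assignments as a single compiled unit and then as two separately compiled units, which is roughly what the interactive prompt does for separate lines. The helper run_units and the "<unit>" filename are illustrative names, not anything from the session above.

```python
def run_units(*sources):
    """Compile and run each source string as its own code object,
    sharing one namespace, and return that namespace."""
    namespace = {}
    for source in sources:
        exec(compile(source, "<unit>", "exec"), namespace)
    return namespace

# Both literals live in one code object, so the compiler stores the
# constant once and both names are bound to that single object.
together = run_units("a = 'abc abc'\nb = 'abc abc'\n")
shared_in_one_unit = together["a"] is together["b"]

# Each literal is compiled on its own, as with separate prompt lines.
# Whether the two objects coincide here depends on the interning rules
# of the interpreter build in use.
apart = run_units("a = 'abc abc'", "b = 'abc abc'")
shared_across_units = apart["a"] is apart["b"]

print(shared_in_one_unit, shared_across_units)
```

This is also why the same comparison can give True when the code is run from a script or inside a function, where all the literals are compiled together.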

Looking at the codeobject.c source for the criteria, we see that NAME_CHARS decides what can be interned:

#define NAME_CHARS \
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

/* all_name_chars(s): true iff all chars in s are valid NAME_CHARS */

static int
all_name_chars(unsigned char *s)
{
    static char ok_name_char[256];
    static unsigned char *name_chars = (unsigned char *)NAME_CHARS;

    if (ok_name_char[*name_chars] == 0) {
        unsigned char *p;
        for (p = name_chars; *p; p++)
            ok_name_char[*p] = 1;
    }
    while (*s) {
        if (ok_name_char[*s++] == 0)
            return 0;
    }
    return 1;
}
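
For readers less comfortable with C, the same check can be sketched in Python. This is only an illustrative translation of the function above, not code taken from CPython; the names mirror the C source for readability.

```python
import string

# Same character set as the NAME_CHARS macro in the C source.
NAME_CHARS = frozenset(string.digits + string.ascii_letters + "_")

def all_name_chars(s):
    """Return True if every character in s is in NAME_CHARS.

    Like the C version, an empty string passes the check.
    """
    return all(ch in NAME_CHARS for ch in s)

print(all_name_chars("abc"))      # no space, passes the check
print(all_name_chars("abc abc"))  # the space fails the check
```

Applied to the strings from the question, abc passes and abc abc fails, which matches the identity results observed.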

A string of length 0 or 1 will always be shared, as we can see in the PyString_FromStringAndSize function in the stringobject.c source:

/* share short strings */
    if (size == 0) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        nullstring = op;
        Py_INCREF(op);
    } else if (size == 1 && str != NULL) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        characters[*str & UCHAR_MAX] = op;
        Py_INCREF(op);
    }
    return (PyObject *) op;
}
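
The excerpt above is from the Python 2 sources. As a rough check of the same idea on Python 3 with CPython (an assumption; the language itself does not promise any of this), the sketch below builds short strings at runtime and compares their identities. runtime_slice is a hypothetical helper used only for illustration.

```python
def runtime_slice(text, start, stop):
    # Slice at runtime so the result is not a compile-time constant.
    return text[start:stop]

# Empty strings: CPython hands back one shared empty string object.
empty_shared = runtime_slice("abc", 0, 0) is runtime_slice("xyz", 1, 1)

# One-character ASCII strings are also served from a shared cache.
single_shared = runtime_slice("abc", 0, 1) is runtime_slice("a b", 0, 1)

# Longer slices are fresh objects, even when their contents are equal.
longer_shared = runtime_slice("abc abc", 0, 3) is runtime_slice("abc abc", 4, 7)

print(empty_shared, single_shared, longer_shared)
```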

Not directly related to the question, but for those interested, PyCode_New, also from the codeobject.c source, shows how more strings are interned when building a code object, provided the strings meet the criteria in all_name_chars.

PyCodeObject *
PyCode_New(int argcount, int nlocals, int stacksize, int flags,
       PyObject *code, PyObject *consts, PyObject *names,
       PyObject *varnames, PyObject *freevars, PyObject *cellvars,
       PyObject *filename, PyObject *name, int firstlineno,
       PyObject *lnotab)
{
    PyCodeObject *co;
    Py_ssize_t i;
    /* Check argument types */
    if (argcount < 0 || nlocals < 0 ||
        code == NULL ||
        consts == NULL || !PyTuple_Check(consts) ||
        names == NULL || !PyTuple_Check(names) ||
        varnames == NULL || !PyTuple_Check(varnames) ||
        freevars == NULL || !PyTuple_Check(freevars) ||
        cellvars == NULL || !PyTuple_Check(cellvars) ||
        name == NULL || !PyString_Check(name) ||
        filename == NULL || !PyString_Check(filename) ||
        lnotab == NULL || !PyString_Check(lnotab) ||
        !PyObject_CheckReadBuffer(code)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    intern_strings(names);
    intern_strings(varnames);
    intern_strings(freevars);
    intern_strings(cellvars);
    /* Intern selected string constants */
    for (i = PyTuple_Size(consts); --i >= 0; ) {
        PyObject *v = PyTuple_GetItem(consts, i);
        if (!PyString_Check(v))
            continue;
        if (!all_name_chars((unsigned char *)PyString_AS_STRING(v)))
            continue;
        PyString_InternInPlace(&PyTuple_GET_ITEM(consts, i));
    }

This answer is based on simple assignments using the CPython interpreter. Interning in relation to functions, or any other functionality beyond simple assignments, was not asked about and is not covered here.

If anyone with a greater understanding of the C code has anything to add, feel free to edit.

There is a much more thorough explanation of string interning as a whole here.

Answered by Padraic Cunningham, Oct 31 '22 08:10
