Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How are basic data types (strings and integers) implemented in Python and Perl

Tags:

python

perl

I've been wondering lately how various operations I perform on basic types like strings and integers work in terms of performance, and I figure I could get a much better idea of this if I knew how those basic types were implemented (i.e. I've heard strings and integers are immutable in Python. Does that mean any operation that modifies one character in a string is O(n) because a completely new string has to be created? How about adding numbers?)

I'm curious about this in both Python and Perl, and felt silly asking basically the same question twice, so I'm just wrapping it into one.

If you can include some example operation costs with your answer, that would make it even more helpful.

like image 918
Eli Avatar asked Jun 15 '11 21:06

Eli


People also ask

What are the basic data types in Python 3?

No matter the language, there are a few basic data types you'll use all the time - strings, numbers, booleans, lists, and dictionaries. Those data types, and how to use them in Python 3, are the topic of this blog post series. Today, we're starting with strings. If you're learning Python, you might also want to check out TwilioQuest 3 .

How to create various types of integers in Python?

This is how you would create various integers in Python: In Python 3 there are no different integer types as in Python 2.7, which has int and long int. In Python 3 there is only one integer type, which can store very large numbers. For example: Python has built-in methods for converting between integers and floats.

What is an INT data type in Python?

These can be decimal values, floating point values or even complex numbers. The int data type deals with integers values. This means values like 0, 1, -2 and -15, and not numbers like 0.5, 1.01, -10.8, etc. If you give Python the following code, it will conclude that a is an integer and will assign the int data type to it:

What are strings in Python?

Strings Strings are sequences of character data. The string type in Python is called str. String literals may be delimited using either single or double quotes.


1 Answers

In python, some_string[5] = 'a' would be an error, but the closest equivalent operation, some_string = some_string[5:] + 'a' + some_string[6:] would indeed be O(n). But that's not just true of immutable objects. The same is true for concatenating lists: [1,2,3] + [4,5,6] generates a new list and is O(n).

Adding numbers creates a new value, but generally the resulting value is always the same size in memory, so it's O(1). Of course that only holds with small ints. Once you hit some threshold (20 digits on my machine), suddenly ints take up a variable amount of space. I don't know how that effects asymptotic performance.

However, I found that it doesn't seem to have even a significant effect even near log10(n) == 1000:

>>> times = [timeit.timeit(stmt=stmt.format(10 ** i, 10 ** i), number=100) for i in range(1000)]
>>> sum(times) * 1.0 / len(times)
3.0851364135742186e-06
>>> times[-1]
3.0994415283203125e-06

For strings, the asymptotic performance hit is more immediately apparent:

>>> stmt = 's[:5] + "a" + s[6:]'
>>> setup = 's = "b" * {0}'
>>> times = [timeit.timeit(stmt=stmt, setup=setup.format(i), number=10) for i in range(100000)]
>>> sum(times) * 1.0 / len(times)
6.2434492111206052e-05
>>> times[-1]
0.0001220703125

The execution time for the last operation is well below the average. And the trend is pretty steady:

>>> for t in times[0:100000:10000]:
...     print t
... 
5.00679016113e-06
1.31130218506e-05
2.90870666504e-05
3.88622283936e-05
5.10215759277e-05
6.19888305664e-05
7.41481781006e-05
8.48770141602e-05
9.60826873779e-05
0.000108957290649

Still, operations like these on small strings are pretty cheap.


To expand on your other questions, indexed access is O(1) on both lists and strings.

>>> stmt = 'x = s[{0}] + s[{1}] + s[{2}]'
>>> setup = 's = "a" * {0}'
>>> times = [timeit.timeit(stmt=stmt.format(i / 2, i / 3, i / 4), setup=setup.format(i + 1), number=10) for i in range(1000000)]
>>> sum(times) * 1.0 / len(times)
3.6441037654876707e-06
>>> times[-1]
3.0994415283203125e-06

Likewise with lists:

>>> stmt = 'x = s[{0}] + s[{1}] + s[{2}]'
>>> setup = 's = ["a"] * {0}'
>>> times = [timeit.timeit(stmt=stmt.format(i / 2, i / 3, i / 4), setup=setup.format(i + 1), number=10) for i in range(100000)]
>>> sum(times) * 1.0 / len(times)
2.8617620468139648e-06
>>> times[-1]
1.9073486328125e-06

Slicing copies both strings and lists, and is therefore O(n) with n == len(slice). There is no "good" way to replace one letter of a string, although I want to emphasize that the "bad" way is good enough most of the time. If you want a "good" way, use a different data type; manipulate a list and join it when a string is required; or use a StringIO object. This page has some useful information on concatenating different built-in Python datatypes.

Finally, since you're really interested in the internals, I dug up the struct declaration of PyStringObject in stringobject.h (from version 2.7; 3+ probably looks different). It's about what you'd expect -- a c string with some extra bells and whistles:

typedef struct {
    PyObject_VAR_HEAD

(PyObject_VAR_HEAD is a c preprocessor macro that expands to something like the below depending on rules explained here.)

    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
    Py_ssize_t ob_size;

Continuing...

    long ob_shash;
    int ob_sstate;
    char ob_sval[1];

    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     *     ob_sstate != 0 iff the string object is in stringobject.c's
     *       'interned' dictionary; in this case the two references
     *       from 'interned' to this object are *not counted* in ob_refcnt.
     */
} PyStringObject;

Lists have a similar structure -- c arrays with extra bells and whistles -- but aren't null terminated and generally have extra preallocated storage space.

Needless to say... much of this this applies only to cPython -- PyPy, IronPython, and Jython probably all look totally different!

like image 68
senderle Avatar answered Oct 19 '22 22:10

senderle