Python 3 string index lookup is O(1)?

Tags:

python

utf-8

Short story:

Is Python 3 unicode string lookup O(1) or O(n)?

Long story:

Index lookup of a character in a C char array is constant time O(1) because we can with certainty jump to a contiguous memory location:

const char* mystring = "abcdef";
char its_d = mystring[3];

It's the same as saying:

char its_d = *(mystring + 3);

This works because sizeof(char) is 1 by definition in C99, and because an ASCII character fits in one byte.

In Python 3, where string literals are Unicode strings, we have the following:

>>> mystring = 'ab€cd'
>>> len(mystring)
5
>>> mybytes = mystring.encode('utf-8')
>>> len(mybytes)
7
>>> mybytes
b'ab\xe2\x82\xaccd'
>>> mystring[2]
'€'
>>> mybytes[2]
226
>>> ord(mystring[2])
8364

Because the bytes are UTF-8 encoded, byte 2 is > 127, which means the third character (index 2) uses a multibyte representation.

I can only conclude that an index lookup in a Python string CANNOT be O(1) because of the multibyte representation of characters. That would mean that mystring[2] is O(n), and that some on-the-fly interpretation of the memory array is performed in order to find the character at that index. If that's the case, did I miss some relevant documentation stating this?

I wrote a very basic benchmark, but I cannot infer O(n) behaviour from it: https://gist.github.com/carlos-jenkins/e3084a07402ccc25dfd0038c9fe284b5

$ python3 lookups.py
Allocating memory...
Go!
String lookup: 0.513942 ms
Bytes lookup : 0.486462 ms

EDIT: Updated with better example.

Havok asked Dec 15 '16 20:12

1 Answer

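UTF-8 is the default source encoding for Python 3, but that concerns source files, not string objects in memory. The internal representation uses fixed-size per-character elements in both Python 2 and Python 3. In CPython 3.3 and later (PEP 393), the element size is 1, 2, or 4 bytes, chosen per string according to its widest character. One of the results is that accessing characters in Python (Unicode) string objects by index has O(1) cost.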
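The fixed-size storage can be observed indirectly. The following is a minimal sketch, assuming CPython 3.3 or later (where PEP 393 applies); the helper name bytes_per_char is hypothetical and not part of any library. It estimates the per-character width by comparing the sizes of two strings made of the same character.

```python
import sys


def bytes_per_char(sample: str, n: int = 1000) -> int:
    """Estimate how many bytes CPython stores per character of `sample`.

    Hypothetical helper: it compares the sizes of two strings built from the
    same single character, so the fixed object header cancels out.
    Assumes CPython 3.3+; other implementations may store strings differently.
    """
    short = sample * n
    long = sample * (2 * n)
    return (sys.getsizeof(long) - sys.getsizeof(short)) // n


# The width depends on the widest character in the string, not on UTF-8.
print(bytes_per_char("a"))   # ASCII character
print(bytes_per_char("€"))   # character in the Basic Multilingual Plane
print(bytes_per_char("😀"))  # character outside the Basic Multilingual Plane
```

On such a CPython build the three calls should report 1, 2, and 4 bytes respectively. Because every element in a given string has the same width, the address of character i can be computed directly, just as in the C example from the question.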

The code and results you presented do not demonstrate otherwise. You convert a string to a UTF-8-encoded byte sequence, and we all know that UTF-8 uses variable-length code sequences, but none of that says anything about the internal representation of the original string.
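One way to probe the lookup cost empirically is to time lookups at different positions of a long non-ASCII string. This is only a rough sketch using the standard timeit module; the helper name time_lookup, the string size, and the repeat count are arbitrary choices for illustration, and timings will vary between machines.

```python
import timeit


def time_lookup(text: str, index: int, repeats: int = 100_000) -> float:
    """Return the total seconds taken by `repeats` evaluations of text[index]."""
    return timeit.timeit(lambda: text[index], number=repeats)


# 1,000,000 characters, including a non-ASCII character in every group of five.
text = "ab€cd" * 200_000

# Compare a lookup near the start, in the middle, and near the end.
for index in (2, len(text) // 2, len(text) - 3):
    seconds = time_lookup(text, index)
    print(f"index {index:>7}: {seconds * 1e3:.3f} ms for 100,000 lookups")
```

If lookups had to scan a variable-length encoding from the start of the string, the time near the end would be expected to grow with the index. Roughly equal timings at all three positions are consistent with constant-time access, although a benchmark like this cannot prove it.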

John Bollinger answered Sep 25 '22 17:09