Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Time complexity of python string index access?

If I'm not mistaken, a Python string is stored in unicode scalars. However, unicode scalars can combine to form other grapheme clusters. Therefore, using memory displacement start + scalarSize * n for string[n] isn't the answer you're looking for.

Does this mean that Python iterates linearly through each scalar to get to the scalar you are looking for? If you have

word = 'caf' + char(65) + char(301) #café

Does Python store this as five scalars and iteratively check if any should be combined before moving on or does it run a check upon insertion and store 'pure' scalars?

Edit: I was confusing Python with another language. Python's print() prints out grapheme clusters but Python's str stores scalars no matter how you input them. So two combined scalars will print as one grapheme cluster which could be the same cluster as another scalar. When you go to call string[0] you'd get the scalar as inserted into the string.

like image 594
lanza Avatar asked Jun 26 '16 21:06

lanza


1 Answers

Python string indexing does not consider grapheme clusters. It works by Unicode code points. I don't think Python actually has anything built-in for working with grapheme clusters.

String indexing takes constant time, but if you want to retrieve the nth grapheme cluster, string indexing won't do that for you.

(People sometimes suggest applying canonical composition to the string, but there are plenty of possible grapheme clusters that still take multiple code points after canonical composition.)

like image 73
user2357112 supports Monica Avatar answered Oct 17 '22 10:10

user2357112 supports Monica