If I'm not mistaken, a Python string is stored in unicode scalars. However, unicode scalars can combine to form other grapheme clusters. Therefore, using memory displacement start + scalarSize * n
for string[n]
isn't the answer you're looking for.
Does this mean that Python iterates linearly through each scalar to get to the scalar you are looking for? If you have
word = 'caf' + char(65) + char(301) #café
Does Python store this as five scalars and iteratively check if any should be combined before moving on or does it run a check upon insertion and store 'pure' scalars?
Edit: I was confusing Python with another language. Python's print()
prints out grapheme clusters but Python's str
stores scalars no matter how you input them. So two combined scalars will print as one grapheme cluster which could be the same cluster as another scalar. When you go to call string[0]
you'd get the scalar as inserted into the string.
Python string indexing does not consider grapheme clusters. It works by Unicode code points. I don't think Python actually has anything built-in for working with grapheme clusters.
String indexing takes constant time, but if you want to retrieve the nth grapheme cluster, string indexing won't do that for you.
(People sometimes suggest applying canonical composition to the string, but there are plenty of possible grapheme clusters that still take multiple code points after canonical composition.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With