
Is string internally stored as individual characters, each character in memory shared by other similar strings?

For example, is the string var1 = 'ROB' stored as 3 memory locations R, O and B each with its own address and the variable var1 points to the memory location R? Then how does it point to O and B?

And do other strings – for example: var2 = 'BOB' – point to the same B and O in memory that var1 refers to?

asked Jul 12 '19 07:07 by variable




2 Answers

How strings are stored is an implementation detail, but in practice, on the CPython reference interpreter, they're stored as a C-style array of characters. So if the R is at address x, then O is at x+1 (or +2 or +4, depending on the largest ordinal value in the string), and B is at x+2 (or +4 or +8). Because the letters are stored consecutively, knowing where R is (and a flag in the str that says how big each character's storage is) is enough to locate O and B.
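The per-character width described above can be observed indirectly from Python. The following is a minimal sketch, assuming CPython 3.3 or later (other interpreters may store strings differently); the helper name `bytes_per_char` is made up for this illustration. It grows a string by one character and measures how much `sys.getsizeof` changes.

```python
import sys

def bytes_per_char(ch: str) -> int:
    """Estimate how many bytes one more copy of `ch` adds to a string.

    Hypothetical helper for illustration only. It relies on CPython's
    flexible string representation (PEP 393) being reflected in
    sys.getsizeof, which is an implementation detail.
    """
    return sys.getsizeof(ch * 11) - sys.getsizeof(ch * 10)

# A character in the Latin-1 range, one in the rest of the BMP,
# and one outside the BMP.
narrow = bytes_per_char('R')
medium = bytes_per_char('\u20ac')      # EURO SIGN
wide = bytes_per_char('\U0001F600')    # an emoji outside the BMP

print(narrow, medium, wide)
```

On CPython this is expected to report 1, 2, and 4 bytes respectively, matching the "x+1 (or +2 or +4)" offsets in the answer.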

'BOB' is at a completely different address, y, and its O and B are contiguous as well. The OB in 'ROB' is utterly unrelated to the OB in 'BOB'.

There is a confusing aspect to this. If you index into the strings, and check the id of the result, it will seem like 'O' has the same address in both strings. But that's only because:

  1. Indexing into a string returns a new string object, unrelated to the one being indexed, and
  2. CPython caches length-one strings in the Latin-1 range, so 'O' is a singleton (no matter how you create it, you get back the same cached object).
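A short sketch of the two points above, using the variable names from the question. The identity of the two 'O' results is CPython behavior (an implementation detail), not something the language guarantees.

```python
var1 = 'ROB'
var2 = 'BOB'

# The two strings are separate objects with separate character storage.
strings_are_distinct = var1 is not var2

# Indexing builds (or looks up) a one-character string object;
# it does not hand back a pointer into var1 or var2.
o1 = var1[1]
o2 = var2[1]

# On CPython, both lookups return the same cached 'O' object,
# which is why id(o1) == id(o2) even though var1 and var2 share nothing.
same_cached_object = o1 is o2

print(strings_are_distinct, same_cached_object, id(o1) == id(o2))
```

So matching ids for `var1[1]` and `var2[1]` say something about the single-character cache, and nothing about how 'ROB' and 'BOB' are laid out in memory.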

I'll note that the actual str internals in modern Python are even more complicated than I covered above; a single string might store the same data in up to three different encodings in the same object (the canonical form, and cached version(s) for working with specific Python C APIs). It's not visible from the Python level aside from checking the size with sys.getsizeof though, so it's not worth worrying about in general.

If you really want to head off into the weeds, feel free to read PEP 393: Flexible String Representation which elaborates on the internals of the new str object structure adopted in CPython 3.3.

answered Oct 21 '22 22:10 by ShadowRanger


This is only a partial answer:

  • var1 is a name that refers to a string object 'ROB'.
  • var2 is a name that refers to another string object 'BOB'.

How a string object stores the individual characters, and whether different string objects share the same memory, I cannot answer now in more detail than "sometimes" and "it depends". It has to do with string interning, which may be used.
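As an illustration of the interning mentioned above, here is a minimal sketch using `sys.intern` from the standard library. The strings are built at runtime on purpose, since whether equal literals end up as one object depends on the interpreter and the compiler.

```python
import sys

# Build two equal strings at runtime so the compiler cannot merge them.
a = ''.join(['ro', 'b!'])
b = ''.join(['ro', 'b!'])

equal_values = a == b            # same characters
# Whether `a is b` holds here is an implementation detail;
# on CPython these are typically two separate objects.

# sys.intern returns one canonical object for each distinct value,
# so interning both gives back the very same object.
ia = sys.intern(a)
ib = sys.intern(b)
shared_after_intern = ia is ib

print(equal_values, shared_after_intern)
```

Interning shares whole string objects with equal contents. It does not make 'ROB' and 'BOB' share individual characters.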

answered Oct 21 '22 20:10 by mkrieger1