
How does SQL determine a character's length in a varchar?


After reading the documentation, I understand that a VARCHAR value is stored with a one-byte or two-byte length prefix that records its length. I also understand that, in a VARCHAR, each character may take a different number of bytes depending on the character itself.

So my question is:

How does the DBMS determine each character's length after it's stored?

Meaning: suppose a stored string is 4 characters long, and the first character takes 1 byte, the second 2 bytes, the third 3 bytes and the fourth 4 bytes. When retrieving the string, how does the database know how long each character is, so that it can read the value correctly?
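For example (just to make that scenario concrete; the characters below are arbitrary ones that happen to take 1, 2, 3 and 4 bytes in UTF-8), this is what such a string looks like when encoded, shown here with Python:

```python
s = "Aéह😀"   # 4 characters that encode to 1, 2, 3 and 4 bytes in UTF-8

for ch in s:
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(), len(encoded), "byte(s)")
# A 41 1 byte(s)
# é c3a9 2 byte(s)
# ह e0a4b9 3 byte(s)
# 😀 f09f9880 4 byte(s)
```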

I hope the question is clear; sorry for any English mistakes. Thanks!

asked Oct 19 '17 by lelbil

2 Answers

The way UTF-8 works as a variable-length encoding is that the 1-byte characters can only use 7 bits of that byte.

If the high bit is 0, then the byte is a 1-byte character (which happens to be encoded in the same way as the 128 ASCII characters).

If the high bit is 1, then the byte is part of a multi-byte character. The number of leading 1 bits in the first byte tells the decoder how many bytes the character occupies (110xxxxx starts a 2-byte character, 1110xxxx a 3-byte character, 11110xxx a 4-byte character), and every continuation byte starts with the bits 10.

[Table of UTF-8 byte layouts, from https://en.wikipedia.org/wiki/UTF-8]
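To make that rule concrete, here is a minimal Python sketch (illustrative only, not MySQL's actual decoder) that works out each character's byte length from its lead byte alone:

```python
def utf8_char_length(lead_byte: int) -> int:
    """How many bytes the character starting with lead_byte occupies."""
    if lead_byte & 0b10000000 == 0b00000000:   # 0xxxxxxx -> 1 byte (ASCII)
        return 1
    if lead_byte & 0b11100000 == 0b11000000:   # 110xxxxx -> 2 bytes
        return 2
    if lead_byte & 0b11110000 == 0b11100000:   # 1110xxxx -> 3 bytes
        return 3
    if lead_byte & 0b11111000 == 0b11110000:   # 11110xxx -> 4 bytes
        return 4
    raise ValueError("10xxxxxx is a continuation byte, not the start of a character")

data = "Aéह😀".encode("utf-8")   # 10 bytes holding 4 characters
i = 0
while i < len(data):
    n = utf8_char_length(data[i])              # data[i] is an int in Python 3
    print(data[i:i + n].decode("utf-8"), n, "byte(s)")
    i += n
```

This is why no per-character length needs to be stored: the lead byte of each character already says how many bytes to read next.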

answered by Bill Karwin


If you're talking about UTF-8, that's not quite how it works: the length of each character isn't stored separately. Instead, UTF-8 uses the high bits of each byte to indicate whether the byte starts a new character or continues the previous one, and can store one-, two-, three- or four-byte characters fairly efficiently. This is in contrast to UTF-32, where every character takes exactly four bytes, which is obviously very wasteful for some types of text.

When using UTF-8, or any character set where the characters are a variable number of bytes, there's a disconnect between the length of the string in bytes and the length of the string in characters. In a fixed-length system like Latin1, which is rigidly 8-bit, there's no such drift.

Internally the database is mostly concerned with the length of a field in bytes. The length in characters is only explicitly exposed when calling functions like CHAR_LENGTH() (LENGTH() returns the length in bytes); otherwise it's just a bunch of bytes that, if necessary, can be interpreted as a string.
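For example, the same disconnect is easy to see from Python (in MySQL the corresponding numbers come back from CHAR_LENGTH() and LENGTH()):

```python
s = "café"                       # 4 characters; 'é' takes 2 bytes in UTF-8

print(len(s))                    # 4 -- characters, like CHAR_LENGTH('café')
print(len(s.encode("utf-8")))    # 5 -- bytes, like LENGTH('café')
```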

Historically speaking, the database stored the length of a field in bytes in a single byte, followed by the data itself. That's why VARCHAR(255) is so prevalent: it's the longest string whose length fits in a single-byte length field. Newer databases like Postgres allow character fields up to about a gigabyte, so they use a larger (four-byte) length header.
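As a sketch of that framing (illustrative only, not the exact on-disk format of any engine), a VARCHAR-style value can be pictured as a length prefix followed by the encoded bytes; MySQL uses a one-byte prefix when the column can never need more than 255 bytes, and a two-byte prefix otherwise:

```python
import struct

def pack_varchar(value: str, max_bytes: int = 255) -> bytes:
    """Frame a string as a length prefix plus its UTF-8 bytes (illustration only)."""
    data = value.encode("utf-8")
    if max_bytes <= 255:
        prefix = struct.pack("B", len(data))    # 1-byte length prefix
    else:
        prefix = struct.pack("<H", len(data))   # 2-byte length prefix
    return prefix + data

print(pack_varchar("café").hex(" "))   # 05 63 61 66 c3 a9 -- length 5, then 5 data bytes
```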

answered by tadman