
How does String.toLowerCase() actually work? How can one create that functionality manually?

Tags:

javascript

To convert a string to lowercase, we just need to call toLowerCase() on it. But the language I am working in right now has no such function, so I would need to write one myself. How does JavaScript implement it, and how could I reproduce that by hand?

Holiday asked Nov 22 '25 20:11


1 Answer

For ASCII it's simple: take an uppercase letter's character code, add 32, and you're done, because that's how the numerical codes in ASCII were arranged. But you're asking about JavaScript's toLowerCase(), which is a Unicode function, and there things are complicated.
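As a minimal sketch of that ASCII-only rule (the function name asciiToLowerCase is made up for illustration, and this is not how any JavaScript engine implements toLowerCase()):

```javascript
// ASCII-only lowercasing: a sketch, not a replacement for String.prototype.toLowerCase().
function asciiToLowerCase(input) {
  let result = "";
  for (let i = 0; i < input.length; i++) {
    const code = input.charCodeAt(i);
    // 'A'..'Z' occupy 0x41..0x5A; their lowercase forms sit exactly 0x20 (32) higher.
    if (code >= 0x41 && code <= 0x5a) {
      result += String.fromCharCode(code + 0x20);
    } else {
      // Everything else (digits, punctuation, non-ASCII) is copied through untouched.
      result += input[i];
    }
  }
  return result;
}
```

Anything outside A-Z passes through unchanged, so accented or non-Latin capitals stay uppercase, which is exactly the gap the rest of this answer is about.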

In Unicode land, there aren't just single "uppercase -> lowercase" mappings. There are also "this uppercase character is actually a variant of this other uppercase character" letters, "this uppercase-looking character is actually a ligature and needs to be decomposed into multiple lowercase characters" cases, and "this uppercase character has no lowercase equivalent" cases. So in reality a proper toLowerCase function has to examine the Unicode case data to determine how to transform each character in a string to its lowercase equivalent, if one exists.

For example, for the basic Latin letters (often called "ascii" characters, but that's not really true: ASCII is a set of exactly 128 codes, quite a lot of which are not printable) the Unicode character data (UnicodeData.txt) has entries like this:

...
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
...

So we see that A, with hex code 0x41, has a lowercase equivalent at code 0x61:

...
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
...
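To read such a record programmatically, a sketch along these lines could work. It assumes the semicolon-separated UnicodeData.txt layout shown above, where fields 12, 13 and 14 (counting from 0) hold the simple uppercase, lowercase and titlecase mappings; the function name parseUnicodeDataLine and the property names are invented for this example.

```javascript
// Parse one UnicodeData.txt record into the fields relevant for case mapping.
function parseUnicodeDataLine(line) {
  const fields = line.split(";");
  // An empty field means "no simple mapping of this kind".
  const toCodePoint = (hex) => (hex ? parseInt(hex, 16) : null);
  return {
    codePoint: parseInt(fields[0], 16),
    name: fields[1],
    category: fields[2], // "Lu" = uppercase letter, "Ll" = lowercase letter
    simpleUppercase: toCodePoint(fields[12]),
    simpleLowercase: toCodePoint(fields[13]),
    simpleTitlecase: toCodePoint(fields[14]),
  };
}

const capitalA = parseUnicodeDataLine("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;");
const smallA = parseUnicodeDataLine("0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041");
```

Here capitalA.simpleLowercase is 0x61, while smallA.simpleLowercase is null because a lowercase letter has nothing further to map to.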

So for this set of codes, the rule is:

if (codepoint >= 0x41 && codepoint <= 0x5A) newcodepoint = codepoint + 0x20

However, moving only slightly down the list we see:

...
012A;LATIN CAPITAL LETTER I WITH MACRON;Lu;0;L;0049 0304;;;;N;LATIN CAPITAL LETTER I MACRON;;;012B;
012B;LATIN SMALL LETTER I WITH MACRON;Ll;0;L;0069 0304;;;;N;LATIN SMALL LETTER I MACRON;;012A;;012A
...

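Here, the lowercase and uppercase variants are right next to each other, with the uppercase letter on the even code point and its lowercase form on the odd one directly after it. Adding or subtracting 32 is going to be very wrong indeed. Instead, we need to use the rule

if (codepoint >= 0x0100 && codepoint <= 0x012E && codepoint % 2 === 0) newcodepoint = codepoint + 1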

So a real toLowerCase is a three-stage function:

  1. Find the "mapping set" the character you're looking at is even in, and then
  2. apply that set's rule for mapping between uppercase and lowercase, noting that even though this set exists, it might only map one way, so
  3. if no mapping could be found, follow the official Unicode recommendations on what to do.
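The three stages could be sketched roughly as follows. The rule table is hypothetical and hand-written: it covers only the two ranges discussed above, whereas a real implementation would generate its tables from the full Unicode data and handle one-to-many mappings too. All names here (LOWERCASE_RULES, lowerCodePoint, sketchToLowerCase) are invented for the example.

```javascript
// Hypothetical rule table: only the two ranges from this answer, nothing else.
const LOWERCASE_RULES = [
  // Basic Latin capitals: lowercase is 0x20 higher.
  { first: 0x41, last: 0x5a, isUppercase: () => true, offset: 0x20 },
  // Latin Extended-A pairs: uppercase on even code points, lowercase on the next odd one.
  { first: 0x0100, last: 0x012e, isUppercase: (cp) => cp % 2 === 0, offset: 1 },
];

function lowerCodePoint(cp) {
  // Stage 1: find the mapping set this code point belongs to, if any.
  const rule = LOWERCASE_RULES.find((r) => cp >= r.first && cp <= r.last);
  // Stage 2: apply the set's rule, but only if it maps in this direction.
  if (rule && rule.isUppercase(cp)) return cp + rule.offset;
  // Stage 3: no mapping found, so the character maps to itself.
  return cp;
}

function sketchToLowerCase(input) {
  let result = "";
  // for...of walks the string by code point rather than by UTF-16 code unit.
  for (const ch of input) {
    result += String.fromCodePoint(lowerCodePoint(ch.codePointAt(0)));
  }
  return result;
}
```

Keeping the rules in a table rather than in a chain of if statements means new ranges can be added as data, which is closer in spirit to how generated case tables are used, though this sketch still ignores context-sensitive and multi-character mappings.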

Also, note that in step 1 we might have to do more work than you'd think, because not every language allows every letter to be blindly mapped to a single uppercase or lowercase form: depending on where in the word the letter is, there might be different uppercase or lowercase equivalents (Greek capital sigma, for instance, has a different lowercase form at the end of a word). Just to make things even more fun.
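JavaScript's built-in toLowerCase() can be used to observe two of these complications directly. The snippet below is only a demonstration of the built-in's behavior in a spec-conforming engine, with the sample strings chosen for illustration:

```javascript
// Position-dependent mapping: Greek capital sigma (U+03A3) lowercases differently
// at the end of a word than elsewhere.
const finalSigma = "\u039F\u0394\u039F\u03A3".toLowerCase(); // the word ΟΔΟΣ
const loneSigma = "\u03A3".toLowerCase();                    // a sigma on its own

// One-to-many mapping: LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130) lowercases
// to two code points, "i" followed by COMBINING DOT ABOVE (U+0307).
const dottedI = "\u0130".toLowerCase();

console.log(finalSigma.endsWith("\u03C2")); // final-form sigma at the end of the word
console.log(loneSigma === "\u03C3");        // regular small sigma when standing alone
console.log(dottedI.length);                // the result is longer than the input
```

Neither case can be handled by a function that looks at one code point at a time and returns exactly one code point, which is why a per-character offset table is not enough on its own.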

Text transformations are hard, which is why you almost never try to implement your own version. This is one of those subjects that seems stupidly simple at first glance, but when you actually sit down and research it a little, it turns out to be crazy difficult. You really need an entire team of people to write just one function, so that every edge case is covered and no bugs slip in because you happened to miss a small rule about some rarely used character.

So to answer your question about how you'd go about implementing this for the language you're working with: you don't. Find a string library that supports your language, and file issues with the browsers in which toLowerCase() doesn't work correctly for your example, because those are bugs that need to be fixed in their implementations.

Mike 'Pomax' Kamermans answered Nov 24 '25 22:11


