Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

JavaScript print all used Unicode characters

I am trying to make JavaScript print all Unicode characters. According to my research, there are 1,114,112 Unicode characters.

A script like the following could work:

for(i = 0; i < 1114112; i++) 
    console.log(String.fromCharCode(i));

But I found out that only 10% of the 1,114,112 Unicode characters are used.

How can I can I only print the used unicode characters?

like image 369
Progo Avatar asked Mar 29 '14 22:03

Progo


People also ask

How to check Unicode characters in JavaScript?

charCodeAt() is UTF-16, codePointAt() is Unicode. charCodeAt() returns a number between 0 and 65535. Both methods return an integer representing the UTF-16 code of a character, but only codePointAt() can return the full value of a Unicode value greather 0xFFFF (65535).

Does JavaScript use UTF 8?

Popular encodings are UTF-8, UTF-16 and UTF-32. Most JavaScript engines use UTF-16 encoding, so let's detail into UTF-16.


2 Answers

As Jukka said, JavaScript has no built-in way of knowing whether a given Unicode code point has been assigned a symbol yet or not.

There is still a way to do what you want, though.

I’ve written several scripts that parse the Unicode database and create separate data files for each category, property, script, block, etc. in Unicode. I’ve also created an HTTP API that allows you to programmatically get all code points (i.e. an array of numbers) in a given Unicode category, or all symbols (i.e. an array of strings for each character) with a given Unicode property, or a regular expression with that matches any symbols in a certain Unicode script.

For example, to get an array of strings that contains one item for each Unicode code point that has been assigned a symbol in Unicode v6.3.0, you could use the following URL:

http://mathias.html5.org/data/unicode/format?version=6.3.0&property=Assigned&type=symbols&prepend=window.symbols%20%3D%20&append=%3B

Note that you can prepend and append anything you like to the output by tweaking the URL parameters, to make it easier to reuse the data in your own scripts. An example HTML page that console.log()s all these symbols, as you requested, could be written as follows:

<!DOCTYPE html>
<meta charset="utf-8">
<title>All assigned Unicode v6.3.0 symbols</title>
<script src="http://mathias.html5.org/data/unicode/format?version=6.3.0&property=Assigned&type=symbols&prepend=window.symbols%20%3D%20&append=%3B"></script>
<script>
  window.symbols.forEach(function(symbol) {
    // Do what you want to do with `symbol` here, e.g.
    console.log(symbol);
  });
</script>

Demo. Note that since this is a lot of data, you can expect your DevTools console to become slow when opening this page.


Update: Nowadays, you should use Unicode data packages such as unicode-11.0.0 instead. In Node.js, you can then do the following:

const symbols = require('unicode-11.0.0/Binary_Property/Assigned/symbols.js');
console.log(symbols);

// Or, to get the code points:
require('unicode-11.0.0/Binary_Property/Assigned/code-points.js');

// Or, to get a regular expression that only matches these characters:
require('unicode-11.0.0/Binary_Property/Assigned/regex.js');
like image 92
Mathias Bynens Avatar answered Oct 31 '22 09:10

Mathias Bynens


There is no direct way in JavaScript to find out whether a code point is assigned to a character or not, which appears to be the question here. You need information extracted from suitable sources, and this information needs to be updated whenever new characters are assigned in new versions of Unicode.

There are 1,114,112 code points in Unicode. The Unicode standard assigns to each code point the property gc, General Category. If the value of this property is anything but Cs, Co, or Cn, then the code point is assigned to a character. (Code points with gc equal to Co are Private Use code points, to which no character is assigned, but they may be used for characters by private agreements.)

What you would need to do is to get a copy of some relevant files in the Unicode character database (just a collection of files in specific formats, really) and write code that reads it and generates information about assigned code points. For the purposes of printing all Unicode characters, it might be best to generate the information as an array of ranges of assigned codepoints. And this would need to be repeated when the standard is updated with new characters.

Even the rest isn’t trivial. You would need to decide what it means to print a character. Some characters are control characters that may have an effect such as causing a newline, but lacking a visible glyph. Some (spaces) have empty glyphs. Some (combining marks) are meant to be rendered as marks attached to preceding character, though they have conventional renderings as “standalone” characters, too. Some are meant to take essentially different shapes depending on nearest context; they may have isolated forms, too, but just writing a character after another by no means guarantees that an isolated form is used.

Then there’s the problem of fonts. No single font can contain all Unicode characters, so you would need to find a collection of fonts that cover all of Unicode when used together, preferably so that they stylistically match somehow.

So if you are just looking for a compilation of all printable Unicode characters, consider using the Unicode code charts.

like image 43
Jukka K. Korpela Avatar answered Oct 31 '22 07:10

Jukka K. Korpela