
Counting the byte size of a file encoded in ISO 8859-7 in JavaScript

Background

I am writing an esoteric language called Jolf. It is used on the lovely site codegolf SE. If you don't already know, a lot of challenges are scored in bytes. People have made lots of languages that utilize either their own encoding or a pre-existing encoding.

On the interpreter for my language, I have a byte counter. As you might expect, it counts the number of bytes in the code. Until now, I've been using a UTF-8 en/decoder (utf8.js). I am now using the ISO 8859-7 encoding, which has Greek characters, and the text upload does not actually work with it. I need to count the actual bytes contained within an uploaded file. Also, is there a way to read the contents of said encoded file?

Question

Given a file encoded in ISO 8859-7 obtained from an <input> element on the page, is there any way to obtain the number of bytes contained in that file? And, given "plaintext" (i.e. text put directly into a <textarea>), how might I count the bytes in that as if it were encoded in ISO 8859-7?

What I've tried

The input element is called isogreek. The file resides in the <input> element. The content is ΦX族: a Greek character and a Latin character (each of which should be one byte) and a Chinese character, which should be more than one byte (?).

isogreek.files[0].size;      // is 3; should be more.

var reader = new FileReader();
reader.readAsBinaryString(isogreek.files[0]);      // corrupts the string to `ÖX?`
reader.readAsText(isogreek.files[0]);              // �X?
reader.readAsText(isogreek.files[0],"ISO 8859-7"); // �X?
Conor O'Brien asked Jan 13 '16 23:01


2 Answers

Extended from this comment.

As @pvg mentioned in the comments, the string resulting from readAsBinaryString would be correct, but is corrupted for two reasons:

A. The result is decoded as if it were ISO-8859-1, i.e. each byte becomes the character with that char code. You can use a function to fix this:

function convertFrom1to7(text) {
  // charset is the set of chars in the ISO-8859-7 encoding from 0xA0 and up, encoded with this format:
  // - If the character is in the same position as in ISO-8859-1/Unicode, use a "!".
  // - If the character is a Greek char with 720 subtracted from its char code, use a ".".
  // - Otherwise, use \uXXXX format.
  var charset = "!\u2018\u2019!\u20AC\u20AF!!!!.!!!!\u2015!!!!...!...!.!....................!............................................!";
  var newtext = "", newchar = "";
  for (var i = 0; i < text.length; i++) {
    var char = text[i];
    newchar = char;
    if (char.charCodeAt(0) >= 160) {
      newchar = charset[char.charCodeAt(0) - 160];
      if (newchar === "!") newchar = char;
      if (newchar === ".") newchar = String.fromCharCode(char.charCodeAt(0) + 720);
    }
    newtext += newchar;
  }
  return newtext;
} 
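
If the browser supports the Encoding API, the same mapping can probably be done on the raw bytes instead of on a binary string. This is only a minimal sketch: it assumes that TextDecoder accepts the "iso-8859-7" label (it is listed in the WHATWG Encoding Standard) and that the file is read with readAsArrayBuffer. The names decodeGreekBytes and readGreekFile are made up for illustration.

```javascript
// Decode raw ISO-8859-7 bytes (a Uint8Array) into a JavaScript string.
function decodeGreekBytes(bytes) {
  return new TextDecoder("iso-8859-7").decode(bytes);
}

// Browser-only helper (hypothetical name): reads a File taken from an
// <input> element and hands the decoded text and the byte count to callback.
function readGreekFile(file, callback) {
  var reader = new FileReader();
  reader.onload = function () {
    var bytes = new Uint8Array(reader.result);
    callback(decodeGreekBytes(bytes), bytes.length);
  };
  reader.readAsArrayBuffer(file);
}

// 0xD6 is Φ in ISO-8859-7 and 0x58 is X.
var sample = decodeGreekBytes(new Uint8Array([0xD6, 0x58]));
console.log(sample, sample.length);
```

With this approach the byte count is simply the length of the Uint8Array, which should match isogreek.files[0].size.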

B. The Chinese character isn't a part of the ISO-8859-7 charset (because the charset supports at most 256 unique chars, as the table shows), so the file contains a literal ? (0x3F) in its place, which is why its size really is 3. If you want to include arbitrary Unicode characters in a program, you will probably need to do one of these two things:

  1. Count the bytes of that program in e.g. UTF-8 or UTF-16. This can be done pretty easily with the library you linked. However, if you want this to be done automatically, you'll need a function that checks if the content of the textarea is a valid ISO-8859-7 file, like this:
function isValidISO_8859_7(text) {
  // U+03A2 is left out of the last range: byte 0xD2 is unassigned in ISO-8859-7.
  var charset = /[\u0000-\u00A0\u2018\u2019\u00A3\u20AC\u20AF\u00A6-\u00A9\u037A\u00AB-\u00AD\u2015\u00B0-\u00B3\u0384-\u0386\u00B7\u0388-\u038A\u00BB\u038C\u00BD\u038E-\u03A1\u03A3-\u03CE]/;
  var valid = true;
  for (var i = 0; i < text.length; i++) {
    valid = valid && charset.test(text[i]);
  }
  return valid;
}
  2. Create your own, custom variant of ISO-8859-7 that uses a specific byte (or more than one) to signify that the next 2 or 3 bytes belong to a single Unicode char. This can be pretty much as simple or complex as you like, from one byte signifying a 2-byte char and another signifying a 3-byter, to every byte between 0x80 and 0x9F setting up for the next few. Here's a basic example that uses 0x80 as the 2-byter and 0x81 as the 3-byter (it assumes the text was decoded as ISO-8859-1):
function reUnicode(text) {
  var newtext = "";
  for (var i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) === 0x80) {
      newtext += String.fromCharCode((text.charCodeAt(++i) << 8) + text.charCodeAt(++i));
    } else if (text.charCodeAt(i) === 0x81) {
      var charcode = (text.charCodeAt(++i) << 16) + (text.charCodeAt(++i) << 8) + text.charCodeAt(++i) - 65536;
      newtext += String.fromCharCode(0xD800 + (charcode >> 10), 0xDC00 + (charcode & 1023)); // Convert into a UTF-16 surrogate pair
    } else {
      newtext += convertFrom1to7(text[i]);
    }
  }
  return newtext;
}
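
For the first approach, the automatic choice between encodings could be sketched roughly as follows. It assumes the same character table as the validity check above (without U+03A2) and uses TextEncoder for the UTF-8 fallback instead of utf8.js; countBytes and ISO_8859_7_ONLY are illustrative names, not part of any library.

```javascript
// Matches strings made only of characters ISO-8859-7 can represent
// (assumed table; U+03A2 has no byte in the encoding, so it is left out).
var ISO_8859_7_ONLY = /^[\u0000-\u00A0\u00A3\u00A6-\u00A9\u00AB-\u00AD\u00B0-\u00B3\u00B7\u00BB\u00BD\u037A\u0384-\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03CE\u2015\u2018\u2019\u20AC\u20AF]*$/;

// Returns the encoding used for scoring and the resulting byte count.
function countBytes(text) {
  if (ISO_8859_7_ONLY.test(text)) {
    // Single-byte encoding: one byte per character.
    return { encoding: "ISO-8859-7", bytes: text.length };
  }
  // Fallback: TextEncoder always encodes to UTF-8.
  return { encoding: "UTF-8", bytes: new TextEncoder().encode(text).length };
}

console.log(countBytes("\u03A6X"));       // Greek + Latin only
console.log(countBytes("\u03A6X\u65CF")); // contains a Chinese character
```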

I can go into either method in more detail if you desire.
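
For the second approach, a byte counter for textarea text might look like the sketch below. It assumes the scheme from the basic example (0x80 followed by 2 bytes, 0x81 followed by 3 bytes), and it treats U+0080 and U+0081 as needing an escape because their bytes are taken by the markers. GREEK_CHAR and countEscapedBytes are hypothetical names.

```javascript
// A single character that the custom variant stores as one byte
// (assumed ISO-8859-7 table without U+03A2; U+0080 and U+0081 are
// excluded because those byte values act as escape markers here).
var GREEK_CHAR = /^[\u0000-\u007F\u0082-\u00A0\u00A3\u00A6-\u00A9\u00AB-\u00AD\u00B0-\u00B3\u00B7\u00BB\u00BD\u037A\u0384-\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03CE\u2015\u2018\u2019\u20AC\u20AF]$/;

function countEscapedBytes(text) {
  var total = 0;
  for (var ch of text) {               // for...of walks the string by code point
    if (GREEK_CHAR.test(ch)) {
      total += 1;                      // plain ISO-8859-7 byte
    } else if (ch.codePointAt(0) <= 0xFFFF) {
      total += 3;                      // 0x80 marker + 2 bytes
    } else {
      total += 4;                      // 0x81 marker + 3 bytes
    }
  }
  return total;
}

console.log(countEscapedBytes("\u03A6X\u65CF")); // 1 + 1 + 3
```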

ETHproductions answered Sep 28 '22 15:09


The three characters you gave as an example are encoded in UTF-8 as 6 bytes: ce a6 58 e6 97 8f (0x58 = X). Also: JavaScript works with UTF-16, which results in some funny things like ("abc".length === "ΦX族".length) being true.
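
This can be checked in a console with TextEncoder, which always produces UTF-8. The snippet is just an illustration; the variable names are arbitrary.

```javascript
var text = "\u03A6X\u65CF"; // ΦX族
var bytes = new TextEncoder().encode(text);

// Format each byte as two hex digits, in file order.
var hex = Array.from(bytes, function (b) {
  return b.toString(16).padStart(2, "0");
}).join(" ");

console.log(text.length);  // 3 UTF-16 code units
console.log(bytes.length); // 6 UTF-8 bytes
console.log(hex);          // ce a6 58 e6 97 8f
```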

You most probably need to go the full length and check every single character for its byte length by its code value. You may also need to look at two characters at once in some cases (a character outside the 16-bit range is stored as two UTF-16 units). A BOM needs to be written and checked, too, if necessary (always necessary if you work with files from unknown sources).
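
As a rough sketch of that per-character check for UTF-8, assuming the usual UTF-8 length boundaries and a UTF-8 BOM of EF BB BF (all function names here are invented for the example):

```javascript
// UTF-8 length of one code point, decided by its numeric value.
function utf8LengthOfCodePoint(cp) {
  if (cp <= 0x7F) return 1;
  if (cp <= 0x7FF) return 2;
  if (cp <= 0xFFFF) return 3;
  return 4;
}

// Total UTF-8 length of a JavaScript (UTF-16) string.
function utf8Length(text) {
  var total = 0;
  for (var i = 0; i < text.length; i++) {
    var cp = text.codePointAt(i);
    if (cp > 0xFFFF) i++; // surrogate pair: skip the second code unit
    total += utf8LengthOfCodePoint(cp);
  }
  return total;
}

// True if a byte array starts with the UTF-8 BOM.
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 &&
    bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

console.log(utf8Length("\u03A6X\u65CF"));
```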

EDIT: added on request:

Characters in JavaScript strings are always stored as UTF-16, a two-byte representation of each character. That was all well and nice until they suddenly (ha!) found out that two bytes are not really sufficient for all of the alphabets of the world, so they expanded the Unicode range beyond what two bytes can hold; a fixed-width encoding now needs four bytes: UTF-32.

Well, the Unicode consortium did so but the ECMA committee did not.

It cannot be said that hell broke loose but it is quite close in some circumstances, and one of those is your case because you want to mix one-byte encodings with multiple-byte encodings, different ones even.

One byte fits well in two bytes, but three or more bytes do not, so the so-called surrogates were invented: pairs of two-byte units that together stand for one character. These surrogates are also the reason why it is not so simple to reverse a string in JavaScript.
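
A small illustration of the reversal problem, using U+1F600 as an arbitrary example of a character that needs a surrogate pair:

```javascript
var smile = "\uD83D\uDE00"; // U+1F600, one character stored as two code units
var word = "a" + smile + "b";

// Naive reversal works on code units and tears the pair apart.
var naive = word.split("").reverse().join("");

// Array.from iterates by code point, so the pair stays together.
var byCodePoint = Array.from(word).reverse().join("");

console.log(word.length);             // 4 code units
console.log(Array.from(word).length); // 3 code points
console.log(naive === byCodePoint);   // false
```

Even the code point version still treats combining marks as separate characters, so it is not a complete solution either.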

As I said: a large can of worms.

deamentiaemundi answered Sep 28 '22 16:09