Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Shorten an already short string in Java

I'm looking for a way to shorten an already short string as much as possible.

The string is a hostname:port combo and could look like "my-domain.se:2121" or "123.211.80.4:2122".

I know regular compression is pretty much out of the question on strings this short due to the overhead needed and the lack of repetition but I have an idea of how to do it.

Because the alphabet is limited to 39 characters ([a-z][0-9]-:.) every character could fit in 6 bits. This reduce the length with up to 25% compared to ASCII. So my suggestion is somthing along these lines:

  1. Encode the string to a byte array using some kind of custom encoding
  2. Decode the byte array to a UTF-8 or ASCII string (this string will obviously not make any sense).

And then reverse the process to get the original string.

So to my questions:

  1. Could this work?
  2. Is there a better way?
  3. How?
like image 704
Gustav Karlsson Avatar asked Feb 23 '23 00:02

Gustav Karlsson


2 Answers

You could encode the string as base 40 which is more compact than base 64. This will give you 12 such tokens into a 64 bit long. The 40th token could be the end of string marker to give you the length (as it will not be a whole number of bytes any more)

If you use arithmetic encoding, it could be much smaller but you would need a table of frequencies for each token. (using a long list of possible examples)

class Encoder {
  public static final int BASE = 40;
  StringBuilder chars = new StringBuilder(BASE);
  byte[] index = new byte[256];

  {
    chars.append('\0');
    for (char ch = 'a'; ch <= 'z'; ch++) chars.append(ch);
    for (char ch = '0'; ch <= '9'; ch++) chars.append(ch);
    chars.append("-:.");
    Arrays.fill(index, (byte) -1);
    for (byte i = 0; i < chars.length(); i++)
      index[chars.charAt(i)] = i;
  }

  public byte[] encode(String address) {
    try {
      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      DataOutputStream dos = new DataOutputStream(baos);
      for (int i = 0; i < address.length(); i += 3) {
        switch (Math.min(3, address.length() - i)) {
          case 1: // last one.
            byte b = index[address.charAt(i)];
            dos.writeByte(b);
            break;

          case 2:
            char ch = (char) ((index[address.charAt(i+1)]) * 40 + index[address.charAt(i)]);
            dos.writeChar(ch);
            break;

          case 3:
            char ch2 = (char) ((index[address.charAt(i+2)] * 40 + index[address.charAt(i + 1)]) * 40 + index[address.charAt(i)]);
            dos.writeChar(ch2);
            break;
        }
      }
      return baos.toByteArray();
    } catch (IOException e) {
      throw new AssertionError(e);
    }
  }

  public static void main(String[] args) {
    Encoder encoder = new Encoder();
    for (String s : "twitter.com:2122,123.211.80.4:2122,my-domain.se:2121,www.stackoverflow.com:80".split(",")) {
      System.out.println(s + " (" + s.length() + " chars) encoded is " + encoder.encode(s).length + " bytes.");
    }
  }
}

prints

twitter.com:2122 (16 chars) encoded is 11 bytes.
123.211.80.4:2122 (17 chars) encoded is 12 bytes.
my-domain.se:2121 (17 chars) encoded is 12 bytes.
www.stackoverflow.com:80 (24 chars) encoded is 16 bytes.

I leave decoding as an exercise. ;)

like image 54
Peter Lawrey Avatar answered Feb 25 '23 16:02

Peter Lawrey


First of all, IP addresses are designed to fit into 4 bytes and port numbers into 2. The ascii representation is only for humans to read, so it doesn't make sense to do compression on that.

Your idea for compressing domain name strings is doable.

like image 34
Karl Bielefeldt Avatar answered Feb 25 '23 16:02

Karl Bielefeldt