Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the most efficient way to encode an arbitrary GUID into readable ASCII (33-127)?

Tags:

algorithm

guid

The standard string representation of GUID takes about 36 characters. Which is very nice, but also really wasteful. I am wondering, how to encode it in the shortest possible way using all the ASCII characters in the range 33-127. The naive implementation produces 22 characters, simply because 128 bits / 6 bits yields 22.

Huffman encoding is my second best, the only question is how to choose the codes....

The encoding must be lossless, of course.

like image 905
mark Avatar asked May 13 '10 14:05

mark


2 Answers

This is an old question, but I had to solve it in order for a system I was working on to be backward compatible.

The exact requirement was for a client-generated identifier that would be written to the database and stored in a 20-character unique column. It never got shown to the user and was not indexed in any way.

Since I couldn't eliminate the requirement, I really wanted to use a Guid (which is statistically unique) and if I could encode it losslessly into 20 characters, then it would be a good solution given the constraints.

Ascii-85 allows you to encode 4 bytes of binary data into 5 bytes of Ascii data. So a 16 byte guid will just fit into 20 Ascii characters using this encoding scheme. A Guid can have 3.1962657931507848761677563491821e+38 discrete values whereas 20 characters of Ascii-85 can have 3.8759531084514355873123178482056e+38 discrete values.

When writing to the database I had some concerns about truncation so no whitespace characters are included in the encoding. I also had issues with collation, which I addressed by excluding lowercase characters from the encoding. Also, it would only ever be passed through a paramaterized command, so any special SQL characters would be escaped automatically.

I've included the C# code to perform Ascii-85 encoding and decoding in case it helps anyone out there. Obviously, depending on your usage you might need to choose a different character set as my constraints made me choose some unusual characters like 'ß' and 'Ø' - but that's the easy part:

/// <summary>
/// This code implements an encoding scheme that uses 85 printable ascii characters 
/// to encode the same volume of information as contained in a Guid.
/// 
/// Ascii-85 can represent 4 binary bytes as 5 Ascii bytes. So a 16 byte Guid can be 
/// represented in 20 Ascii bytes. A Guid can have 
/// 3.1962657931507848761677563491821e+38 discrete values whereas 20 characters of 
/// Ascii-85 can have 3.8759531084514355873123178482056e+38 discrete values.
/// 
/// Lower-case characters are not included in this encoding to avoid collation 
/// issues. 
/// This is a departure from standard Ascii-85 which does include lower case 
/// characters.
/// In addition, no whitespace characters are included as these may be truncated in 
/// the database depending on the storage mechanism - ie VARCHAR vs CHAR.
/// </summary>
internal static class Ascii85
{
    /// <summary>
    /// 85 printable ascii characters with no lower case ones, so database 
    /// collation can't bite us. No ' ' character either so database can't 
    /// truncate it!
    /// Unfortunately, these limitation mean resorting to some strange 
    /// characters like 'Æ' but we won't ever have to type these, so it's ok.
    /// </summary>
    private static readonly char[] kEncodeMap = new[]
    { 
        '0','1','2','3','4','5','6','7','8','9',  // 10
        'A','B','C','D','E','F','G','H','I','J',  // 20
        'K','L','M','N','O','P','Q','R','S','T',  // 30
        'U','V','W','X','Y','Z','|','}','~','{',  // 40
        '!','"','#','$','%','&','\'','(',')','`', // 50
        '*','+',',','-','.','/','[','\\',']','^', // 60
        ':',';','<','=','>','?','@','_','¼','½',  // 70
        '¾','ß','Ç','Ð','€','«','»','¿','•','Ø',  // 80
        '£','†','‡','§','¥'                       // 85
    };

    /// <summary>
    /// A reverse mapping of the <see cref="kEncodeMap"/> array for decoding 
    /// purposes.
    /// </summary>
    private static readonly IDictionary<char, byte> kDecodeMap;

    /// <summary>
    /// Initialises the <see cref="kDecodeMap"/>.
    /// </summary>
    static Ascii85()
    {
        kDecodeMap = new Dictionary<char, byte>();

        for (byte i = 0; i < kEncodeMap.Length; i++)
        {
            kDecodeMap.Add(kEncodeMap[i], i);
        }
    }

    /// <summary>
    /// Decodes an Ascii-85 encoded Guid.
    /// </summary>
    /// <param name="ascii85Encoding">The Guid encoded using Ascii-85.</param>
    /// <returns>A Guid decoded from the parameter.</returns>
    public static Guid Decode(string ascii85Encoding)
    { 
        // Ascii-85 can encode 4 bytes of binary data into 5 bytes of Ascii.
        // Since a Guid is 16 bytes long, the Ascii-85 encoding should be 20
        // characters long.
        if(ascii85Encoding.Length != 20)
        {
            throw new ArgumentException(
                "An encoded Guid should be 20 characters long.", 
                "ascii85Encoding");
        }

        // We only support upper case characters.
        ascii85Encoding = ascii85Encoding.ToUpper();

        // Split the string in half and decode each substring separately.
        var higher = ascii85Encoding.Substring(0, 10).AsciiDecode();
        var lower = ascii85Encoding.Substring(10, 10).AsciiDecode();

        // Convert the decoded substrings into an array of 16-bytes.
        var byteArray = new[]
        {
            (byte)((higher & 0xFF00000000000000) >> 56),        
            (byte)((higher & 0x00FF000000000000) >> 48),        
            (byte)((higher & 0x0000FF0000000000) >> 40),        
            (byte)((higher & 0x000000FF00000000) >> 32),        
            (byte)((higher & 0x00000000FF000000) >> 24),        
            (byte)((higher & 0x0000000000FF0000) >> 16),        
            (byte)((higher & 0x000000000000FF00) >> 8),         
            (byte)((higher & 0x00000000000000FF)),  
            (byte)((lower  & 0xFF00000000000000) >> 56),        
            (byte)((lower  & 0x00FF000000000000) >> 48),        
            (byte)((lower  & 0x0000FF0000000000) >> 40),        
            (byte)((lower  & 0x000000FF00000000) >> 32),        
            (byte)((lower  & 0x00000000FF000000) >> 24),        
            (byte)((lower  & 0x0000000000FF0000) >> 16),        
            (byte)((lower  & 0x000000000000FF00) >> 8),         
            (byte)((lower  & 0x00000000000000FF)),  
        };

        return new Guid(byteArray);
    }

    /// <summary>
    /// Encodes binary data into a plaintext Ascii-85 format string.
    /// </summary>
    /// <param name="guid">The Guid to encode.</param>
    /// <returns>Ascii-85 encoded string</returns>
    public static string Encode(Guid guid)
    {
        // Convert the 128-bit Guid into two 64-bit parts.
        var byteArray = guid.ToByteArray();
        var higher = 
            ((UInt64)byteArray[0] << 56) | ((UInt64)byteArray[1] << 48) | 
            ((UInt64)byteArray[2] << 40) | ((UInt64)byteArray[3] << 32) |
            ((UInt64)byteArray[4] << 24) | ((UInt64)byteArray[5] << 16) | 
            ((UInt64)byteArray[6] << 8)  | byteArray[7];

        var lower = 
            ((UInt64)byteArray[ 8] << 56) | ((UInt64)byteArray[ 9] << 48) | 
            ((UInt64)byteArray[10] << 40) | ((UInt64)byteArray[11] << 32) |
            ((UInt64)byteArray[12] << 24) | ((UInt64)byteArray[13] << 16) | 
            ((UInt64)byteArray[14] << 8)  | byteArray[15];

        var encodedStringBuilder = new StringBuilder();

        // Encode each part into an ascii-85 encoded string.
        encodedStringBuilder.AsciiEncode(higher);
        encodedStringBuilder.AsciiEncode(lower);

        return encodedStringBuilder.ToString();
    }

    /// <summary>
    /// Encodes the given integer using Ascii-85.
    /// </summary>
    /// <param name="encodedStringBuilder">The <see cref="StringBuilder"/> to 
    /// append the results to.</param>
    /// <param name="part">The integer to encode.</param>
    private static void AsciiEncode(
        this StringBuilder encodedStringBuilder, UInt64 part)
    {
        // Nb, the most significant digits in our encoded character will 
        // be the right-most characters.
        var charCount = (UInt32)kEncodeMap.Length;

        // Ascii-85 can encode 4 bytes of binary data into 5 bytes of Ascii.
        // Since a UInt64 is 8 bytes long, the Ascii-85 encoding should be 
        // 10 characters long.
        for (var i = 0; i < 10; i++)
        {
            // Get the remainder when dividing by the base.
            var remainder = part % charCount;

            // Divide by the base.
            part /= charCount;

            // Add the appropriate character for the current value (0-84).
            encodedStringBuilder.Append(kEncodeMap[remainder]);
        }
    }

    /// <summary>
    /// Decodes the given string from Ascii-85 to an integer.
    /// </summary>
    /// <param name="ascii85EncodedString">Decodes a 10 character Ascii-85 
    /// encoded string.</param>
    /// <returns>The integer representation of the parameter.</returns>
    private static UInt64 AsciiDecode(this string ascii85EncodedString)
    {
        if (ascii85EncodedString.Length != 10)
        {
            throw new ArgumentException(
                "An Ascii-85 encoded Uint64 should be 10 characters long.", 
                "ascii85EncodedString");
        }

        // Nb, the most significant digits in our encoded character 
        // will be the right-most characters.
        var charCount = (UInt32)kEncodeMap.Length;
        UInt64 result = 0;

        // Starting with the right-most (most-significant) character, 
        // iterate through the encoded string and decode.
        for (var i = ascii85EncodedString.Length - 1; i >= 0; i--)
        {
            // Multiply the current decoded value by the base.
            result *= charCount;

            // Add the integer value for that encoded character.
            result += kDecodeMap[ascii85EncodedString[i]];
        }

        return result;
    }
}

Also, here are the unit tests. They aren't as thorough as I'd like, and I don't like the non-determinism of where Guid.NewGuid() is used, but they should get you started:

/// <summary>
/// Tests to verify that the Ascii-85 encoding is functioning as expected.
/// </summary>
[TestClass]
[UsedImplicitly]
public class Ascii85Tests
{
    [TestMethod]
    [Description("Ensure that the Ascii-85 encoding is correct.")]
    [UsedImplicitly]
    public void CanEncodeAndDecodeAGuidUsingAscii85()
    {
        var guidStrings = new[]
        {
            "00000000-0000-0000-0000-000000000000",
            "00000000-0000-0000-0000-0000000000FF",
            "00000000-0000-0000-0000-00000000FF00",
            "00000000-0000-0000-0000-000000FF0000",
            "00000000-0000-0000-0000-0000FF000000",
            "00000000-0000-0000-0000-00FF00000000",
            "00000000-0000-0000-0000-FF0000000000",
            "00000000-0000-0000-00FF-000000000000",
            "00000000-0000-0000-FF00-000000000000",
            "00000000-0000-00FF-0000-000000000000",
            "00000000-0000-FF00-0000-000000000000",
            "00000000-00FF-0000-0000-000000000000",
            "00000000-FF00-0000-0000-000000000000",
            "000000FF-0000-0000-0000-000000000000",
            "0000FF00-0000-0000-0000-000000000000",
            "00FF0000-0000-0000-0000-000000000000",
            "FF000000-0000-0000-0000-000000000000",
            "FF000000-0000-0000-0000-00000000FFFF",
            "00000000-0000-0000-0000-0000FFFF0000",
            "00000000-0000-0000-0000-FFFF00000000",
            "00000000-0000-0000-FFFF-000000000000",
            "00000000-0000-FFFF-0000-000000000000",
            "00000000-FFFF-0000-0000-000000000000",
            "0000FFFF-0000-0000-0000-000000000000",
            "FFFF0000-0000-0000-0000-000000000000",
            "00000000-0000-0000-0000-0000FFFFFFFF",
            "00000000-0000-0000-FFFF-FFFF00000000",
            "00000000-FFFF-FFFF-0000-000000000000",
            "FFFFFFFF-0000-0000-0000-000000000000",
            "00000000-0000-0000-FFFF-FFFFFFFFFFFF",
            "FFFFFFFF-FFFF-FFFF-0000-000000000000",
            "FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF",
            "1000000F-100F-100F-100F-10000000000F"
        };

        foreach (var guidString in guidStrings)
        {
            var guid = new Guid(guidString);
            var encoded = Ascii85.Encode(guid);

            Assert.AreEqual(
                20, 
                encoded.Length, 
                "A guid encoding should not exceed 20 characters.");

            var decoded = Ascii85.Decode(encoded);

            Assert.AreEqual(
                guid, 
                decoded, 
                "The guids are different after being encoded and decoded.");
        }
    }

    [TestMethod]
    [Description(
        "The Ascii-85 encoding is not susceptible to changes in character case.")]
    [UsedImplicitly]
    public void Ascii85IsCaseInsensitive()
    {
        const int kCount = 50;

        for (var i = 0; i < kCount; i++)
        {
            var guid = Guid.NewGuid();

            // The encoding should be all upper case. A reliance 
            // on mixed case will make the generated string 
            // vulnerable to sql collation.
            var encoded = Ascii85.Encode(guid);

            Assert.AreEqual(
                encoded, 
                encoded.ToUpper(), 
                "The Ascii-85 encoding should produce only uppercase characters.");
        }
    }
}

I hope this saves somebody some trouble.

Also, if you find any bugs then let me know ;-)

like image 54
sheikhjabootie Avatar answered Oct 04 '22 14:10

sheikhjabootie


Use Base 85. See section 4.1. Why 85? of A Compact Representation of IPv6 Addresses

An IPv6 address, like a GUID is made up of eight 16-bit pieces.

like image 44
Paul Butcher Avatar answered Oct 04 '22 14:10

Paul Butcher