Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java: Universal BaseN encoder/decoder working with large data sizes

I'm looking for a decent BaseN encoder (with custom charset) in Java, that is not limited by input data size (array of bytes).

Something like this:

https://github.com/mklemm/base-n-codec-java

But for "unlimited" data length without any unnecessary memory/performance penalty and "BigInteger abuse magic". Simply something that works as standard BASE64 encoders, but universally for any base/charset. Any solution, or idea how to achieve that is welcomed.

Maybe, if someone has experiences with apache BaseNCodec:

https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/BaseNCodec.html

It looked promising, however it's an Abstract class, and available implemetations look harder to make, than start from scratch.


I need it for a binary data to custom character set encoder (where the number of characters in the set is mutable, "ABCDE" = Base5, "ABCDE-+*/." = Base10, ...).
Update: The "Base N Codec" from GitHub (above) seems to be buggy, so I used the following code at the end:

https://dzone.com/articles/base-x-encoding

like image 602
Jolinar Avatar asked Nov 09 '16 12:11

Jolinar


3 Answers

General answer: No. Special case: Yes, for bases a power of 2.

Why? Because thoughts in the Q are in "strong competition" (actually probably "contradiction").

  1. As input, you want to support an unlimited integer in some base N (think of it as a BigIntegerBaseN). As output, you wat to support an unlimited integer in some base M (think of it as a BigIntegerBaseM).
  2. You want to carry out base conversion - which is mathematically defined as a series of (multiplications & additions) and divisions. See http://www.cut-the-knot.org/recurrence/conversion.shtml and https://math.stackexchange.com/questions/48968/how-to-change-from-base-n-to-m.
  3. You want to find a way of calculating such results without doing multiplications and divisions on BigIntegers (in any base of implementation).

Can you determine results of multiplication and division operations without carrying out multiplication and division calculations? NO. It's a contradiction. When you get the results, by definition, you've carried out the calculation.

So it's not a question of can you avoid the calcuations, but a question of how to streamline them.

  • If N and/or M are in bases a power of 2, then multiplication/division can be calculated by simple bit-shifting = same calculation with major stream-lining. That can be done by avoiding BigInteger calcs.
  • Otherwise, you can cache certain repeated calculations, storing interim results in an array or HashMap, then you get the same calculations with streamlining. But BigInteger calcs are still required (but redundant repetitions are avoided).

Hope that helps your approach. :)

like image 127
Glen Best Avatar answered Nov 09 '22 15:11

Glen Best


A base N encoding is quite efficient if N is a power of 2, as then conversion can happen between fixed size groups of digits and a fixed size of bytes.

Base64: 26 - 6 bits per digit, hence 4 digits = 24 bits = 3 bytes.

Otherwise school multiplication must happen over the entire length, resulting in much "BigInteger" calculation.

A bit faster instead of for instance repeatedly multiplying/dividing by the base N, is having an array of powers of N.

For encoding of a byte array to digits you could use N0, N1, N2, N3, ... as byte arrays of lesser or equal lengths, and do repeated subtractions.

As byte is signed, short might be more suited. Say if the highest byte of the number is 98 and the lessequal N-power is 12 then circa 7 is that digit.

For decoding of digits to a byte array the same powers might be used.

Have fun.

like image 20
Joop Eggen Avatar answered Nov 09 '22 13:11

Joop Eggen


You mention two very different approaches. The BaseN algorithm used in Github implementation is using the mathematical notation of converting an integer between bases. This is equivalent to saying that 10 is the same as 12 in base-8 arithmetic or 1010 in base-2 arithmetic. The algorithm interprets the byte stream as a large number and converts to the assigned base.

Base64 is a very different approach, and you can see an example in Wikipedia Base64 page. The algorithm basically splits the input stream into an array of 6 bits to each element. 2^6 = 64, thus the name Base64. It has a table with the 64 different characters and displays each element in the array (6-bit) to the corresponding conversion table.

I think that you need to select one of the two approaches, since they are very different and not compatible with each other. As for the implementation details, if opting for the second method, this would easier to implement I think, since you basically split into fixed-size parts the stream and encode it according to your own table.

The first method can get quite complicated, since arbitrary arithmetic operations rely on quite complex constructs. You can have a look at exist software, again @ Wikipedia' s list of arbitrary-precision arithmetic software.

Realistically, I think at some point you will find it hard to get characters for your conversions (as the base goes up or the number of bits goes up), unless you will be using the whole Unicode alphabet :).

Hope I helped a bit

like image 27
Alexios Tsiaparas Avatar answered Nov 09 '22 13:11

Alexios Tsiaparas