Tokenize byte array

I have an array of raw bytes which I need to tokenize into a list of byte arrays in Java. The following method declaration explains it better.

public static List<byte[]> splitMessage(byte[] rawByte, String tokenDelimiter)

Example runs.

Example Run 1:

Raw byte

byte[] rawBytes = new byte[]{72,118,121,49,85,118,97,113,111,124,44,124,49,48,43,57,48,36,63,49,66,70,22,18,124,44,124,23,27,25,54,24,24,34,44,57,69,66,49,47,66,16,39,35,32,36,30,50,63,124,44,124,16,18,24,64,4,94,124,44,124,19,31,42,55,66,46,34,62,34,37};

tokenDelimiter is |,| (i.e. 124,44,124)

So the returned List is:

Token 1: 72,118,121,49,85,118,97,113,111
Token 2: 49,48,43,57,48,36,63,49,66,70,22,18
Token 3: 23,27,25,54,24,24,34,44,57,69,66,49,47,66,16,39,35,32,36,30,50,63
Token 4: 16,18,24,64,4,94
Token 5: 19,31,42,55,66,46,34,62,34,37

Example Run 2:

byte[] rawBytes = new byte[]{72,118,121,49,85,118,97,113,111,124,44,124,49,48,43,57,48,36,63,49,66,70,22,18,124,44,124,124,44,124,23,27,25,54,24,24,34,44,57,69,66,49,47,66,16,39,35,32,36,30,50,63,124,44,124,16,18,24,64,4,94,124,44,124,19,31,42,55,66,46,34,62,34,37,124,44,124,124,44,124};

tokenDelimiter is |,| (i.e. 124,44,124)

Token 1: 72,118,121,49,85,118,97,113,111
Token 2: 49,48,43,57,48,36,63,49,66,70,22,18
Token 3: <Empty>
Token 4: 23,27,25,54,24,24,34,44,57,69,66,49,47,66,16,39,35,32,36,30,50,63
Token 5: 16,18,24,64,4,94
Token 6: 19,31,42,55,66,46,34,62,34,37
Token 7: <Empty>
Token 8: <Empty>

I am able to achieve Example Run 1 with the following code snippet, but I am stuck on the empty tokens in the second one.

public static List<byte[]> splitMessageSept19(byte[] rawByte, String tokenDelimiter) throws UnsupportedEncodingException
{
    List<byte[]> tokens = new ArrayList<byte[]>();

    final byte[] byteArray = tokenDelimiter.getBytes("UTF-8");
    final byte byteDelimitorFirstByte  = byteArray[0];

    int bytenum =0 ;
    int lastIndex = 0;
    int storIterator =0;
    for ( int iterator = 0 ; iterator <= rawByte.length ; iterator++ )
    {
        if (iterator == rawByte.length || rawByte[iterator] == byteDelimitorFirstByte)
        {
            storIterator = iterator;
            if ( iterator != rawByte.length )
            {
                for ( int i=0 ; i < byteArray.length ; i++ )
                {
                    if ( rawByte[iterator] == byteArray[i] )
                    {
                        iterator++ ;
                        continue;
                    }
                    else
                    {
                        break;
                    }
                }
            }
            byte[] byteArrayExtracted = new byte[storIterator - lastIndex];
            System.arraycopy(rawByte, lastIndex, byteArrayExtracted, 0, 
                             storIterator - lastIndex);
            lastIndex = iterator ;
            tokens.add(byteArrayExtracted);
            byteArrayExtracted = null;
        }
    }
    for ( byte[] bytetoken : tokens )
    {
        System.out.println("Token received is: " + new String(bytetoken, "UTF-8"));
    }
    return tokens;
}

Has anyone faced a similar problem of tokenizing arrays? Please suggest if there is some other way to tokenize arrays.

Please note: I don't want to convert the byte stream to a String, tokenize it in String form and convert it back to bytes. That may have its own encoding problems.

asked Sep 19 '12 at 10:09 by user813063



1 Answer

If you use ISO-8859-1 then bytes are preserved as they were originally.
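
To see why the choice of charset matters, here is a minimal sketch that sends the same bytes through a String and back with two different charsets. The class name CharsetRoundTrip and the helper roundTrip are illustrative assumptions, not part of the code in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {

    // Hypothetical helper: decode the bytes to a String, then encode them again.
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // 0x80 and 0xFF are not valid UTF-8 on their own.
        byte[] data = {72, (byte) 0x80, (byte) 0xFF};

        // ISO-8859-1 maps every byte value to exactly one char and back.
        System.out.println(Arrays.equals(data, roundTrip(data, StandardCharsets.ISO_8859_1))); // true

        // UTF-8 decoding replaces malformed bytes with U+FFFD, so the bytes change.
        System.out.println(Arrays.equals(data, roundTrip(data, StandardCharsets.UTF_8))); // false
    }
}
```

With ISO-8859-1 the round trip is lossless, so the split can safely go through a String: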

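import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// ISO-8859-1 maps each byte to exactly one char, so bytes <-> String is lossless.
private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");

public static List<byte[]> splitMessageSept19(byte[] rawByte, String tokenDelimiter) {
    // LITERAL: treat the delimiter as plain text, not as a regex ('|' is a regex operator).
    Pattern pattern = Pattern.compile(tokenDelimiter, Pattern.LITERAL);
    // A limit of -1 keeps trailing empty tokens.
    String[] parts = pattern.split(new String(rawByte, ISO_8859_1), -1);
    List<byte[]> ret = new ArrayList<byte[]>();
    for (String part : parts) 
        ret.add(part.getBytes(ISO_8859_1));
    return ret;
}

public static void main(String... args) {
    // Build a String containing every byte value 0..255 and split it on ','.
    StringBuilder sb = new StringBuilder();
    for(int i=0;i<256;i++)
        sb.append((char) i);
    byte[] bytes = sb.toString().getBytes(ISO_8859_1);
    List<byte[]> list = splitMessageSept19(bytes, ",");
    for (byte[] b : list) 
        System.out.println(Arrays.toString(b));
}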

prints

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]
[45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, -128, -127, -126, -125, -124, -123, -122, -121, -120, -119, -118, -117, -116, -115, -114, -113, -112, -111, -110, -109, -108, -107, -106, -105, -104, -103, -102, -101, -100, -99, -98, -97, -96, -95, -94, -93, -92, -91, -90, -89, -88, -87, -86, -85, -84, -83, -82, -81, -80, -79, -78, -77, -76, -75, -74, -73, -72, -71, -70, -69, -68, -67, -66, -65, -64, -63, -62, -61, -60, -59, -58, -57, -56, -55, -54, -53, -52, -51, -50, -49, -48, -47, -46, -45, -44, -43, -42, -41, -40, -39, -38, -37, -36, -35, -34, -33, -32, -31, -30, -29, -28, -27, -26, -25, -24, -23, -22, -21, -20, -19, -18, -17, -16, -15, -14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1]

Calling

byte[] rawBytes = new byte[]{72,118,121,49,85,118,97,113,111,124,44,124,49,48,43,57,48,36,63,49,66,70,22,18,124,44,124,124,44,124,23,27,25,54,24,24,34,44,57,69,66,49,47,66,16,39,35,32,36,30,50,63,124,44,124,16,18,24,64,4,94,124,44,124,19,31,42,55,66,46,34,62,34,37,124,44,124,124,44,124};
List<byte[]> list = splitMessageSept19(rawBytes, "|,|");

produces

[72, 118, 121, 49, 85, 118, 97, 113, 111]
[49, 48, 43, 57, 48, 36, 63, 49, 66, 70, 22, 18]
[]
[23, 27, 25, 54, 24, 24, 34, 44, 57, 69, 66, 49, 47, 66, 16, 39, 35, 32, 36, 30, 50, 63]
[16, 18, 24, 64, 4, 94]
[19, 31, 42, 55, 66, 46, 34, 62, 34, 37]
[]
[]
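
For comparison, the same split can be done without going through a String at all, which is what the question asked for. The following is only a sketch under stated assumptions: the class ByteSplitter and the methods split and matchesAt are hypothetical names, and the delimiter is passed as a byte[] rather than a String. It scans for the whole delimiter at each position, resumes right after a match so that back-to-back delimiters yield empty tokens, and always adds the final (possibly empty) token.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ByteSplitter {

    // Splits data at every non-overlapping occurrence of delimiter, keeping empty tokens.
    public static List<byte[]> split(byte[] data, byte[] delimiter) {
        if (delimiter.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<byte[]> tokens = new ArrayList<>();
        int tokenStart = 0;
        int i = 0;
        // Only positions where a full delimiter still fits can match.
        while (i <= data.length - delimiter.length) {
            if (matchesAt(data, i, delimiter)) {
                tokens.add(Arrays.copyOfRange(data, tokenStart, i));
                i += delimiter.length;   // resume right after the delimiter
                tokenStart = i;
            } else {
                i++;
            }
        }
        // Whatever remains after the last delimiter is the final token (may be empty).
        tokens.add(Arrays.copyOfRange(data, tokenStart, data.length));
        return tokens;
    }

    // True if delimiter occurs in data starting at offset; caller guarantees it fits.
    private static boolean matchesAt(byte[] data, int offset, byte[] delimiter) {
        for (int j = 0; j < delimiter.length; j++) {
            if (data[offset + j] != delimiter[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] raw = {72, 124, 44, 124, 124, 44, 124, 49, 48, 124, 44, 124};
        byte[] delimiter = "|,|".getBytes(StandardCharsets.US_ASCII);
        for (byte[] token : split(raw, delimiter)) {
            System.out.println(Arrays.toString(token));
        }
    }
}
```

For the short array in main this prints [72], [], [49, 48] and [], one per line. Because the token boundaries are computed on the raw bytes, no charset is involved except for turning the delimiter text into bytes once.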

answered Oct 11 '22 at 17:10 by Peter Lawrey