
Java: remove continuous segment of zeros from byte array

Tags: java, arrays, regex

For example, let's say I want to delete from the array all continuous segments of 0's longer than 3 bytes

byte a[] = {1,2,3,0,1,2,3,0,0,0,0,4};
byte r[] = magic(a);
System.out.println(Arrays.toString(r));

result

{1,2,3,0,1,2,3,4}

I want to do something like a regular expression in Java, but on a byte array instead of a String.

Is there something built in that can help me (or a good third-party tool), or do I need to write this from scratch?

Strings are UTF-16, so converting back and forth isn't a good idea? At least it's a lot of wasted overhead ... right?

Mike asked Sep 06 '09 23:09


4 Answers

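import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] a = {1,2,3,0,1,2,3,0,0,0,0,4};
// Decode as latin1: each byte becomes one char with the same numeric value.
String s0 = new String(a, StandardCharsets.ISO_8859_1);
// Remove every run of four or more NUL characters.
String s1 = s0.replaceAll("\\x00{4,}", "");
byte[] r = s1.getBytes(StandardCharsets.ISO_8859_1);

System.out.println(Arrays.toString(r)); // [1, 2, 3, 0, 1, 2, 3, 4]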

I used ISO-8859-1 (latin1) because, unlike any other encoding,

  • every byte in the range 0x00..0xFF maps to a valid character, and

  • each of those characters has the same numeric value as its latin1 encoding.

That means the string is the same length as the original byte array, you can match any byte by its numeric value with the \xFF construct, and you can convert the resulting string back to a byte array without losing information.
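The lossless round trip described above can be checked directly. This is only a minimal sketch: the class name `Latin1RoundTrip` and its two helper methods are made up for illustration, and it assumes Java 7 or later for `StandardCharsets`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1RoundTrip {
    // Builds an array holding every possible byte value, 0x00 through 0xFF.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // True if decoding as ISO-8859-1 yields one char per byte with the same
    // numeric value, and re-encoding gives back the original bytes.
    static boolean roundTripsExactly(byte[] data) {
        String s = new String(data, StandardCharsets.ISO_8859_1);
        if (s.length() != data.length) {
            return false;
        }
        for (int i = 0; i < data.length; i++) {
            if (s.charAt(i) != (data[i] & 0xFF)) {
                return false;
            }
        }
        return Arrays.equals(data, s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println(roundTripsExactly(allByteValues())); // prints true
    }
}
```

Swapping in a multi-byte charset such as UTF-8 should make the same check fail, because not every byte sequence is valid UTF-8.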

I wouldn't try to display the data while it's in string form--although all the characters are valid, many of them are not printable. Also, avoid manipulating the data while it's in string form; you might accidentally do some escape-sequence substitutions or another encoding conversion without realizing it. In fact, I wouldn't recommend doing this kind of thing at all, but that isn't what you asked. :)

Also, be aware that this technique won't necessarily work in other programming languages or regex flavors. You would have to test each one individually.

Alan Moore answered Nov 19 '22 02:11


Though I question whether a regex is the right tool for the job, if you do want to use one I'd suggest implementing a CharSequence wrapper around the byte array. Something like this (I wrote it directly here without compiling it, but you get the idea):

public class ByteChars 
implements CharSequence

...

ByteChars(byte[] arr) {
    this(arr,0,arr.length);
    }

ByteChars(byte[] arr, int str, int end) {
    //check str and end are within range here
    strOfs=str;
    endOfs=end;
    bytes=arr;
    }

public char charAt(int idx) { 
    //check idx is within range here
    return (char)(bytes[strOfs+idx]&0xFF); 
    }

public int length() { 
    return (endOfs-strOfs); 
    }

public CharSequence subSequence(int str, int end) { 
    //check str and end are within range here
    return new ByteChars(bytes,strOfs+str,strOfs+end); 
    }

public String toString() { 
    return new String(bytes,strOfs,(endOfs-strOfs),StandardCharsets.ISO_8859_1); // needs import java.nio.charset.StandardCharsets
    }
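
For reference, here is one way the wrapper above might look once the elided parts are filled in and it is handed to a regex. Treat it as a sketch under assumptions rather than the answerer's exact code: the field declarations, the bounds checks, and the `stripZeroRuns` helper are additions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

final class ByteChars implements CharSequence {
    private final byte[] bytes;
    private final int strOfs;
    private final int endOfs;

    ByteChars(byte[] arr) {
        this(arr, 0, arr.length);
    }

    ByteChars(byte[] arr, int str, int end) {
        if (str < 0 || end > arr.length || str > end) {
            throw new IndexOutOfBoundsException("range " + str + ".." + end);
        }
        bytes = arr;
        strOfs = str;
        endOfs = end;
    }

    @Override
    public char charAt(int idx) {
        if (idx < 0 || idx >= length()) {
            throw new IndexOutOfBoundsException("index " + idx);
        }
        // Mask so that negative bytes map to chars 0x80..0xFF.
        return (char) (bytes[strOfs + idx] & 0xFF);
    }

    @Override
    public int length() {
        return endOfs - strOfs;
    }

    @Override
    public CharSequence subSequence(int str, int end) {
        if (str < 0 || end > length() || str > end) {
            throw new IndexOutOfBoundsException("range " + str + ".." + end);
        }
        return new ByteChars(bytes, strOfs + str, strOfs + end);
    }

    @Override
    public String toString() {
        return new String(bytes, strOfs, endOfs - strOfs, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: removes runs of four or more zero bytes by
    // matching directly against the wrapper, without an up-front String copy.
    static byte[] stripZeroRuns(byte[] input) {
        String cleaned = Pattern.compile("\\x00{4,}")
                .matcher(new ByteChars(input))
                .replaceAll("");
        return cleaned.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(stripZeroRuns(a)));
    }
}
```

Note that `replaceAll` still builds a String for its result, so the wrapper only avoids the initial decoding copy of the input.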

Lawrence Dol answered Nov 19 '22 01:11


Regex is not the right tool for this job; you will need to implement it from scratch instead.
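
A from-scratch version could be as small as the following sketch. The class name `ZeroRunFilter`, the method `removeZeroRuns`, and the `maxRun` parameter are assumptions chosen for illustration; with `maxRun` set to 3 it matches the question's "longer than 3 bytes" rule.

```java
import java.util.Arrays;

final class ZeroRunFilter {
    // Returns a copy of input in which every run of zero bytes longer than
    // maxRun has been removed; shorter runs are kept unchanged.
    static byte[] removeZeroRuns(byte[] input, int maxRun) {
        byte[] out = new byte[input.length];
        int outLen = 0;
        int i = 0;
        while (i < input.length) {
            if (input[i] != 0) {
                out[outLen++] = input[i++];
                continue;
            }
            // Find the end of the run of zeros that starts at i.
            int runEnd = i;
            while (runEnd < input.length && input[runEnd] == 0) {
                runEnd++;
            }
            int runLen = runEnd - i;
            if (runLen <= maxRun) {
                // Short run: copy the zeros through.
                for (int k = 0; k < runLen; k++) {
                    out[outLen++] = 0;
                }
            }
            i = runEnd;
        }
        // Trim the buffer to the number of bytes actually written.
        return Arrays.copyOf(out, outLen);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(removeZeroRuns(a, 3)));
        // prints [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```

This makes a single pass over the input and never converts the bytes to characters.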

objects answered Nov 19 '22 00:11


I don't see how regex would be useful to do what you want. One thing you can do is use run-length encoding to encode the byte array, remove every run of 0's longer than three from the encoded form (for example "40", read four 0's), and decode the result. Wikipedia has a simple Java implementation of it.
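
The run-length idea can be sketched as below. This is an illustration under assumptions, not the Wikipedia implementation: it keeps the runs as a list of (value, count) pairs instead of an encoded string, and the names `RunLengthFilter`, `Run`, and `removeLongZeroRuns` are invented here. It assumes Java 8 or later for `removeIf`.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RunLengthFilter {
    // One run: 'count' consecutive copies of 'value'.
    static final class Run {
        final byte value;
        final int count;

        Run(byte value, int count) {
            this.value = value;
            this.count = count;
        }
    }

    // Splits the input into maximal runs of equal bytes.
    static List<Run> encode(byte[] input) {
        List<Run> runs = new ArrayList<>();
        int i = 0;
        while (i < input.length) {
            int j = i;
            while (j < input.length && input[j] == input[i]) {
                j++;
            }
            runs.add(new Run(input[i], j - i));
            i = j;
        }
        return runs;
    }

    // Expands the runs back into a flat byte array.
    static byte[] decode(List<Run> runs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Run r : runs) {
            for (int k = 0; k < r.count; k++) {
                out.write(r.value);
            }
        }
        return out.toByteArray();
    }

    // Encodes, drops zero runs longer than maxRun, then decodes.
    static byte[] removeLongZeroRuns(byte[] input, int maxRun) {
        List<Run> runs = encode(input);
        runs.removeIf(r -> r.value == 0 && r.count > maxRun);
        return decode(runs);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(removeLongZeroRuns(a, 3)));
    }
}
```

Filtering on the run count rather than on a literal like "30" covers runs of any length above the threshold.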

João Silva answered Nov 19 '22 00:11