For example, let's say I want to delete from the array every contiguous run of 0's that is longer than 3 bytes:
byte a[] = {1,2,3,0,1,2,3,0,0,0,0,4};
byte r[] = magic(a);
System.out.println(r);
Result:
{1,2,3,0,1,2,3,4}
I want to do something like a regular expression in Java, but on a byte array instead of a String.
Is there something built in that can help me (or a good third-party tool), or do I need to work from scratch?
Strings are UTF-16, so converting back and forth isn't a good idea? At least it's a lot of wasted overhead ... right?
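import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] a = {1,2,3,0,1,2,3,0,0,0,0,4};
// StandardCharsets avoids the checked UnsupportedEncodingException thrown by the charset-name overloads
String s0 = new String(a, StandardCharsets.ISO_8859_1);
String s1 = s0.replaceAll("\\x00{4,}", "");
byte[] r = s1.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(Arrays.toString(r)); // [1, 2, 3, 0, 1, 2, 3, 4]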
I used ISO-8859-1 (latin1) because, unlike any other encoding,
every byte in the range 0x00..0xFF
maps to a valid character, and
each of those characters has the same numeric value as its latin1 encoding.
That means the string is the same length as the original byte array, you can match any byte by its numeric value with the \xFF construct, and you can convert the resulting string back to a byte array without losing information.
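The round-trip claim above is easy to check. Here is a minimal sketch (the class and method names are made up for illustration) that pushes all 256 byte values through a charset and back. UTF-8 is included only as a contrast, since it replaces invalid byte sequences during decoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
class Latin1RoundTrip {
    // Returns an array holding every possible byte value, 0x00 through 0xFF.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // Decodes the bytes with the given charset, then encodes them again.
    static byte[] roundTrip(byte[] input, Charset charset) {
        return new String(input, charset).getBytes(charset);
    }

    // True if the decoded string has one char per byte and each char's
    // numeric value equals the unsigned value of the byte it came from.
    static boolean charsMatchByteValues(byte[] input) {
        String s = new String(input, StandardCharsets.ISO_8859_1);
        if (s.length() != input.length) {
            return false;
        }
        for (int i = 0; i < input.length; i++) {
            if (s.charAt(i) != (input[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.ISO_8859_1))); // true
        System.out.println(charsMatchByteValues(all));                                       // true
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.UTF_8)));      // false
    }
}
```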
I wouldn't try to display the data while it's in string form--although all the characters are valid, many of them are not printable. Also, avoid manipulating the data while it's in string form; you might accidentally do some escape-sequence substitutions or another encoding conversion without realizing it. In fact, I wouldn't recommend doing this kind of thing at all, but that isn't what you asked. :)
Also, be aware that this technique won't necessarily work in other programming languages or regex flavors. You would have to test each one individually.
Though I question whether regex is the right tool for the job, if you do want to use one I'd suggest you just implement a CharSequence wrapper on a byte array. Something like this (I just wrote this directly in, not compiled... but you get the idea).
import java.nio.charset.StandardCharsets;

public class ByteChars implements CharSequence {
    private final byte[] bytes;
    private final int strOfs;
    private final int endOfs;

    ByteChars(byte[] arr) {
        this(arr, 0, arr.length);
    }

    ByteChars(byte[] arr, int str, int end) {
        // check str and end are within range here
        strOfs = str;
        endOfs = end;
        bytes = arr;
    }

    public char charAt(int idx) {
        // check idx is within range here
        // mask with 0xFF so negative bytes map to chars 0x80..0xFF
        return (char) (bytes[strOfs + idx] & 0xFF);
    }

    public int length() {
        return endOfs - strOfs;
    }

    public CharSequence subSequence(int str, int end) {
        // check str and end are within range here
        return new ByteChars(bytes, strOfs + str, strOfs + end);
    }

    public String toString() {
        // StandardCharsets avoids the checked UnsupportedEncodingException
        return new String(bytes, strOfs, endOfs - strOfs, StandardCharsets.ISO_8859_1);
    }
}
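To show how a wrapper like this could be plugged into java.util.regex, here is a rough, self-contained sketch. The names ByteRegexDemo, ByteSeq and removeMatches are invented for the example, and range checks are left out for brevity. It runs a Matcher over the byte view and copies only the unmatched bytes, so no intermediate String is built.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are illustrative only.
class ByteRegexDemo {
    // Minimal read-only CharSequence view over part of a byte array.
    static final class ByteSeq implements CharSequence {
        private final byte[] bytes;
        private final int start;
        private final int end;

        ByteSeq(byte[] bytes, int start, int end) {
            this.bytes = bytes;
            this.start = start;
            this.end = end;
        }

        @Override
        public char charAt(int idx) {
            // Mask so that negative bytes become chars 0x80..0xFF.
            return (char) (bytes[start + idx] & 0xFF);
        }

        @Override
        public int length() {
            return end - start;
        }

        @Override
        public CharSequence subSequence(int from, int to) {
            return new ByteSeq(bytes, start + from, start + to);
        }

        @Override
        public String toString() {
            return new String(bytes, start, end - start, StandardCharsets.ISO_8859_1);
        }
    }

    // Returns a copy of data with every match of the pattern removed.
    static byte[] removeMatches(byte[] data, Pattern pattern) {
        Matcher m = pattern.matcher(new ByteSeq(data, 0, data.length));
        ByteArrayOutputStream out = new ByteArrayOutputStream(data.length);
        int copyFrom = 0;
        while (m.find()) {
            // Copy the unmatched bytes that precede this match, then skip the match.
            out.write(data, copyFrom, m.start() - copyFrom);
            copyFrom = m.end();
        }
        out.write(data, copyFrom, data.length - copyFrom);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        byte[] r = removeMatches(a, Pattern.compile("\\x00{4,}"));
        System.out.println(Arrays.toString(r)); // [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```

Because match offsets are the same as byte offsets, m.start() and m.end() can be used directly as indexes into the original array.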
Regex is not the right tool for this job; you will need to implement it from scratch.
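For what it's worth, a from-scratch version is fairly short. This is just one possible sketch, with ZeroRunFilter, dropLongZeroRuns and the maxRun parameter being names I made up: it scans the array once, measures each run of zeros, and keeps the run only if it is no longer than maxRun.

```java
import java.util.Arrays;

// Hypothetical helper class; names are illustrative only.
class ZeroRunFilter {
    // Returns a copy of data without any run of zero bytes longer than maxRun.
    static byte[] dropLongZeroRuns(byte[] data, int maxRun) {
        byte[] out = new byte[data.length]; // zero-filled by default
        int n = 0;                          // number of bytes kept so far
        int i = 0;
        while (i < data.length) {
            if (data[i] != 0) {
                out[n++] = data[i++];
                continue;
            }
            // Find the end of the current run of zeros.
            int runEnd = i;
            while (runEnd < data.length && data[runEnd] == 0) {
                runEnd++;
            }
            int runLen = runEnd - i;
            if (runLen <= maxRun) {
                // Short run: keep it. The slots in out are already zero,
                // so advancing the write position is enough.
                n += runLen;
            }
            i = runEnd;
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(dropLongZeroRuns(a, 3))); // [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```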
I don't see how regex would be useful for what you want. One thing you can do is use run-length encoding on the byte array, drop every encoded run of 0's whose count is greater than three (for example "40", read four 0's), and decode the result. Wikipedia has a simple Java implementation of it.
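As a rough illustration of the run-length idea (this is not the Wikipedia implementation; the class and method names are invented, and runs are held as {value, count} pairs rather than as a string), the encode, filter and decode steps might look like this:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; names are illustrative only.
class RunLengthFilter {
    // Encodes data as a list of {value, count} pairs, one pair per run.
    static List<int[]> encode(byte[] data) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < data.length) {
            int j = i;
            while (j < data.length && data[j] == data[i]) {
                j++;
            }
            runs.add(new int[] {data[i], j - i});
            i = j;
        }
        return runs;
    }

    // Expands {value, count} pairs back into a byte array.
    static byte[] decode(List<int[]> runs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int[] run : runs) {
            for (int k = 0; k < run[1]; k++) {
                out.write(run[0]); // write(int) stores only the low 8 bits
            }
        }
        return out.toByteArray();
    }

    // Drops every run of zeros whose count is greater than maxRun.
    static byte[] dropLongZeroRuns(byte[] data, int maxRun) {
        List<int[]> runs = encode(data);
        runs.removeIf(run -> run[0] == 0 && run[1] > maxRun);
        return decode(runs);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(dropLongZeroRuns(a, 3))); // [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```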