I want to find a specific sequence of bytes in a binary file using PHP. I wrote the sequence in hexadecimal to avoid typing too many 0s and 1s. The sequence to find is 0x4749524f. This is the working solution I have come up with so far:
$mysequence = "4749524f";
$f = fopen($filename, "r") or die("Unable to open file!");
while(!feof($f)) {
    $seq = fread($f, 4);
    if(bin2hex($seq) == $mysequence) {
        echo "found!";
        break;
    }
    else if(!feof($f)) fseek($f, -3, SEEK_CUR);
}
What the algorithm does is simple:
Why do I go back 3 Bytes? Because if this is the content of the file:
0000 4749 524f 0000 01b0 0013
If I don't go back 3 bytes, I will read 0000 4749 on the first iteration, 524f 0000 on the second, and 01b0 0013 on the third, and as you can see I would miss the sequence.
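To make the point concrete, here is a minimal sketch that compares the two reading strategies on the sample bytes above. It works on an in-memory string instead of a file, and the variable names are only illustrative:

```php
<?php
// In-memory stand-in for the sample file content shown above (12 bytes).
$data   = hex2bin('00004749524f000001b00013');
$needle = hex2bin('4749524f');

// Non-overlapping 4-byte reads: 0000 4749 | 524f 0000 | 01b0 0013
$chunks = str_split($data, 4);
$foundInChunks = in_array($needle, $chunks, true);

// Sliding window that advances one byte at a time
// (the equivalent of "read 4 bytes, then go back 3").
$foundSliding = false;
for ($i = 0; $i + 4 <= strlen($data); $i++) {
    if (substr($data, $i, 4) === $needle) {
        $foundSliding = true;
        break;
    }
}

var_dump($foundInChunks); // bool(false): the sequence straddles two reads
var_dump($foundSliding);  // bool(true): found at byte offset 2
```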
Problem: it's painfully slow. The application will have to work with files of up to 50 MB, so it will take forever to find this sequence.
Is there an optimized function in PHP that would do the job? Is there a faster (not dumb like mine) way to do this?
First of all, your $mysequence does not change during the search, so you can call hex2bin($mysequence) once and compare the result with $seq directly.
To make it really faster, you can read the file in large buffers and search each buffer for the byte string. A larger buffer means a faster search, but more memory is needed. A quick draft of how this could look:
$mysequence = "4749524f";
$searchBytes = hex2bin($mysequence);
$crossing = 1 - strlen($searchBytes); // -(length - 1): negative offset for substr(); see below
$seq = ''; $buflen = 10000;
$f = fopen($filename, "rb") or die("Unable to open file!"); // "b": binary-safe on every platform
while(!feof($f))
{
    $seq .= fread($f, $buflen);
    if(strpos($seq, $searchBytes) === false) // strict comparison here: strpos returns 0 for a match at the start
    {
        // keep the last n-1 bytes, because they can be the beginning of the required sequence
        $seq = substr($seq, $crossing);
    }
    else
    {
        echo "found!";
        break;
    }
}
fclose($f);
unset($seq); // no need to keep this in memory any more
Reads from disk are slow, and you can't count on disk caching, since that is up to the OS. Instead, do your own "caching", as it were. Read in a large block of bytes, something like 1 MB (or more), to reduce the number of disk reads, then search that block in memory. When reading the next 1 MB, be sure to prepend the last 3 bytes of the previous block (one less than the length of the sequence), so that a match straddling two blocks is not missed. Search each block until the sequence is found. The actual size of your read will need to be a balance between RAM usage and the number of disk reads.
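The approach described above could be sketched roughly as follows. This is only an illustration under some assumptions: findSequenceOffset is a hypothetical helper name (not a PHP built-in), the 1 MB default block size is just a starting point to tune, and returning the byte offset instead of echoing "found!" is a design choice made for this example.

```php
<?php
/**
 * Hypothetical helper: returns the byte offset of the first occurrence of
 * $needle in the file, or null if it does not occur.
 */
function findSequenceOffset(string $filename, string $needle, int $chunkSize = 1048576): ?int
{
    if ($needle === '') {
        return null; // nothing to search for
    }
    $overlap = strlen($needle) - 1; // 3 bytes for a 4-byte sequence

    $f = fopen($filename, 'rb');
    if ($f === false) {
        throw new RuntimeException("Unable to open $filename");
    }

    $tail   = '';   // last $overlap bytes carried over from the previous block
    $base   = 0;    // file offset of the first byte of $tail
    $result = null;

    while (($chunk = fread($f, $chunkSize)) !== false && $chunk !== '') {
        $buffer = $tail . $chunk;
        $pos = strpos($buffer, $needle); // strpos is binary-safe
        if ($pos !== false) {
            $result = $base + $pos;
            break;
        }
        // Keep only the bytes that could still start a match in the next block.
        $keep = min($overlap, strlen($buffer));
        $tail = $keep > 0 ? substr($buffer, -$keep) : '';
        $base += strlen($buffer) - $keep;
    }

    fclose($f);
    return $result;
}
```

It could then be called as findSequenceOffset($filename, hex2bin('4749524f')), which gives null when the sequence is absent. For files of around 50 MB, loading everything with file_get_contents() and calling strpos() once may also be acceptable if the memory limit allows it; the block-wise version simply keeps memory usage bounded.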