
Searching text in binary data

Tags: performance, c

I have binary data that contains some text. The text is known in advance. What would be a fast way to search for it?

For example:

This is text 1---
!@##$%%#^%&!%^$! <= Assume this line is 3 MB of binary data
Now, This is text 2 ---
!@##$%%#^%&!%^$! <= Assume this line is 2.5 MB of binary data
This is text 3 ---

How can I search for the text "This is text 2"?

Currently I'm doing it like this:

size_t count = 0;
size_t s_len = strlen("This is text 2");

// Assume data (char *) points to the start of the buffer to search and data_len is its length in bytes.
// Stop once fewer than s_len bytes remain, so memcmp never reads past the end of the buffer.
for (; count + s_len <= data_len; ++count)
{
    if (!memcmp("This is text 2", data + count, s_len))
    {
        printf("%s\n", "Hurray found you...");
    }
}
  • Is there a more efficient way to do this?
  • Would replacing the ++count step with a memchr() search for 'T' help? (Please ignore this if it is not clear.)
  • What is the average-case big-O complexity of memchr?
Mayank asked May 26 '11 08:05



1 Answer

There's nothing in standard C to help you, but there is a GNU extension memmem() that does this:

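#define _GNU_SOURCE   /* must come before the includes so glibc declares memmem() */
#include <string.h>

#define TEXT2 "This is text 2"

/* sizeof(TEXT2) - 1 leaves out the terminating NUL, which is not part of the text to find. */
char *pos = memmem(data, data_len, TEXT2, sizeof(TEXT2) - 1);

if (pos != NULL)
{
    /* Found it: pos points to the first match. */
}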

If you need to be portable to systems that don't have this, you could take the glibc implementation of memmem() and incorporate it into your program.
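If pulling in the glibc source is not an option, a small fallback can be written with standard C only. The following is a minimal sketch, not the glibc algorithm: it uses memchr to jump to each candidate first byte and memcmp to verify the rest, which is the memchr('T') idea from the question. The name find_bytes and its parameters are hypothetical, chosen here for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a standard or glibc function): returns a pointer
   to the first occurrence of needle in haystack, or NULL if there is none.
   Both buffers are treated as raw bytes, so embedded NULs are fine. */
const void *find_bytes(const void *haystack, size_t haystack_len,
                       const void *needle, size_t needle_len)
{
    const unsigned char *h = haystack;
    const unsigned char *n = needle;

    if (needle_len == 0)
        return haystack;              /* an empty needle matches at the start */
    if (needle_len > haystack_len)
        return NULL;                  /* the needle cannot fit anywhere */

    /* Last position at which the whole needle still fits in the buffer. */
    const unsigned char *last = h + (haystack_len - needle_len);

    while (h <= last) {
        /* Let memchr skip ahead to the next byte equal to the needle's first byte. */
        h = memchr(h, n[0], (size_t)(last - h) + 1);
        if (h == NULL)
            return NULL;
        /* Candidate found: compare the full needle at this position. */
        if (memcmp(h, n, needle_len) == 0)
            return h;
        ++h;                          /* false alarm, resume after this byte */
    }
    return NULL;
}
```

It would be called the same way as memmem(), for example find_bytes(data, data_len, TEXT2, sizeof(TEXT2) - 1). The worst case is still proportional to data_len times the needle length, but when the first byte of the text is rare in the binary data, most of the work is done inside memchr, which library implementations usually optimize well.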

caf answered Sep 26 '22 15:09