Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scanning big binary with Erlang

I like to scan larger(>500M) binary files for structs/patterns. I am new to the language an hope that someone can give me start. Actually the files are a database containing Segments. A Segment starts with a fixed sized header followed by a fixed sized optional part followed by the payload/data part of variable lenght. For a first test i just like to log the number of segments in the file. I googled already for some tutorial but found nothing which helped. I need a hint or a tutorial which is not too far from my use case to get started.

Greets Stefan

like image 736
Stefan Gebhardt Avatar asked Jun 13 '12 11:06

Stefan Gebhardt


2 Answers

you need to learn about Bit Syntax and Binary Comprehensions. More useful links to follow: http://www.erlang.org/documentation/doc-5.6/doc/programming_examples/bit_syntax.html, and http://goto0.cubelogic.org/a/90.

You will also need to learn how to process files, reading from files (line-by-line, chunk-by-chunk, at given positions in a file, e.t.c.), writing to files in several ways. The file processing functions are explained here

You can also choose to look at the source code of large file processing libraries within the erlang packages e.g. Disk Log, Dets and mnesia. These libraries heavily read and write into files and their source code is open for you to see.

I hope that helps

like image 171
Muzaaya Joshua Avatar answered Nov 09 '22 02:11

Muzaaya Joshua


Here is a synthesized sample problem: I have a binary file (test.txt) that I want to parse. I want to find all the binary patterns of <<$a, $b, $c>> in the file.

The content of "test.txt":

I arbitrarily decide to choose the string "abc" as my target string for my test. I want to find all the abc's in my testing file.

A sample program (lab.erl):

-module(lab).
-compile(export_all).

find(BinPattern, InputFile) ->
    BinPatternLength = length(binary_to_list(BinPattern)),
    {ok, S} = file:open(InputFile, [read, binary, raw]),
    loop(S, BinPattern, 0, BinPatternLength, 0),
    file:close(S),
    io:format("Done!~n", []).

loop(S, BinPattern, StartPos, Length, Acc) ->
    case file:pread(S, StartPos, Length) of
    {ok, Bin} ->
        case Bin of
        BinPattern ->
            io:format("Found one at position: ~p.~n", [StartPos]),
            loop(S, BinPattern, StartPos + 1, Length, Acc + 1);
        _ ->
            loop(S, BinPattern, StartPos + 1, Length, Acc)
        end;
    eof ->
        io:format("I've proudly found ~p matches:)~n", [Acc])
    end.

Run it:

1> c(lab).
{ok,lab}
2> lab:find(<<"abc">>, "./test.txt").     
Found one at position: 43.
Found one at position: 103.
I've proudly found 2 matches:)
Done!
ok

Note that the code above is not very efficient (the scanning process shifts one byte at a time) and it is sequential (not utilizing all the "cores" on your computer). It is meant only to get you started.

like image 1
Ning Avatar answered Nov 09 '22 02:11

Ning